[
  {
    "path": ".clang-format",
    "content": "# Google C/C++ Code Style settings (with 4-space)\n# Refered to https://github.com/kehanXue/google-style-clang-format/blob/master/.clang-format\n\nLanguage: Cpp\nBasedOnStyle: Google\nAccessModifierOffset: -1\nAlignAfterOpenBracket: Align\nAlignConsecutiveAssignments: None\nAlignOperands: Align\nAllowAllArgumentsOnNextLine: true\nAllowAllConstructorInitializersOnNextLine: true\nAllowAllParametersOfDeclarationOnNextLine: false\nAllowShortBlocksOnASingleLine: Empty\nAllowShortCaseLabelsOnASingleLine: false\nAllowShortFunctionsOnASingleLine: Inline\nAllowShortIfStatementsOnASingleLine: Never  # To avoid conflict, set this \"Never\" and each \"if statement\" should include brace when coding\nAllowShortLambdasOnASingleLine: Inline\nAllowShortLoopsOnASingleLine: false\nAlwaysBreakAfterReturnType: None\nAlwaysBreakTemplateDeclarations: Yes\nBinPackArguments: true\nBreakBeforeBraces: Custom\nBraceWrapping:\n  AfterCaseLabel: false\n  AfterClass: false\n  AfterStruct: false\n  AfterControlStatement: Never\n  AfterEnum: false\n  AfterFunction: false\n  AfterNamespace: false\n  AfterUnion: false\n  AfterExternBlock: false\n  BeforeCatch: false\n  BeforeElse: false\n  BeforeLambdaBody: false\n  IndentBraces: false\n  SplitEmptyFunction: false\n  SplitEmptyRecord: false\n  SplitEmptyNamespace: false\nBreakBeforeBinaryOperators: None\nBreakBeforeTernaryOperators: true\nBreakConstructorInitializers: BeforeColon\nBreakInheritanceList: BeforeColon\nColumnLimit: 120\nCompactNamespaces: false\nContinuationIndentWidth: 8\nCpp11BracedListStyle: true\nDerivePointerAlignment: false  # Make sure the * or & align on the left\nEmptyLineBeforeAccessModifier: LogicalBlock\nFixNamespaceComments: true\nIncludeBlocks: Preserve\nIndentCaseLabels: true\nIndentPPDirectives: None\nIndentWidth: 4\nKeepEmptyLinesAtTheStartOfBlocks: true\nMaxEmptyLinesToKeep: 1\nNamespaceIndentation: None\nObjCSpaceAfterProperty: false\nObjCSpaceBeforeProtocolList: true\nPointerAlignment: 
Left\nReflowComments: false\n# SeparateDefinitionBlocks: Always   # Only support since clang-format 14\nSpaceAfterCStyleCast: false\nSpaceAfterLogicalNot: false\nSpaceAfterTemplateKeyword: true\nSpaceBeforeAssignmentOperators: true\nSpaceBeforeCpp11BracedList: false\nSpaceBeforeCtorInitializerColon: true\nSpaceBeforeInheritanceColon: true\nSpaceBeforeParens: ControlStatements\nSpaceBeforeRangeBasedForLoopColon: true\nSpaceBeforeSquareBrackets: false\nSpaceInEmptyParentheses: false\nSpacesBeforeTrailingComments: 2\nSpacesInAngles: false\nSpacesInCStyleCastParentheses: false\nSpacesInContainerLiterals: false\nSpacesInParentheses: false\nSpacesInSquareBrackets: false\nStandard: c++11\nTabWidth: 8\nUseTab: Never\n"
  },
  {
    "path": ".cmake-format.yaml",
    "content": "_help_parse: Options affecting listfile parsing\nparse:\n  _help_additional_commands:\n  - Specify structure for custom cmake functions\n  additional_commands:\n    foo:\n      flags:\n      - BAR\n      - BAZ\n      kwargs:\n        HEADERS: '*'\n        SOURCES: '*'\n        DEPENDS: '*'\n  _help_override_spec:\n  - Override configurations per-command where available\n  override_spec: {}\n  _help_vartags:\n  - Specify variable tags.\n  vartags: []\n  _help_proptags:\n  - Specify property tags.\n  proptags: []\n_help_format: Options affecting formatting.\nformat:\n  _help_disable:\n  - Disable formatting entirely, making cmake-format a no-op\n  disable: false\n  _help_line_width:\n  - How wide to allow formatted cmake files\n  line_width: 80\n  _help_tab_size:\n  - How many spaces to tab for indent\n  tab_size: 2\n  _help_use_tabchars:\n  - If true, lines are indented using tab characters (utf-8\n  - 0x09) instead of <tab_size> space characters (utf-8 0x20).\n  - In cases where the layout would require a fractional tab\n  - character, the behavior of the  fractional indentation is\n  - governed by <fractional_tab_policy>\n  use_tabchars: false\n  _help_fractional_tab_policy:\n  - If <use_tabchars> is True, then the value of this variable\n  - indicates how fractional indentions are handled during\n  - whitespace replacement. If set to 'use-space', fractional\n  - indentation is left as spaces (utf-8 0x20). 
If set to\n  - '`round-up` fractional indentation is replaced with a single'\n  - tab character (utf-8 0x09) effectively shifting the column\n  - to the next tabstop\n  fractional_tab_policy: use-space\n  _help_max_subgroups_hwrap:\n  - If an argument group contains more than this many sub-groups\n  - (parg or kwarg groups) then force it to a vertical layout.\n  max_subgroups_hwrap: 2\n  _help_max_pargs_hwrap:\n  - If a positional argument group contains more than this many\n  - arguments, then force it to a vertical layout.\n  max_pargs_hwrap: 6\n  _help_max_rows_cmdline:\n  - If a cmdline positional group consumes more than this many\n  - lines without nesting, then invalidate the layout (and nest)\n  max_rows_cmdline: 2\n  _help_separate_ctrl_name_with_space:\n  - If true, separate flow control names from their parentheses\n  - with a space\n  separate_ctrl_name_with_space: false\n  _help_separate_fn_name_with_space:\n  - If true, separate function names from parentheses with a\n  - space\n  separate_fn_name_with_space: false\n  _help_dangle_parens:\n  - If a statement is wrapped to more than one line, than dangle\n  - the closing parenthesis on its own line.\n  dangle_parens: false\n  _help_dangle_align:\n  - If the trailing parenthesis must be 'dangled' on its on\n  - 'line, then align it to this reference: `prefix`: the start'\n  - 'of the statement,  `prefix-indent`: the start of the'\n  - 'statement, plus one indentation  level, `child`: align to'\n  - the column of the arguments\n  dangle_align: prefix\n  _help_min_prefix_chars:\n  - If the statement spelling length (including space and\n  - parenthesis) is smaller than this amount, then force reject\n  - nested layouts.\n  min_prefix_chars: 4\n  _help_max_prefix_chars:\n  - If the statement spelling length (including space and\n  - parenthesis) is larger than the tab width by more than this\n  - amount, then force reject un-nested layouts.\n  max_prefix_chars: 10\n  _help_max_lines_hwrap:\n  - If a 
candidate layout is wrapped horizontally but it exceeds\n  - this many lines, then reject the layout.\n  max_lines_hwrap: 2\n  _help_line_ending:\n  - What style line endings to use in the output.\n  line_ending: unix\n  _help_command_case:\n  - Format command names consistently as 'lower' or 'upper' case\n  command_case: canonical\n  _help_keyword_case:\n  - Format keywords consistently as 'lower' or 'upper' case\n  keyword_case: unchanged\n  _help_always_wrap:\n  - A list of command names which should always be wrapped\n  always_wrap: []\n  _help_enable_sort:\n  - If true, the argument lists which are known to be sortable\n  - will be sorted lexicographicall\n  enable_sort: true\n  _help_autosort:\n  - If true, the parsers may infer whether or not an argument\n  - list is sortable (without annotation).\n  autosort: false\n  _help_require_valid_layout:\n  - By default, if cmake-format cannot successfully fit\n  - everything into the desired linewidth it will apply the\n  - last, most agressive attempt that it made. If this flag is\n  - True, however, cmake-format will print error, exit with non-\n  - zero status code, and write-out nothing\n  require_valid_layout: false\n  _help_layout_passes:\n  - A dictionary mapping layout nodes to a list of wrap\n  - decisions. See the documentation for more information.\n  layout_passes: {}\n_help_markup: Options affecting comment reflow and formatting.\nmarkup:\n  _help_bullet_char:\n  - What character to use for bulleted lists\n  bullet_char: '*'\n  _help_enum_char:\n  - What character to use as punctuation after numerals in an\n  - enumerated list\n  enum_char: .\n  _help_first_comment_is_literal:\n  - If comment markup is enabled, don't reflow the first comment\n  - block in each listfile. 
Use this to preserve formatting of\n  - your copyright/license statements.\n  first_comment_is_literal: false\n  _help_literal_comment_pattern:\n  - If comment markup is enabled, don't reflow any comment block\n  - which matches this (regex) pattern. Default is `None`\n  - (disabled).\n  literal_comment_pattern: null\n  _help_fence_pattern:\n  - Regular expression to match preformat fences in comments\n  - default= ``r'^\\s*([`~]{3}[`~]*)(.*)$'``\n  fence_pattern: ^\\s*([`~]{3}[`~]*)(.*)$\n  _help_ruler_pattern:\n  - Regular expression to match rulers in comments default=\n  - '``r''^\\s*[^\\w\\s]{3}.*[^\\w\\s]{3}$''``'\n  ruler_pattern: ^\\s*[^\\w\\s]{3}.*[^\\w\\s]{3}$\n  _help_explicit_trailing_pattern:\n  - If a comment line matches starts with this pattern then it\n  - is explicitly a trailing comment for the preceeding\n  - argument. Default is '#<'\n  explicit_trailing_pattern: '#<'\n  _help_hashruler_min_length:\n  - If a comment line starts with at least this many consecutive\n  - hash characters, then don't lstrip() them off. 
This allows\n  - for lazy hash rulers where the first hash char is not\n  - separated by space\n  hashruler_min_length: 10\n  _help_canonicalize_hashrulers:\n  - If true, then insert a space between the first hash char and\n  - remaining hash chars in a hash ruler, and normalize its\n  - length to fill the column\n  canonicalize_hashrulers: true\n  _help_enable_markup:\n  - enable comment markup parsing and reflow\n  enable_markup: true\n_help_lint: Options affecting the linter\nlint:\n  _help_disabled_codes:\n  - a list of lint codes to disable\n  disabled_codes: []\n  _help_function_pattern:\n  - regular expression pattern describing valid function names\n  function_pattern: '[0-9a-z_]+'\n  _help_macro_pattern:\n  - regular expression pattern describing valid macro names\n  macro_pattern: '[0-9A-Z_]+'\n  _help_global_var_pattern:\n  - regular expression pattern describing valid names for\n  - variables with global (cache) scope\n  global_var_pattern: '[A-Z][0-9A-Z_]+'\n  _help_internal_var_pattern:\n  - regular expression pattern describing valid names for\n  - variables with global scope (but internal semantic)\n  internal_var_pattern: _[A-Z][0-9A-Z_]+\n  _help_local_var_pattern:\n  - regular expression pattern describing valid names for\n  - variables with local scope\n  local_var_pattern: '[a-z][a-z0-9_]+'\n  _help_private_var_pattern:\n  - regular expression pattern describing valid names for\n  - privatedirectory variables\n  private_var_pattern: _[0-9a-z_]+\n  _help_public_var_pattern:\n  - regular expression pattern describing valid names for public\n  - directory variables\n  public_var_pattern: '[A-Z][0-9A-Z_]+'\n  _help_argument_var_pattern:\n  - regular expression pattern describing valid names for\n  - function/macro arguments and loop variables.\n  argument_var_pattern: '[a-z][a-z0-9_]+'\n  _help_keyword_pattern:\n  - regular expression pattern describing valid names for\n  - keywords used in functions or macros\n  keyword_pattern: 
'[A-Z][0-9A-Z_]+'\n  _help_max_conditionals_custom_parser:\n  - In the heuristic for C0201, how many conditionals to match\n  - within a loop in before considering the loop a parser.\n  max_conditionals_custom_parser: 2\n  _help_min_statement_spacing:\n  - Require at least this many newlines between statements\n  min_statement_spacing: 1\n  _help_max_statement_spacing:\n  - Require no more than this many newlines between statements\n  max_statement_spacing: 2\n  max_returns: 6\n  max_branches: 12\n  max_arguments: 5\n  max_localvars: 15\n  max_statements: 50\n_help_encode: Options affecting file encoding\nencode:\n  _help_emit_byteorder_mark:\n  - If true, emit the unicode byte-order mark (BOM) at the start\n  - of the file\n  emit_byteorder_mark: false\n  _help_input_encoding:\n  - Specify the encoding of the input file. Defaults to utf-8\n  input_encoding: utf-8\n  _help_output_encoding:\n  - Specify the encoding of the output file. Defaults to utf-8.\n  - Note that cmake only claims to support utf-8 so be careful\n  - when using anything else\n  output_encoding: utf-8\n_help_misc: Miscellaneous configurations options.\nmisc:\n  _help_per_command:\n  - A dictionary containing any per-command configuration\n  - overrides. Currently only `command_case` is supported.\n  per_command: {}\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/tensorrtx-issue-template.md",
    "content": "---\nname: tensorrtx issue template\nabout: To understand your issue better\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n## Env\n\n- GPU, e.g. V100, RTX2080, TX2, Xavier NX, Nano, etc.\n- OS, e.g. Ubuntu16.04, Win10, etc.\n- Cuda version\n- TensorRT version\n\n## About this repo\n\n- which branch/tag/commit are you using?\n- which model? yolov5, retinaface?\n\n## Your problem\n\n- what is your command? e.g. `sudo ./yolov5 -s`\n- what's your output?\n- what output do you expect?\n"
  },
  {
    "path": ".github/stale.yml",
    "content": "# Number of days of inactivity before an issue becomes stale\ndaysUntilStale: 60\n# Number of days of inactivity before a stale issue is closed\ndaysUntilClose: 7\n# Issues with these labels will never be considered stale\nexemptLabels:\n  - pinned\n  - security\n# Label to use when marking an issue as stale\nstaleLabel: wontfix\n# Comment to post when marking an issue as stale. Set to `false` to disable\nmarkComment: >\n  This issue has been automatically marked as stale because it has not had\n  recent activity. It will be closed if no further activity occurs. Thank you\n  for your contributions.\n# Comment to post when closing a stale issue. Set to `false` to disable\ncloseComment: false\n"
  },
  {
    "path": ".github/workflows/pre-commit.yml",
    "content": "name: pre-commit\n\non:\n  pull_request:\n    branches:\n      - master\n      - trt10\n\n  push:\n    branches:\n      - master\n      - trt10\n\njobs:\n  pre-commit:\n    runs-on: ubuntu-latest\n\n    steps:\n      - uses: actions/checkout@v5\n        with:\n          # grab the history of the PR\n          fetch-depth: 0\n\n      - name: Fetch commits\n        run: |\n          git fetch origin ${{ github.event.before }} || true\n          git fetch origin ${{ github.sha }}\n\n      - uses: actions/setup-python@v4\n\n      - uses: pre-commit/action@v3.0.1\n        if: github.event_name == 'push'\n        with:\n          extra_args: >\n            --from-ref ${{ github.event.before }}\n            --to-ref   ${{ github.sha }}\n            --show-diff-on-failure --color=always\n\n      - uses: pre-commit/action@v3.0.1\n        if: github.event_name == 'pull_request'\n        with:\n          extra_args: >\n            --from-ref ${{ github.event.pull_request.base.sha }}\n            --to-ref   ${{ github.event.pull_request.head.sha }}\n            --show-diff-on-failure --color=always\n"
  },
  {
    "path": ".gitignore",
    "content": "models\nbuild\n*.wts\n*.engine\n*.tpymodel\n*/*.ppm\n*idea*\n\n.vscode/*\n!.vscode/settings.json\n!.vscode/tasks.json\n!.vscode/launch.json\n!.vscode/extensions.json\n!.vscode/*.code-snippets\n\n# Local History for Visual Studio Code\n.history/\n\n# Built Visual Studio Code Extensions\n*.vsix\n\n.vscode/*\n!.vscode/settings.json\n!.vscode/tasks.json\n!.vscode/launch.json\n!.vscode/extensions.json\n!.vscode/*.code-snippets\n\n# Local History for Visual Studio Code\n.history/\n\n# Built Visual Studio Code Extensions\n*.vsix\n\n# Prerequisites\n*.d\n\n# Compiled Object files\n*.slo\n*.lo\n*.o\n*.obj\n\n# Precompiled Headers\n*.gch\n*.pch\n\n# Compiled Dynamic libraries\n*.so\n*.dylib\n*.dll\n\n# Fortran module files\n*.mod\n*.smod\n\n# Compiled Static libraries\n*.lai\n*.la\n*.a\n*.lib\n\n# Executables\n*.exe\n*.out\n*.app\n\nCMakeLists.txt.user\nCMakeCache.txt\nCMakeFiles\nCMakeScripts\nTesting\ncmake_install.cmake\ninstall_manifest.txt\ncompile_commands.json\nCTestTestfile.cmake\n_deps\nCMakeUserPresets.json\n"
  },
  {
    "path": ".pre-commit-config.yaml",
    "content": "repos:\n  - repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v4.5.0\n    hooks:\n      - id: check-merge-conflict\n      - id: check-symlinks\n      - id: end-of-file-fixer\n        types: [python]\n      - id: trailing-whitespace\n        types: [python]\n      - id: check-added-large-files\n  - repo: https://github.com/pre-commit/mirrors-clang-format\n    rev: v18.1.3\n    hooks:\n      - id: clang-format\n        types_or: [c++, c, cuda]\n        args: [-style=file]\n  - repo: https://github.com/PyCQA/flake8\n    rev: 7.0.0\n    hooks:\n      - id: flake8\n        args: [--max-line-length=120]\n  - repo: https://github.com/cheshirekow/cmake-format-precommit\n    rev: v0.6.13\n    hooks:\n      - id: cmake-format\n        additional_dependencies: [pyyaml]\n        args: [--in-place, -c, .cmake-format.yaml]\n        types: [file]\n        files: (\\.cmake|CMakeLists.txt)(.in)?$\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2019-2020 Wang Xinyu\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# TensorRTx\n\nTensorRTx aims to implement popular deep learning networks with TensorRT network definition API.\n\nWhy don't we use a parser (ONNX parser, UFF parser, caffe parser, etc), but use complex APIs to build a network from scratch? I have summarized the advantages in the following aspects.\n\n- **Flexible**, easy to modify the network, add/delete a layer or input/output tensor, replace a layer, merge layers, integrate preprocessing and postprocessing into network, etc.\n- **Debuggable**, construct the entire network in an incremental development manner, easy to get middle layer results.\n- **Educational**, learn about the network structure during this development, rather than treating everything as a black box.\n\nThe basic workflow of TensorRTx is:\n\n1. Get the trained models from pytorch, mxnet or tensorflow, etc. Some pytorch models can be found in my repo [pytorchx](https://github.com/wang-xinyu/pytorchx), the remaining are from popular open-source repos.\n2. Export the weights to a plain text file -- [.wts file](./tutorials/getting_started.md#the-wts-content-format).\n3. Load weights in TensorRT, define the network, build a TensorRT engine.\n4. Load the TensorRT engine and run inference.\n\n## News\n\n- `3 Mar 2026`. [zgjja](https://github.com/zgjja) Add Vision Transformer\n- `2 Feb 2026`. [fazligorkembal](https://github.com/fazligorkembal) Yolo26-Det, Yolo26-Obb, Yolo26-Cls\n- `15 Jan 2026`. [zgjja](https://github.com/zgjja) Refactor multiple old CV models to support TensorRT SDK through 7~10.\n- `8 Jan 2026`. [ydk61](https://github.com/ydk61): YOLOv13\n- `10 May 2025`. [pranavm-nvidia](https://github.com/pranavm-nvidia): [YOLO11](./yolo11_tripy) writen in [Tripy](https://github.com/NVIDIA/TensorRT-Incubator/tree/main/tripy).\n- `2 May 2025`. [fazligorkembal](https://github.com/fazligorkembal): YOLO12\n- `12 Apr 2025`. 
[pranavm-nvidia](https://github.com/pranavm-nvidia): First [Lenet](https://github.com/wang-xinyu/tensorrtx/tree/master/lenet#tripy-new-tensorrt-python-programming-model) example written in [Tripy](https://github.com/NVIDIA/TensorRT-Incubator/tree/main/tripy).\n- `11 Apr 2025`. [mpj1234](https://github.com/mpj1234): [YOLO11-obb](https://github.com/wang-xinyu/tensorrtx/tree/master/yolo11)\n- `22 Oct 2024`. [lindsayshuo](https://github.com/lindsayshuo): YOLOv8-obb\n- `18 Oct 2024`. [zgjja](https://github.com/zgjja): Refactor docker image.\n- `11 Oct 2024`. [mpj1234](https://github.com/mpj1234): YOLO11\n- `9 Oct 2024`. [Phoenix8215](https://github.com/Phoenix8215): GhostNet V1 and V2.\n- `21 Aug 2024`. [Lemonononon](https://github.com/Lemonononon): real-esrgan-general-x4v3\n- `29 Jul 2024`. [mpj1234](https://github.com/mpj1234): Check the YOLOv5, YOLOv8 & YOLOv10 in TensorRT 10.x API, branch → [trt10](https://github.com/wang-xinyu/tensorrtx/tree/trt10)\n- `29 Jul 2024`. [mpj1234](https://github.com/mpj1234): YOLOv10\n- `21 Jun 2024`. [WuxinrongY](https://github.com/WuxinrongY): YOLOv9-T, YOLOv9-S, YOLOv9-M\n- `28 Apr 2024`. [lindsayshuo](https://github.com/lindsayshuo): YOLOv8-pose\n- `22 Apr 2024`. [B1SH0PP](https://github.com/B1SH0PP): EfficientAd: Accurate Visual Anomaly Detection at Millisecond-Level Latencies.\n- `18 Apr 2024`. 
[lindsayshuo](https://github.com/lindsayshuo): YOLOv8-p2\n\n## Tutorials\n\n- [How to make a contribution](./tutorials/contribution.md)\n- [Install the dependencies.](./tutorials/install.md)\n- [A guide for quickly getting started, taking lenet5 as a demo.](./tutorials/getting_started.md)\n- [The .wts file content format](./tutorials/getting_started.md#the-wts-content-format)\n- [Frequently Asked Questions (FAQ)](./tutorials/faq.md)\n- [Migration Guide](./tutorials/migration_guide.md)\n- [How to implement multi-GPU processing, taking YOLOv4 as an example](./tutorials/multi_GPU_processing.md)\n- [Check if Your GPU Supports FP16/INT8](./tutorials/check_fp16_int8_support.md)\n- [How to Compile and Run on Windows](./tutorials/run_on_windows.md)\n- [Deploy YOLOv4 with Triton Inference Server](https://github.com/isarsoft/yolov4-triton-tensorrt)\n- [From pytorch to trt step by step, hrnet as an example (Chinese)](./tutorials/from_pytorch_to_trt_stepbystep_hrnet.md)\n\n## Test Environment\n\n1. (**NOT recommended**) TensorRT 7.x\n2. (**Recommended**) TensorRT 8.x\n3. (**NOT recommended**) TensorRT 10.x\n\n### Note\n\n1. For historical reasons, some models are limited to specific TensorRT versions; please check the README.md or the code of the model you want to use.\n2. Currently, TensorRT 8.x has the best compatibility and supports the most features.\n\n## How to run\n\n**Note**: each network can be built with the `CMakeLists.txt` in its subfolder, or you can build them all together with the top-level `CMakeLists.txt` of this project.\n\n- General procedures before building and running:\n\n```bash\n# 1. generate xxx.wts from https://github.com/wang-xinyu/pytorchx/tree/master/lenet\n# ...\n\n# 2. put xxx.wts on top of this folder\n# ...\n```\n\n- (_Option 1_) To build a single subproject in this project, do:\n\n```bash\n## enter the subfolder\ncd tensorrtx/xxx\n\n## configure & build\ncmake -S . 
-B build\nmake -C build\n```\n\n- (_Option 2_) To build many subprojects, first, in the top-level `CMakeLists.txt`, **comment out** the projects you don't want to build or that are not supported by your TensorRT version, e.g., you cannot build subprojects in `${TensorRT_8_Targets}` if your TensorRT is `7.x`. Then:\n\n```bash\n## enter the top of this project\ncd tensorrtx\n\n## configure & build\n# you may use \"Ninja\" rather than \"make\" to significantly boost the build speed\ncmake -G Ninja -S . -B build\nninja -C build\n```\n\n**WARNING**: This part is still under development; most subprojects are not adapted yet.\n\n- Run the generated executable, e.g.:\n\n```bash\n# serialize model to plan file i.e. 'xxx.engine'\nbuild/xxx -s\n\n# deserialize plan file and run inference\nbuild/xxx -d\n\n# (Optional) check if the output is the same as pytorchx/lenet\n# ...\n\n# (Optional) customize the project\n# ...\n```\n\nFor more details, see the `README.md` in each subfolder, where available.\n\n## Models\n\nThe following models are implemented.\n\n| Name                                     | Description                                                                                                                                                                                                                                                       |\n| ---------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| [mlp](./mlp)                             | the very basic model for starters, properly documented                                                                                                                                                                                                            |\n| [lenet](./lenet)                
         | the simplest, as a \"hello world\" of this project                                                                                                                                                                                                                  |\n| [alexnet](./alexnet)                     | easy to implement, all layers are supported in tensorrt                                                                                                                                                                                                           |\n| [googlenet](./googlenet)                 | GoogLeNet (Inception v1)                                                                                                                                                                                                                                          |\n| [inception](./inception)                 | Inception v3, v4                                                                                                                                                                                                                                                  |\n| [mnasnet](./mnasnet)                     | MNASNet with depth multiplier of 0.5 from the paper                                                                                                                                                                                                               |\n| [mobilenet](./mobilenet)                 | MobileNet v2, v3-small, v3-large                                                                                                                                                                                                                                  |\n| [resnet](./resnet)                       | resnet-18, resnet-50 and resnext50-32x4d are implemented                                                                                               
                                                                                                           |\n| [senet](./senet)                         | se-resnet50                                                                                                                                                                                                                                                       |\n| [shufflenet](./shufflenetv2)             | ShuffleNet v2 with 0.5x output channels                                                                                                                                                                                                                           |\n| [squeezenet](./squeezenet)               | SqueezeNet 1.1 model                                                                                                                                                                                                                                              |\n| [vgg](./vgg)                             | VGG 11-layer model                                                                                                                                                                                                                                                |\n| [ViT](./vit)                             | vision transformer, using weight and model from huggingface                                                                                                                                                                                                       |\n| [yolov3-tiny](./yolov3-tiny)             | weights and pytorch implementation from [ultralytics/yolov3](https://github.com/ultralytics/yolov3)                                                                                                                                                               |\n| [yolov3](./yolov3)                       | 
darknet-53, weights and pytorch implementation from [ultralytics/yolov3](https://github.com/ultralytics/yolov3)                                                                                                                                                   |\n| [yolov3-spp](./yolov3-spp)               | darknet-53, weights and pytorch implementation from [ultralytics/yolov3](https://github.com/ultralytics/yolov3)                                                                                                                                                   |\n| [yolov4](./yolov4)                       | CSPDarknet53, weights from [AlexeyAB/darknet](https://github.com/AlexeyAB/darknet#pre-trained-models), pytorch implementation from [ultralytics/yolov3](https://github.com/ultralytics/yolov3)                                                                    |\n| [yolov5](./yolov5)                       | yolov5 v1.0-v7.0 of [ultralytics/yolov5](https://github.com/ultralytics/yolov5), detection, classification and instance segmentation                                                                                                                              |\n| [yolov7](./yolov7)                       | yolov7 v0.1, pytorch implementation from [WongKinYiu/yolov7](https://github.com/WongKinYiu/yolov7)                                                                                                                                                                |\n| [yolov8](./yolov8)                       | yolov8, pytorch implementation from [ultralytics](https://github.com/ultralytics/ultralytics)                                                                                                                                                                     |\n| [yolov9](./yolov9)                       | The Pytorch implementation is [WongKinYiu/yolov9](https://github.com/WongKinYiu/yolov9).                                                                            
                                                                                              |\n| [yolov10](./yolov10)                     | The Pytorch implementation is [THU-MIG/yolov10](https://github.com/THU-MIG/yolov10).                                                                                                                                                                              |\n| [yolo11](./yolo11)                       | The Pytorch implementation is [ultralytics](https://github.com/ultralytics/ultralytics).                                                                                                                                                                          |\n| [yolo12](./yolov12)                      | The Pytorch implementation is [ultralytics](https://github.com/ultralytics/ultralytics).                                                                                                                                                                          |\n| [yolop](./yolop)                         | yolop, pytorch implementation from [hustvl/YOLOP](https://github.com/hustvl/YOLOP)                                                                                                                                                                                |\n| [retinaface](./retinaface)               | resnet50 and mobilnet0.25, weights from [biubug6/Pytorch_Retinaface](https://github.com/biubug6/Pytorch_Retinaface)                                                                                                                                               |\n| [arcface](./arcface)                     | LResNet50E-IR, LResNet100E-IR and MobileFaceNet, weights from [deepinsight/insightface](https://github.com/deepinsight/insightface)                                                                                                                               |\n| [retinafaceAntiCov](./retinafaceAntiCov) | mobilenet0.25, 
weights from [deepinsight/insightface](https://github.com/deepinsight/insightface), retinaface anti-COVID-19, detect face and mask attribute                                                                                                       |\n| [dbnet](./dbnet)                         | Scene Text Detection, weights from [BaofengZan/DBNet.pytorch](https://github.com/BaofengZan/DBNet.pytorch)                                                                                                                                                        |\n| [crnn](./crnn)                           | pytorch implementation from [meijieru/crnn.pytorch](https://github.com/meijieru/crnn.pytorch)                                                                                                                                                                     |\n| [ufld](./ufld)                           | pytorch implementation from [Ultra-Fast-Lane-Detection](https://github.com/cfzd/Ultra-Fast-Lane-Detection), ECCV2020                                                                                                                                              |\n| [hrnet](./hrnet)                         | hrnet-image-classification and hrnet-semantic-segmentation, pytorch implementation from [HRNet-Image-Classification](https://github.com/HRNet/HRNet-Image-Classification) and [HRNet-Semantic-Segmentation](https://github.com/HRNet/HRNet-Semantic-Segmentation) |\n| [psenet](./psenet)                       | PSENet Text Detection, tensorflow implementation from [liuheng92/tensorflow_PSENet](https://github.com/liuheng92/tensorflow_PSENet)                                                                                                                               |\n| [ibnnet](./ibnnet)                       | IBN-Net, pytorch implementation from [XingangPan/IBN-Net](https://github.com/XingangPan/IBN-Net), ECCV2018                                                                         
                                                                               |\n| [unet](./unet)                           | U-Net, pytorch implementation from [milesial/Pytorch-UNet](https://github.com/milesial/Pytorch-UNet)                                                                                                                                                              |\n| [repvgg](./repvgg)                       | RepVGG, pytorch implementation from [DingXiaoH/RepVGG](https://github.com/DingXiaoH/RepVGG)                                                                                                                                                                       |\n| [lprnet](./lprnet)                       | LPRNet, pytorch implementation from [xuexingyu24/License_Plate_Detection_Pytorch](https://github.com/xuexingyu24/License_Plate_Detection_Pytorch)                                                                                                                 |\n| [refinedet](./refinedet)                 | RefineDet, pytorch implementation from [luuuyi/RefineDet.PyTorch](https://github.com/luuuyi/RefineDet.PyTorch)                                                                                                                                                    |\n| [densenet](./densenet)                   | DenseNet-121, from torchvision.models                                                                                                                                                                                                                             |\n| [rcnn](./rcnn)                           | FasterRCNN and MaskRCNN, model from [detectron2](https://github.com/facebookresearch/detectron2)                                                                                                                                                                  |\n| [tsm](./tsm)                             | TSM: Temporal Shift Module for 
Efficient Video Understanding, ICCV2019                                                                                                                                                                                            |\n| [scaled-yolov4](./scaled-yolov4)         | yolov4-csp, pytorch from [WongKinYiu/ScaledYOLOv4](https://github.com/WongKinYiu/ScaledYOLOv4)                                                                                                                                                                    |\n| [centernet](./centernet)                 | CenterNet DLA-34, pytorch from [xingyizhou/CenterNet](https://github.com/xingyizhou/CenterNet)                                                                                                                                                                    |\n| [efficientnet](./efficientnet)           | EfficientNet b0-b8 and l2, pytorch from [lukemelas/EfficientNet-PyTorch](https://github.com/lukemelas/EfficientNet-PyTorch)                                                                                                                                       |\n| [detr](./detr)                           | DE⫶TR, pytorch from [facebookresearch/detr](https://github.com/facebookresearch/detr)                                                                                                                                                                             |\n| [swin-transformer](./swin-transformer)   | Swin Transformer - Semantic Segmentation, only support Swin-T. The Pytorch implementation is [microsoft/Swin-Transformer](https://github.com/microsoft/Swin-Transformer.git)                                                                                      |\n| [real-esrgan](./real-esrgan)             | Real-ESRGAN. 
The Pytorch implementation is [real-esrgan](https://github.com/xinntao/Real-ESRGAN)                                                                                                                                                                  |\n| [superpoint](./superpoint)               | SuperPoint. The Pytorch model is from [magicleap/SuperPointPretrainedNetwork](https://github.com/magicleap/SuperPointPretrainedNetwork)                                                                                                                           |\n| [csrnet](./csrnet)                       | CSRNet. The Pytorch implementation is [leeyeehoo/CSRNet-pytorch](https://github.com/leeyeehoo/CSRNet-pytorch)                                                                                                                                                     |\n| [EfficientAd](./efficient_ad)            | EfficientAd: Accurate Visual Anomaly Detection at Millisecond-Level Latencies. From [anomalib](https://github.com/openvinotoolkit/anomalib)                                                                                                                       |\n\n## Model Zoo\n\nThe .wts files can be downloaded from model zoo for quick evaluation. 
However, it is recommended to convert the .wts file from your own pytorch/mxnet/tensorflow model, so that you can retrain your own model.\n\n[GoogleDrive](https://drive.google.com/drive/folders/1Ri0IDa5OChtcA3zjqRTW57uG6TnfN4Do?usp=sharing) | [BaiduPan](https://pan.baidu.com/s/19s6hO8esU7-TtZEXN7G3OA) pwd: uvv2\n\n## Tricky Operations\n\nSome tricky operations were encountered in these models. They are already solved, but better solutions might exist.\n\n| Name                      | Description                                                                                           |\n| ------------------------- | ----------------------------------------------------------------------------------------------------- |\n| BatchNorm                 | Implement by a scale layer, used in resnet, googlenet, mobilenet, etc.                                |\n| MaxPool2d(ceil_mode=True) | use a padding layer before maxpool to solve ceil_mode=True, see googlenet.                            |\n| average pool with padding | use setAverageCountExcludesPadding() when necessary, see inception.                                   |\n| relu6                     | use `Relu6(x) = Relu(x) - Relu(x-6)`, see mobilenet.                                                  |\n| torch.chunk()             | implement the 'chunk(2, dim=C)' by tensorrt plugin, see shufflenet.                                   |\n| channel shuffle           | use two shuffle layers to implement `channel_shuffle`, see shufflenet.                                |\n| adaptive pool             | use fixed input dimension, and use regular average pooling, see shufflenet.                           |\n| leaky relu                | I wrote a leaky relu plugin, but PRelu in `NvInferPlugin.h` can be used, see yolov3 in branch `trt4`. |\n| yolo layer v1             | yolo layer is implemented as a plugin, see yolov3 in branch `trt4`.                                   |\n| yolo layer v2             | three yolo layers implemented in one plugin, see yolov3-spp.    
                                      |\n| upsample                  | replaced by a deconvolution layer, see yolov3.                                                        |\n| hsigmoid                  | hard sigmoid is implemented as a plugin, hsigmoid and hswish are used in mobilenetv3                  |\n| retinaface output decode  | implement a plugin to decode bbox, confidence and landmarks, see retinaface.                          |\n| mish                      | mish activation is implemented as a plugin, mish is used in yolov4                                    |\n| prelu                     | mxnet's prelu activation with trainable gamma is implemented as a plugin, used in arcface             |\n| HardSwish                 | hard_swish = x \\* hard_sigmoid, used in yolov5 v3.0                                                   |\n| LSTM                      | Implemented pytorch nn.LSTM() with tensorrt api                                                       |\n\n## Speed Benchmark\n\n| Models                    | Device               | BatchSize | Mode | Input Shape(HxW) | FPS  |\n| ------------------------- | -------------------- | :-------: | :--: | :--------------: | :--: |\n| YOLOv3-tiny               | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      | 333  |\n| YOLOv3(darknet53)         | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      | 39.2 |\n| YOLOv3(darknet53)         | Xeon E5-2620/GTX1080 |     1     | INT8 |     608x608      | 71.4 |\n| YOLOv3-spp(darknet53)     | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      | 38.5 |\n| YOLOv4(CSPDarknet53)      | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      | 35.7 |\n| YOLOv4(CSPDarknet53)      | Xeon E5-2620/GTX1080 |     4     | FP32 |     608x608      | 40.9 |\n| YOLOv4(CSPDarknet53)      | Xeon E5-2620/GTX1080 |     8     | FP32 |     608x608      | 41.3 |\n| YOLOv5-s v3.0             | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      | 142  
|\n| YOLOv5-s v3.0             | Xeon E5-2620/GTX1080 |     4     | FP32 |     608x608      | 173  |\n| YOLOv5-s v3.0             | Xeon E5-2620/GTX1080 |     8     | FP32 |     608x608      | 190  |\n| YOLOv5-m v3.0             | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      |  71  |\n| YOLOv5-l v3.0             | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      |  43  |\n| YOLOv5-x v3.0             | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      |  29  |\n| YOLOv5-s v4.0             | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      | 142  |\n| YOLOv5-m v4.0             | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      |  71  |\n| YOLOv5-l v4.0             | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      |  40  |\n| YOLOv5-x v4.0             | Xeon E5-2620/GTX1080 |     1     | FP32 |     608x608      |  27  |\n| RetinaFace(resnet50)      | Xeon E5-2620/GTX1080 |     1     | FP32 |     480x640      |  90  |\n| RetinaFace(resnet50)      | Xeon E5-2620/GTX1080 |     1     | INT8 |     480x640      | 204  |\n| RetinaFace(mobilenet0.25) | Xeon E5-2620/GTX1080 |     1     | FP32 |     480x640      | 417  |\n| ArcFace(LResNet50E-IR)    | Xeon E5-2620/GTX1080 |     1     | FP32 |     112x112      | 333  |\n| CRNN                      | Xeon E5-2620/GTX1080 |     1     | FP32 |      32x100      | 1000 |\n\nHelp wanted: if you have speed results, please open an issue or PR.\n\n## Acknowledgments & Contact\n\nContributions, questions and discussions are welcome. You can contact me using the info below.\n\nE-mail: wangxinyu_es@163.com\n\nWeChat ID: wangxinyu0375 (you can add me on WeChat to join the tensorrtx discussion group, **note: tensorrtx**)\n"
  },
  {
    "path": "alexnet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.14)\n\nproject(\n  alexnet\n  VERSION 0.1\n  LANGUAGES C CXX CUDA)\n\nif(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)\n  set(CMAKE_CUDA_ARCHITECTURES\n      75\n      80\n      86\n      89\n      90\n      100\n      120)\nendif()\n\nset(CMAKE_CXX_STANDARD 17)\nset(CMAKE_CXX_STANDARD_REQUIRED ON)\nset(CMAKE_CUDA_STANDARD 17)\nset(CMAKE_CUDA_STANDARD_REQUIRED ON)\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\nset(CMAKE_INCLUDE_CURRENT_DIR TRUE)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use static cudaruntime library\" OFF)\n\nfind_package(Threads REQUIRED)\nfind_package(CUDAToolkit REQUIRED)\nfind_package(OpenCV REQUIRED)\n\nif(NOT TARGET TensorRT::TensorRT)\n  include(FindTensorRT.cmake)\nelse()\n  message(\"TensorRT has been found, skipping for ${PROJECT_NAME}\")\nendif()\n\nadd_executable(${PROJECT_NAME} alexnet.cc)\n\ntarget_include_directories(${PROJECT_NAME} PRIVATE ${CMAKE_CURRENT_LIST_DIR}\n                                                   ${OpenCV_INCLUDE_DIRS})\n\ntarget_link_libraries(\n  ${PROJECT_NAME} PRIVATE Threads::Threads TensorRT::TensorRT CUDA::cudart\n                          ${OpenCV_LIBS})\n"
  },
  {
    "path": "alexnet/FindTensorRT.cmake",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nfunction(_guess_path var_name required_files)\n  set(_result \"\")\n\n  foreach(path_entry IN LISTS ARGN)\n    if(NOT EXISTS \"${path_entry}\")\n      message(DEBUG \"skip non-existing path '${path_entry}'\")\n      continue()\n    endif()\n\n    set(_ok TRUE)\n    foreach(required_file IN LISTS required_files)\n      if(NOT EXISTS \"${path_entry}/${required_file}\")\n        set(_ok FALSE)\n        message(DEBUG \"'${path_entry}' missing '${required_file}'\")\n        break()\n      endif()\n    endforeach()\n\n    if(_ok)\n      list(APPEND _result \"${path_entry}\")\n      message(DEBUG \"accept '${path_entry}'\")\n    else()\n      message(DEBUG \"reject '${path_entry}'\")\n    endif()\n  endforeach()\n\n  if(_result STREQUAL \"\")\n    message(\n      FATAL_ERROR\n        \"_guess_path(${var_name}) failed: no valid path found. required_files='${required_files}' candidates='${ARGN}'\"\n    )\n  endif()\n\n  set(${var_name}\n      \"${_result}\"\n      PARENT_SCOPE)\nendfunction()\n\n# add library\nadd_library(TensorRT IMPORTED INTERFACE)\nadd_library(TensorRT::TensorRT ALIAS TensorRT)\n\nset(TRT_VERSION\n    CACHE\n      STRING\n      \"TensorRT version, e.g. 
\\\"8.6.1.6\\\" or \\\"8.6.1.6+cuda12.0.1.011\\\", \\\"8.6.1.6.Windows10.x86_64.cuda-12.0\\\" etc\"\n)\n\nif(NOT TRT_VERSION STREQUAL \"\" AND NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  message(\n    WARNING\n      \"TRT_VERSION defined by cmake and environment variable both, using the later one\"\n  )\nendif()\n\nif(NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  set(TRT_VERSION $ENV{TRT_VERSION})\nendif()\n\nstring(REGEX MATCH \"([0-9]+)\" _match ${TRT_VERSION})\nset(TRT_MAJOR_VERSION \"${_match}\")\nunset(_match)\n\nif(WIN32)\n  set(TensorRT_DIR \"C:/Program Files/TensorRT-${TRT_VERSION}\")\n  if(NOT EXISTS \"${TensorRT_DIR}\")\n    message(\n      FATAL_ERROR\n        \"TensorRT_DIR=${TensorRT_DIR} does not exist!\"\n    )\n  endif()\n\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 10)\n    set(_modules nvinfer_10 nvinfer_plugin_10 nvinfer_vc_plugin_10\n                 nvinfer_dispatch_10 nvinfer_lean_10)\n    message(DEBUG \"Using ${_modules}\")\n  else()\n    set(_modules nvinfer nvinfer_plugin nvinfer_vc_plugin nvinfer_dispatch\n                 nvinfer_lean)\n  endif()\n\n  set(TensorRT_LIBRARY_DIR \"${TensorRT_DIR}/lib\")\n  set(TensorRT_INCLUDE_DIR \"${TensorRT_DIR}/include\")\nelseif(UNIX)\n  string(TOLOWER \"${CMAKE_SYSTEM_PROCESSOR}\" _trt_arch)\n  set(_trt_include_candidates)\n  if(_trt_arch MATCHES \"^(aarch64|arm64|arch64)$\")\n    set(_trt_include_candidates \"/usr/include/aarch64-linux-gnu\" \"/usr/include\"\n                                \"/usr/local/cuda/targets/aarch64-linux/include\")\n    set(_trt_library_candidates\n        \"/usr/local/tensorrt/targets/aarch64-linux-gnu/lib\"\n        \"/usr/lib/aarch64-linux-gnu\" \"/usr/lib/aarch64-linux-gnu/tegra\"\n        \"/usr/lib\")\n  elseif(_trt_arch MATCHES \"^(x86_64|amd64)$\")\n    set(_trt_include_candidates\n        \"/usr/local/tensorrt/targets/x86_64-linux-gnu/include\"\n        \"/usr/include/x86_64-linux-gnu\" \"/usr/include\")\n    set(_trt_library_candidates\n        
\"/usr/local/tensorrt/targets/x86_64-linux-gnu/lib\"\n        \"/usr/lib/x86_64-linux-gnu\" \"/usr/lib\")\n  else()\n    message(FATAL_ERROR \"Unknown architecture\")\n  endif()\n\n  set(_modules nvinfer nvinfer_plugin)\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 8)\n    list(APPEND _modules nvinfer_vc_plugin nvinfer_dispatch nvinfer_lean)\n  endif()\n\n  _guess_path(TensorRT_LIBRARY_DIR \"libnvinfer.so;libnvinfer_plugin.so\"\n              ${_trt_library_candidates})\n  message(STATUS \"TensorRT libraries: ${TensorRT_LIBRARY_DIR}\")\n  _guess_path(TensorRT_INCLUDE_DIR \"NvInfer.h\" ${_trt_include_candidates})\n  message(STATUS \"TensorRT includes: ${TensorRT_INCLUDE_DIR}\")\nendif()\n\nforeach(lib IN LISTS _modules)\n  find_library(\n    TensorRT_${lib}_LIBRARY\n    NAMES ${lib}\n    HINTS ${TensorRT_LIBRARY_DIR})\n  list(APPEND TensorRT_LIBRARIES ${TensorRT_${lib}_LIBRARY})\nendforeach()\n\ntarget_link_libraries(TensorRT INTERFACE ${TensorRT_LIBRARIES})\n\nmessage(STATUS \"Found TensorRT libs: ${TensorRT_LIBRARIES}\")\n\nset_target_properties(\n  TensorRT\n  PROPERTIES C_STANDARD 17\n             CXX_STANDARD 17\n             POSITION_INDEPENDENT_CODE ON\n             SKIP_BUILD_RPATH TRUE\n             BUILD_WITH_INSTALL_RPATH TRUE\n             INSTALL_RPATH \"$ORIGIN\"\n             INTERFACE_INCLUDE_DIRECTORIES \"${TensorRT_INCLUDE_DIR}\")\n\nunset(TRT_MAJOR_VERSION)\nunset(_modules)\nunset(_trt_include_candidates)\nunset(_trt_library_candidates)\nunset(_trt_arch)\n"
  },
  {
    "path": "alexnet/README.md",
    "content": "# alexnet\n\n## Introduction\n\nAlexNet model architecture comes from this paper: [One weird trick for parallelizing convolutional neural networks](https://arxiv.org/abs/1404.5997). To generate `.wts` file, you can refer to [pytorchx/alexnet](https://github.com/wang-xinyu/pytorchx/tree/master/alexnet). To check the pytorch implementation of AlexNet, refer to [HERE](https://github.com/pytorch/vision/blob/main/torchvision/models/alexnet.py#L17)\n\nAlexNet consists of 3 major parts: features, adaptive average pooling, and classifier:\n\n- features: just several stacked `CRP`(conv-relu-pool) and `CR` layers\n- adaptive average pooling: pytorch can decide its inner parameters, but we need to calculate it ourselves in TensorRT API\n- classifier: just several `fc-relu` layers. All layers can be implemented by tensorrt api, including `addConvolution`, `addActivation`, `addPooling`, `addMatrixMultiply`, `addElementWise` etc.\n\n## Use AlexNet from PyTorch\n\nWe can use torchvision to load the pretrained alexnet model:\n\n```python\nalexnet = torchvision.models.alexnet(pretrained=True)\n```\n\nThe model structure is:\n\n```bash\nAlexNet(\n  (features): Sequential(\n    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))\n    (1): ReLU(inplace=True)\n    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)\n    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))\n    (4): ReLU(inplace=True)\n    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)\n    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (7): ReLU(inplace=True)\n    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (9): ReLU(inplace=True)\n    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (11): ReLU(inplace=True)\n    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)\n  )\n  (avgpool): 
AdaptiveAvgPool2d(output_size=(6, 6))\n  (classifier): Sequential(\n    (0): Dropout(p=0.5, inplace=False)\n    (1): Linear(in_features=9216, out_features=4096, bias=True)\n    (2): ReLU(inplace=True)\n    (3): Dropout(p=0.5, inplace=False)\n    (4): Linear(in_features=4096, out_features=4096, bias=True)\n    (5): ReLU(inplace=True)\n    (6): Linear(in_features=4096, out_features=1000, bias=True)\n  )\n)\n```\n\n## Usage\n\n1. Use `gen_wts.py` to generate the `.wts` file.\n\n```bash\npython3 gen_wts.py\n```\n\n2. Build the C++ code.\n\n```bash\npushd tensorrtx/alexnet\ncmake -S . -B build -G Ninja --fresh\ncmake --build build\n```\n\n3. Serialize the `.wts` model to an engine file.\n\n```bash\n./build/alexnet -s\n```\n\n4. Run inference.\n\n```bash\n./build/alexnet -d\n```\n\nThe output looks like:\n\n```txt\n...\n====\nExecution time: 1ms\n0.1234, -0.5678, ...\n====\nprediction result:\nTop: 0 idx: 285, logits: 9.9, label: Egyptian cat\nTop: 1 idx: 281, logits: 8.304, label: tabby, tabby cat\nTop: 2 idx: 282, logits: 6.859, label: tiger cat\n```\n\n## FAQ\n\n### How to align the output with PyTorch?\n\nIf your output differs from PyTorch, you have to find out which TensorRT API call or which part of your code causes the difference. A simple approach is to check the `.engine` output part by part, e.g., you can mark an earlier layer of alexnet as the output:\n\n```c++\nfc3_1->getOutput(0)->setName(OUTPUT_NAME);\nnetwork->markOutput(*pool3->getOutput(0)); // original is: \"*fc3_1->getOutput(0)\"\n```\n\nHere, I use the output of the \"features\" part of alexnet and ignore the rest of the model. Then, don't forget to change the `OUTPUT_SIZE` macro at the top of the file. Lastly, rebuild the `.engine` file to apply the changes.\n\nYou can sum up all output values in the C++ code and compare the result with the PyTorch output; in PyTorch, you can compute this with `torch.sum(x)` while debugging. 
The deviation between the two sums should ideally fall within $[10^{-2}, 10^{-1}]$. For this example, since the \"features\" output has $256 * 6 * 6$ elements (batch = 1), the final error would be roughly $10^{-4}$.\n\nNote: This is a quick check. For a more accurate check, you have to save the output tensor to a file and compare the values one by one, but this situation is rare.\n"
  },
  {
    "path": "alexnet/alexnet.cc",
    "content": "#include <array>\n#include <chrono>\n#include <cmath>\n#include <opencv2/opencv.hpp>\n#include <vector>\n#include \"logging.h\"\n#include \"utils.h\"\n\n// stuff we know about alexnet\nconstexpr const int32_t N = 1;\nconstexpr const int32_t INPUT_H = 224;\nconstexpr const int32_t INPUT_W = 224;\nconstexpr const std::array<int64_t, 3> SIZES = {3ll * INPUT_H * INPUT_W, 1000};\n\nconstexpr const std::array<const char*, 2> NAMES = {\"data\", \"prob\"};\nconstexpr const char* ENGINE_PATH = \"../models/alexnet.engine\";\nconstexpr const char* WTS_PATH = \"../models/alexnet.wts\";\nconstexpr const char* LABELS_PATH = \"../assets/imagenet1000_clsidx_to_labels.txt\";\nstatic constexpr const bool TRT_PREPROCESS = TRT_VERSION >= 8510 ? true : false;\nstatic constexpr const std::array<const float, 3> mean = {0.485f, 0.456f, 0.406f};\nstatic constexpr const std::array<const float, 3> stdv = {0.229f, 0.224f, 0.225f};\n\nusing WeightMap = std::map<std::string, Weights>;\nusing M = nvinfer1::MatrixOperation;\nusing E = nvinfer1::ElementWiseOperation;\nusing NDCF = nvinfer1::NetworkDefinitionCreationFlag;\n\nstatic Logger gLogger;\n\n/**\n * @brief Create the engine using TensorRT API and without any parser.\n *\n * @param N max batch size\n * @param builder\n * @param config\n * @param dt\n * @return ICudaEngine*\n */\nICudaEngine* createEngine(int32_t N, IRuntime* runtime, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    WeightMap weightMap = loadWeights(WTS_PATH);\n\n#if TRT_VERSION >= 11200\n    auto flag = 1U << static_cast<int>(NDCF::kSTRONGLY_TYPED);\n#elif TRT_VERSION >= 10000\n    auto flag = 0U;\n#else\n    auto flag = 1U << static_cast<int>(NDCF::kEXPLICIT_BATCH);\n#endif\n    auto* network = builder->createNetworkV2(flag);\n\n    // Create input tensor\n    ITensor* input{nullptr};\n    if constexpr (TRT_PREPROCESS) {\n        dt = DataType::kUINT8;\n        input = network->addInput(NAMES[0], dt, Dims4{N, INPUT_H, INPUT_W, 3});\n        
auto* trans = addTransformLayer(network, *input, true, mean, stdv);\n        input = trans->getOutput(0);\n    } else {\n        input = network->addInput(NAMES[0], dt, Dims4{N, 3, INPUT_H, INPUT_W});\n    }\n    assert(input);\n\n    // CRP (Conv-Relu-Pool)\n    auto* conv1 = network->addConvolutionNd(*input, 64, DimsHW{11, 11}, weightMap[\"features.0.weight\"],\n                                            weightMap[\"features.0.bias\"]);\n    auto* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    auto* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(conv1 && relu1 && pool1);\n    conv1->setStrideNd(DimsHW{4, 4});\n    conv1->setPaddingNd(DimsHW{2, 2});\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    // CRP\n    auto* conv2 = network->addConvolutionNd(*pool1->getOutput(0), 192, DimsHW{5, 5}, weightMap[\"features.3.weight\"],\n                                            weightMap[\"features.3.bias\"]);\n    auto* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);\n    auto* pool2 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(conv2 && pool2 && relu2);\n    conv2->setPaddingNd(DimsHW{2, 2});\n    pool2->setStrideNd(DimsHW{2, 2});\n\n    // CR\n    auto* conv3 = network->addConvolutionNd(*pool2->getOutput(0), 384, DimsHW{3, 3}, weightMap[\"features.6.weight\"],\n                                            weightMap[\"features.6.bias\"]);\n    auto* relu3 = network->addActivation(*conv3->getOutput(0), ActivationType::kRELU);\n    assert(conv3 && relu3);\n    conv3->setPaddingNd(DimsHW{1, 1});\n\n    // CR\n    auto* conv4 = network->addConvolutionNd(*relu3->getOutput(0), 256, DimsHW{3, 3}, weightMap[\"features.8.weight\"],\n                                            weightMap[\"features.8.bias\"]);\n    auto* relu4 = network->addActivation(*conv4->getOutput(0), ActivationType::kRELU);\n    assert(conv4 && relu4);\n    
conv4->setPaddingNd(DimsHW{1, 1});\n\n    // CRP\n    auto* conv5 = network->addConvolutionNd(*relu4->getOutput(0), 256, DimsHW{3, 3}, weightMap[\"features.10.weight\"],\n                                            weightMap[\"features.10.bias\"]);\n    auto* relu5 = network->addActivation(*conv5->getOutput(0), ActivationType::kRELU);\n    assert(conv5);\n    auto* pool3 = network->addPoolingNd(*relu5->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(conv5 && relu5 && pool3);\n    conv5->setPaddingNd(DimsHW{1, 1});\n    pool3->setStrideNd(DimsHW{2, 2});\n\n    // adaptive average pooling\n    auto* adaptive_pool = network->addPoolingNd(*pool3->getOutput(0), PoolingType::kAVERAGE, DimsHW{1, 1});\n    assert(adaptive_pool);\n    IShuffleLayer* shuffle = network->addShuffle(*adaptive_pool->getOutput(0));\n    assert(shuffle);\n    shuffle->setReshapeDimensions(Dims2{N, -1});  // \"-1\" means \"256 * 6 * 6\"\n\n    // all classifier tensors\n    int64_t in_feat = 256ll * 6 * 6;\n    auto* fc1w = network->addConstant(DimsHW{4096, in_feat}, weightMap[\"classifier.1.weight\"])->getOutput(0);\n    auto* fc1b = network->addConstant(DimsHW{1, 4096}, weightMap[\"classifier.1.bias\"])->getOutput(0);\n    auto* fc2w = network->addConstant(DimsHW{4096, 4096}, weightMap[\"classifier.4.weight\"])->getOutput(0);\n    auto* fc2b = network->addConstant(DimsHW{1, 4096}, weightMap[\"classifier.4.bias\"])->getOutput(0);\n    auto* fc3w = network->addConstant(DimsHW{1000, 4096}, weightMap[\"classifier.6.weight\"])->getOutput(0);\n    auto* fc3b = network->addConstant(DimsHW{1, 1000}, weightMap[\"classifier.6.bias\"])->getOutput(0);\n    assert(fc1w && fc1b && fc2w && fc2b && fc3w && fc3b);\n\n    // all layers in classifier\n    auto* fc1_0 = network->addMatrixMultiply(*shuffle->getOutput(0), M::kNONE, *fc1w, M::kTRANSPOSE);\n    auto* fc1_1 = network->addElementWise(*fc1_0->getOutput(0), *fc1b, E::kSUM);\n    auto* relu6 = network->addActivation(*fc1_1->getOutput(0), 
ActivationType::kRELU);\n    assert(fc1_0 && fc1_1 && relu6);\n    fc1_0->setName(\"fc1_0\");  // set name here, only for debug purpose\n    auto* fc2_0 = network->addMatrixMultiply(*relu6->getOutput(0), M::kNONE, *fc2w, M::kTRANSPOSE);\n    auto* fc2_1 = network->addElementWise(*fc2_0->getOutput(0), *fc2b, E::kSUM);\n    auto* relu7 = network->addActivation(*fc2_1->getOutput(0), ActivationType::kRELU);\n    assert(fc2_0 && fc2_1 && relu7);\n    fc2_0->setName(\"fc2_0\");\n    auto* fc3_0 = network->addMatrixMultiply(*relu7->getOutput(0), M::kNONE, *fc3w, M::kTRANSPOSE);\n    auto* fc3_1 = network->addElementWise(*fc3_0->getOutput(0), *fc3b, E::kSUM);\n    assert(fc3_0 && fc3_1);\n    fc3_0->setName(\"fc3_0\");\n\n    fc3_1->getOutput(0)->setName(NAMES[1]);\n    network->markOutput(*fc3_1->getOutput(0));\n\n    // Build engine\n#if TRT_VERSION >= 8000\n    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, WORKSPACE_SIZE);\n    auto* host_mem = builder->buildSerializedNetwork(*network, *config);\n    auto* engine = runtime->deserializeCudaEngine(host_mem->data(), host_mem->size());\n    delete network;\n#else\n    builder->setMaxBatchSize(N);\n    config->setMaxWorkspaceSize(WORKSPACE_SIZE);\n    auto* engine = builder->buildEngineWithConfig(*network, *config);\n    network->destroy();\n#endif\n\n    std::cout << \"build finished\\n\";\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(int32_t N, IRuntime* runtime, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(N, runtime, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything 
down\n#if TRT_VERSION >= 8000\n    delete engine;\n    delete config;\n    delete builder;\n#else\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n#endif\n}\n\nstd::vector<std::vector<float>> doInference(IExecutionContext& context, const std::string& img_path,\n                                            std::size_t batchSize) {\n    static std::vector<float> flat_img;\n    auto img = cv::imread(img_path, cv::IMREAD_COLOR);\n    void* input = nullptr;\n\n    // use preprocess from gpu(TensorRT) or cpu(OpenCV)\n    if constexpr (TRT_PREPROCESS) {\n        // for simplicity, resize image on cpu side\n        cv::resize(img, img, cv::Size(INPUT_W, INPUT_H), 0, 0, cv::INTER_LINEAR);\n        input = static_cast<void*>(img.data);\n    } else {\n        flat_img = preprocess_img(img, true, mean, stdv, N, INPUT_H, INPUT_W);\n        input = flat_img.data();\n    }\n    assert(input);\n\n    const ICudaEngine& engine = context.getEngine();\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n    std::vector<void*> buffers;\n\n#if TRT_VERSION >= 8000\n    const int32_t nIO = engine.getNbIOTensors();\n#else\n    const int32_t nIO = engine.getNbBindings();\n#endif\n\n    buffers.resize(nIO);\n    for (auto i = 0; i < nIO; ++i) {\n#if TRT_VERSION >= 8000\n        auto* tensor_name = engine.getIOTensorName(i);\n        auto s = getSize(engine.getTensorDataType(tensor_name));\n        std::size_t size = s * batchSize * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        context.setTensorAddress(tensor_name, buffers[i]);\n#else\n        const int32_t idx = engine.getBindingIndex(NAMES[i]);\n        auto s = getSize(engine.getBindingDataType(idx));\n        assert(idx == i);\n        std::size_t size = s * batchSize * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n#endif\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n    }\n\n#if 
TRT_VERSION >= 8000\n    const bool enqueued = context.enqueueV3(stream);\n#else\n    const bool enqueued = context.enqueueV2(buffers.data(), stream, nullptr);\n#endif\n    // keep the enqueue call outside assert() so it still runs when NDEBUG is defined\n    assert(enqueued);\n    (void)enqueued;\n\n    std::vector<std::vector<float>> prob;\n    prob.reserve(static_cast<std::size_t>(nIO));\n    for (int i = 1; i < nIO; ++i) {\n        // copy straight into the returned buffer so the destination outlives the async copy\n        prob.emplace_back(batchSize * SIZES[i], std::nanf(\"\"));\n        std::size_t size = batchSize * SIZES[i] * sizeof(float);\n        CHECK(cudaMemcpyAsync(prob.back().data(), buffers[i], size, cudaMemcpyDeviceToHost, stream));\n    }\n    CHECK(cudaStreamSynchronize(stream));\n\n    CHECK(cudaStreamDestroy(stream));\n    for (auto i = 0; i < nIO; ++i) {\n        CHECK(cudaFree(buffers[i]));\n    }\n    return prob;\n}\n\nint main(int argc, char** argv) {\n    checkTrtEnv();\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\\n\";\n        std::cerr << \"./alexnet -s   // serialize model to plan file\\n\";\n        std::cerr << \"./alexnet -d   // deserialize plan file and run inference\\n\";\n        return -1;\n    }\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n\n    // create a model using the API directly and serialize it to a stream\n    char* trtModelStream{nullptr};\n    std::streamsize size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(N, runtime, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(ENGINE_PATH, std::ios::binary | std::ios::trunc);\n        if (!p) {\n            std::cerr << \"could not open plan output file\\n\";\n            return -1;\n        }\n        if (modelStream->size() > static_cast<std::size_t>(std::numeric_limits<std::streamsize>::max())) {\n            std::cerr << \"this model is too large to serialize\\n\";\n            return -1;\n        }\n        const auto* data_ptr = reinterpret_cast<const char*>(modelStream->data());\n        auto data_size = static_cast<std::streamsize>(modelStream->size());\n        
p.write(data_ptr, data_size);\n\n#if TRT_VERSION >= 8000\n        delete modelStream;\n#else\n        modelStream->destroy();\n#endif\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(ENGINE_PATH, std::ios::binary);\n        if (!file.good()) {\n            std::cerr << \"could not open plan file\\n\";\n            return -1;\n        }\n        file.seekg(0, file.end);\n        size = file.tellg();\n        file.seekg(0, file.beg);\n        trtModelStream = new char[size];\n        file.read(trtModelStream, size);\n        file.close();\n    } else {\n        return -1;\n    }\n\n#if TRT_VERSION >= 8000\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n#else\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n#endif\n    // the serialized plan is no longer needed once the engine has been deserialized\n    delete[] trtModelStream;\n    assert(engine != nullptr);\n\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n\n    const std::string img_path = \"../assets/cats.jpg\";\n    for (int32_t i = 0; i < 100; ++i) {\n        auto _start = std::chrono::system_clock::now();\n        auto prob = doInference(*context, img_path, N);\n        auto _end = std::chrono::system_clock::now();\n        auto _time = std::chrono::duration_cast<std::chrono::milliseconds>(_end - _start).count();\n        std::cout << \"Execution time: \" << _time << \"ms\\n\";\n\n        for (const auto& vector : prob) {\n            int idx = 0;\n            for (auto v : vector) {\n                std::cout << std::setprecision(4) << v << \", \" << std::flush;\n                if (++idx > 20) {\n                    std::cout << \"\\n====\\n\";\n                    break;\n                }\n            }\n        }\n\n        if (i == 99) {\n            std::cout << \"prediction result:\\n\";\n            auto labels = loadImagenetLabelMap(LABELS_PATH);\n            int _top = 0;\n            for (auto& [idx, logits] : topk(prob[0], 3)) {\n                std::cout << 
\"Top: \" << _top++ << \" idx: \" << idx << \", logits: \" << logits\n                          << \", label: \" << labels[idx] << \"\\n\";\n            }\n        }\n    }\n\n#if TRT_VERSION >= 8000\n    delete context;\n    delete engine;\n    delete runtime;\n#else\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n#endif\n    return 0;\n}\n"
  },
  {
    "path": "alexnet/alexnet.py",
    "content": "import os\nimport sys\nimport struct\nimport argparse\n\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nBATCH_SIZE = 1\nINPUT_H = 224\nINPUT_W = 224\nOUTPUT_SIZE = 1000\nINPUT_BLOB_NAME = \"data\"\nOUTPUT_BLOB_NAME = \"prob\"\n\nWEIGHT_PATH = \"./alexnet.wts\"\nENGINE_PATH = \"./alexnet.engine\"\n\nTRT_LOGGER = trt.Logger(trt.Logger.INFO)\n\n\ndef load_weights(file):\n    print(f\"Loading weights: {file}\")\n\n    assert os.path.exists(file), 'Unable to load weight file.'\n\n    weight_map = {}\n    with open(file, \"r\") as f:\n        lines = [line.strip() for line in f]\n    count = int(lines[0])\n    assert count == len(lines) - 1\n    for i in range(1, count + 1):\n        splits = lines[i].split(\" \")\n        name = splits[0]\n        cur_count = int(splits[1])\n        assert cur_count + 2 == len(splits)\n        values = []\n        for j in range(2, len(splits)):\n            # hex string to bytes to float\n            values.append(struct.unpack(\">f\", bytes.fromhex(splits[j])))\n        weight_map[name] = np.array(values, dtype=np.float32)\n\n    return weight_map\n\n\ndef create_engine(max_batch_size, builder, config, dt):\n    weight_map = load_weights(WEIGHT_PATH)\n    network = builder.create_network()\n\n    data = network.add_input(INPUT_BLOB_NAME, dt, (3, INPUT_H, INPUT_W))\n    assert data\n\n    conv1 = network.add_convolution(input=data,\n                                    num_output_maps=64,\n                                    kernel_shape=(11, 11),\n                                    kernel=weight_map[\"features.0.weight\"],\n                                    bias=weight_map[\"features.0.bias\"])\n    assert conv1\n    conv1.stride = (4, 4)\n    conv1.padding = (2, 2)\n\n    relu1 = network.add_activation(conv1.get_output(0), type=trt.ActivationType.RELU)\n    assert relu1\n\n    pool1 = network.add_pooling(input=relu1.get_output(0),\n                                
type=trt.PoolingType.MAX,\n                                window_size=trt.DimsHW(3, 3))\n    assert pool1\n    pool1.stride_nd = (2, 2)\n\n    conv2 = network.add_convolution(input=pool1.get_output(0),\n                                    num_output_maps=192,\n                                    kernel_shape=(5, 5),\n                                    kernel=weight_map[\"features.3.weight\"],\n                                    bias=weight_map[\"features.3.bias\"])\n    assert conv2\n    conv2.padding = (2, 2)\n\n    relu2 = network.add_activation(conv2.get_output(0), type=trt.ActivationType.RELU)\n    assert relu2\n\n    pool2 = network.add_pooling(input=relu2.get_output(0),\n                                type=trt.PoolingType.MAX,\n                                window_size=trt.DimsHW(3, 3))\n    assert pool2\n    pool2.stride_nd = (2, 2)\n\n    conv3 = network.add_convolution(input=pool2.get_output(0),\n                                    num_output_maps=384,\n                                    kernel_shape=(3, 3),\n                                    kernel=weight_map[\"features.6.weight\"],\n                                    bias=weight_map[\"features.6.bias\"])\n    assert conv3\n    conv3.padding = (1, 1)\n\n    relu3 = network.add_activation(conv3.get_output(0), type=trt.ActivationType.RELU)\n    assert relu3\n\n    conv4 = network.add_convolution(input=relu3.get_output(0),\n                                    num_output_maps=256,\n                                    kernel_shape=(3, 3),\n                                    kernel=weight_map[\"features.8.weight\"],\n                                    bias=weight_map[\"features.8.bias\"])\n    assert conv4\n    conv4.padding = (1, 1)\n\n    relu4 = network.add_activation(conv4.get_output(0), type=trt.ActivationType.RELU)\n    assert relu4\n\n    conv5 = network.add_convolution(input=relu4.get_output(0),\n                                    num_output_maps=256,\n                                    
kernel_shape=(3, 3),\n                                    kernel=weight_map[\"features.10.weight\"],\n                                    bias=weight_map[\"features.10.bias\"])\n    assert conv5\n    conv5.padding = (1, 1)\n\n    relu5 = network.add_activation(conv5.get_output(0), type=trt.ActivationType.RELU)\n    assert relu5\n\n    pool3 = network.add_pooling(input=relu5.get_output(0),\n                                type=trt.PoolingType.MAX,\n                                window_size=trt.DimsHW(3, 3))\n    assert pool3\n    pool3.stride_nd = (2, 2)\n\n    fc1 = network.add_fully_connected(input=pool3.get_output(0),\n                                      num_outputs=4096,\n                                      kernel=weight_map[\"classifier.1.weight\"],\n                                      bias=weight_map[\"classifier.1.bias\"])\n    assert fc1\n\n    relu6 = network.add_activation(fc1.get_output(0), type=trt.ActivationType.RELU)\n    assert relu6\n\n    fc2 = network.add_fully_connected(input=relu6.get_output(0),\n                                      num_outputs=4096,\n                                      kernel=weight_map[\"classifier.4.weight\"],\n                                      bias=weight_map[\"classifier.4.bias\"])\n    assert fc2\n\n    relu7 = network.add_activation(fc2.get_output(0), type=trt.ActivationType.RELU)\n    assert relu7\n\n    fc3 = network.add_fully_connected(input=relu7.get_output(0),\n                                      num_outputs=1000,\n                                      kernel=weight_map[\"classifier.6.weight\"],\n                                      bias=weight_map[\"classifier.6.bias\"])\n    assert fc3\n\n    fc3.get_output(0).name = OUTPUT_BLOB_NAME\n    network.mark_output(fc3.get_output(0))\n\n    # Build Engine\n    builder.max_batch_size = max_batch_size\n    builder.max_workspace_size = 1 << 20\n    engine = builder.build_engine(network, config)\n\n    del network\n    del weight_map\n\n    return 
engine\n\n\ndef API_to_model(max_batch_size):\n    builder = trt.Builder(TRT_LOGGER)\n    config = builder.create_builder_config()\n    engine = create_engine(max_batch_size, builder, config, trt.float32)\n    assert engine\n    with open(ENGINE_PATH, \"wb\") as f:\n        f.write(engine.serialize())\n\n    del engine\n    del builder\n    del config\n\n\nclass HostDeviceMem(object):\n    def __init__(self, host_mem, device_mem):\n        self.host = host_mem\n        self.device = device_mem\n\n    def __str__(self):\n        return \"Host:\\n\" + str(self.host) + \"\\nDevice:\\n\" + str(self.device)\n\n    def __repr__(self):\n        return self.__str__()\n\n\ndef allocate_buffers(engine):\n    inputs = []\n    outputs = []\n    bindings = []\n    stream = cuda.Stream()\n    for binding in engine:\n        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n        dtype = trt.nptype(engine.get_binding_dtype(binding))\n        # Allocate host and device buffers\n        host_mem = cuda.pagelocked_empty(size, dtype)\n        device_mem = cuda.mem_alloc(host_mem.nbytes)\n        # Append the device buffer to device bindings.\n        bindings.append(int(device_mem))\n        # Append to the appropriate list.\n        if engine.binding_is_input(binding):\n            inputs.append(HostDeviceMem(host_mem, device_mem))\n        else:\n            outputs.append(HostDeviceMem(host_mem, device_mem))\n    return inputs, outputs, bindings, stream\n\n\ndef do_inference(context, bindings, inputs, outputs, stream, batch_size=1):\n    # Transfer input data to the GPU.\n    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]\n    # Run inference.\n    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)\n    # Transfer predictions back from the GPU.\n    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]\n    # Synchronize the stream\n    stream.synchronize()\n    # 
Return only the host outputs.\n    return [out.host for out in outputs]\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"-s\", action='store_true')\n    parser.add_argument(\"-d\", action='store_true')\n    args = parser.parse_args()\n\n    if not (args.s ^ args.d):\n        print(\n            \"arguments not right!\\n\"\n            \"python alexnet.py -s   # serialize model to plan file\\n\"\n            \"python alexnet.py -d   # deserialize plan file and run inference\"\n        )\n        # exit with a non-zero status so callers can detect the usage error\n        sys.exit(1)\n\n    if args.s:\n        API_to_model(BATCH_SIZE)\n    else:\n        runtime = trt.Runtime(TRT_LOGGER)\n        assert runtime\n\n        with open(ENGINE_PATH, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        assert engine\n\n        context = engine.create_execution_context()\n        assert context\n\n        data = np.ones((BATCH_SIZE * 3 * INPUT_H * INPUT_W), dtype=np.float32)\n        inputs, outputs, bindings, stream = allocate_buffers(engine)\n        # copy into the page-locked host buffer instead of replacing it with a pageable array\n        np.copyto(inputs[0].host, data)\n\n        trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)\n\n        print(f'Output: \\n{trt_outputs[0][:10]}\\n{trt_outputs[0][-10:]}')\n"
  },
  {
    "path": "alexnet/gen_wts.py",
    "content": "import struct\n\nimport cv2\nimport numpy as np\nimport torch\nfrom torchvision.models import alexnet\n\n\ndef read_imagenet_labels() -> dict[int, str]:\n    \"\"\"\n    read ImageNet 1000 labels\n\n    Returns:\n        dict[int, str]: labels dict\n    \"\"\"\n    clsid2label = {}\n    with open(\"../assets/imagenet1000_clsidx_to_labels.txt\", \"r\") as f:\n        for i in f.readlines():\n            k, v = i.split(\": \")\n            clsid2label.setdefault(int(k), v[1:-3])\n    return clsid2label\n\n\ndef preprocess(img: np.array) -> torch.Tensor:\n    \"\"\"\n    a preprocess method align with ImageNet dataset\n\n    Args:\n        img (np.array): input image\n\n    Returns:\n        torch.Tensor: preprocessed image in `NCHW` layout\n    \"\"\"\n    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0\n    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)\n    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)\n    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)\n    img = (img - mean) / std\n    img = img.transpose(2, 0, 1)[None, ...]\n    return torch.from_numpy(img)\n\n\nif __name__ == \"__main__\":\n    img = cv2.imread(\"../assets/cats.jpg\", cv2.IMREAD_COLOR)\n    img = preprocess(img)\n    model = alexnet(pretrained=True)\n    model.eval()\n    output = model(img)\n    labels = read_imagenet_labels()\n    for batch in torch.topk(output, k=3).indices:\n        for i, j in enumerate(batch, 1):\n            print(f\"top: {i:<2}, confidence: {float(output[0, j]):.4f}, label: {labels[int(j)]}\")\n\n    print(\"writing alexnet wts\")\n    with open(\"../models/alexnet.wts\", \"w\") as f:\n        f.write(\"{}\\n\".format(len(model.state_dict().keys())))\n        for k, v in model.state_dict().items():\n            print(f\"key: {k}\\tvalue: {v.shape}\")\n            vr = v.reshape(-1).cpu().numpy()\n            f.write(\"{} {}\".format(k, len(vr)))\n            for vv in vr:\n                
f.write(\" \")\n                f.write(struct.pack(\">f\", float(vv)).hex())\n            f.write(\"\\n\")\n"
  },
  {
    "path": "alexnet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <cstdint>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include <utility>\n#include \"NvInferRuntime.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(std::move(prefix)), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) noexcept\n        : mOutput(other.mOutput), mPrefix(std::move(other.mPrefix)), mShouldLog(other.mShouldLog) {}\n\n    ~LogStreamConsumerBuffer() override {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing 
the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    int sync() override {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mBuffer(stream, std::move(prefix), shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other) noexcept\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   private:\n    struct TestInfo;\n\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult : std::uint8_t {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer1::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << '\\n';\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! 
This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, TestInfo info)\n            : mStarted(started), mName(std::move(info.name)), mCmdline(std::move(info.cmdline)) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom{false, TestInfo{name, cmdline}};\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! 
\\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    [[nodiscard]] Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    struct TestInfo {\n        std::string name;\n        std::string cmdline;\n    };\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << '\\n';\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kVERBOSE};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINFO};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kWARNING};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kERROR};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINTERNAL_ERROR};\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "alexnet/macros.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#define TRT_VERSION \\\n    ((NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + (NV_TENSORRT_PATCH * 10) + NV_TENSORRT_BUILD)\n\n#if TRT_VERSION >= 8000\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n"
  },
  {
    "path": "alexnet/utils.h",
    "content": "#pragma once\n#include <cuda_runtime_api.h>\n#include <algorithm>\n#include <cassert>\n#include <cstddef>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <memory>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\nconstexpr const std::size_t WORKSPACE_SIZE = 16 << 20;\n\n#define CHECK(status)                                     \\\n    do {                                                  \\\n        auto ret = (status);                              \\\n        if (ret != cudaSuccess) {                         \\\n            std::cerr << \"Cuda failure: \" << ret << \"\\n\"; \\\n            std::abort();                                 \\\n        }                                                 \\\n    } while (0)\n\nstatic void checkTrtEnv(int device = 0) {\n#if TRT_VERSION < 8000\n    CHECK(cudaGetDevice(&device));\n    cudaDeviceProp prop{};\n    CHECK(cudaGetDeviceProperties(&prop, device));\n    const int sm = prop.major * 10 + prop.minor;\n    if (sm > 86) {\n        std::cerr << \"TensorRT < 8 does not support SM > 86 on this GPU.\";\n        std::abort();\n    }\n#endif\n}\n\n/**\n * @brief TensorRT weight files have a simple space delimited format:\n * [type] [size] <data x size in hex>\n * \n * @param file input weight file path\n * @return std::map<std::string, nvinfer1::Weights> \n */\nstatic auto loadWeights(const std::string& file) {\n    std::cout << \"Loading weights: \" << file << \"\\n\";\n    std::map<std::string, nvinfer1::Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n\n        // Read name 
and type of blob\n        std::string name;\n        input >> name >> std::dec >> wt.count;\n\n        // Load blob\n        auto* val = new uint32_t[wt.count];\n        input >> std::hex;\n        for (auto x = 0ll; x < wt.count; ++x) {\n            input >> val[x];\n        }\n        wt.values = val;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\n/**\n * @brief a preprocess function aligning with ImageNet preprocess in torchvision, only support 3-channel image\n * \n * @param img opencv image with BGR layout\n * @param bgr2rgb whether to convert BGR to RGB\n * @param mean subtract mean\n * @param std divide std\n * @param n batch size\n * @param h resize height\n * @param w resize width\n * @return std::vector<float> contiguous flatten image data in float32 type\n */\nstatic std::vector<float> preprocess_img(cv::Mat& img, bool bgr2rgb, const std::array<const float, 3>& mean,\n                                         const std::array<const float, 3>& std, int n, int h, int w) {\n    const auto c = img.channels();\n    const auto size = c * h * w;\n    if (c != 3) {\n        std::cerr << \"this demo only supports 3 channel input image.\\n\";\n        std::abort();\n    }\n    if (bgr2rgb) {\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    }\n    cv::resize(img, img, cv::Size(w, h), 0, 0, cv::INTER_LINEAR);\n    img.convertTo(img, CV_32FC3, 1.f / 255);\n    img = (img - cv::Scalar(mean[0], mean[1], mean[2])) / cv::Scalar(std[0], std[1], std[2]);\n    std::vector<float> chw(static_cast<std::size_t>(n) * c * h * w, 0.f);\n\n    // fill all batch with the same input image\n    for (int i = 0; i < n; ++i) {\n        for (int y = 0; y < h; ++y) {\n            for (int x = 0; x < w; ++x) {\n                const cv::Vec3f v = img.at<cv::Vec3f>(y, x);\n                chw[i * size + 0 * h * w + y * w + x] = v[0];\n                chw[i * size + 1 * h * w + y * w + x] = v[1];\n                chw[i * size + 2 * h * w + y * w + x] = v[2];\n     
       }\n        }\n    }\n    return chw;\n}\n\nstatic auto topk(const std::vector<float>& v, int k) -> std::vector<std::pair<int, float>> {\n    if (k <= 0)\n        return {};\n    // clamp k to the vector size so the partial_sort middle iterator stays in range\n    auto stride = std::min<std::ptrdiff_t>(k, static_cast<int64_t>(v.size()));\n\n    std::vector<int> idx(v.size());\n    std::iota(idx.begin(), idx.end(), 0);\n\n    std::partial_sort(idx.begin(), idx.begin() + stride, idx.end(), [&](int a, int b) { return v[a] > v[b]; });\n\n    std::vector<std::pair<int, float>> out;\n    out.reserve(stride);\n    for (auto i = 0; i < stride; ++i)\n        out.emplace_back(idx[i], v[idx[i]]);\n    return out;\n}\n\nstatic std::map<int, std::string> loadImagenetLabelMap(const std::string& path) {\n    std::map<int, std::string> labels;\n    std::ifstream in(path);\n    if (!in.is_open()) {\n        return labels;\n    }\n    std::string line;\n    while (std::getline(in, line)) {\n        auto colon = line.find(':');\n        if (colon == std::string::npos) {\n            continue;\n        }\n        auto first_quote = line.find('\\'', colon);\n        if (first_quote == std::string::npos) {\n            continue;\n        }\n        auto second_quote = line.find('\\'', first_quote + 1);\n        if (second_quote == std::string::npos) {\n            continue;\n        }\n        int idx = std::stoi(line.substr(0, colon));\n        labels[idx] = line.substr(first_quote + 1, second_quote - first_quote - 1);\n    }\n    return labels;\n}\n\nstatic ILayer* addTransformLayer(INetworkDefinition* network, ITensor& input, bool bgr2rgb,\n                                 const std::array<const float, 3>& mean, const std::array<const float, 3>& std) {\n    struct ScaleParams {\n        std::array<float, 3> shift;\n        std::array<float, 3> scale;\n    };\n    static std::vector<std::unique_ptr<ScaleParams>> gScaleParams;\n    auto params = std::make_unique<ScaleParams>();\n    params->shift = {-mean[0] / std[0], -mean[1] / std[1], -mean[2] / std[2]};\n    
params->scale = {1.f / (std[0] * 255.f), 1.f / (std[1] * 255.f), 1.f / (std[2] * 255.f)};\n\n    static const Weights empty{DataType::kFLOAT, nullptr, 0ll};\n    const Weights shift{DataType::kFLOAT, params->shift.data(), 3ll};\n    const Weights scale{DataType::kFLOAT, params->scale.data(), 3ll};\n\n    gScaleParams.emplace_back(std::move(params));\n\n    ITensor* in = &input;\n    if (input.getType() != DataType::kFLOAT) {\n#if TRT_VERSION >= 8000\n        auto* cast = network->addCast(input, DataType::kFLOAT);\n        assert(cast);\n        cast->setName(\"Cast to FP32\");\n        in = cast->getOutput(0);\n#else\n        auto* identity = network->addIdentity(input);\n        assert(identity);\n        identity->setName(\"Convert to FP32\");\n        identity->setOutputType(0, DataType::kFLOAT);\n        in = identity->getOutput(0);\n#endif\n    }\n    // Convert from NHWC to NCHW\n    auto* perm = network->addShuffle(*in);\n    assert(perm);\n    perm->setName(\"NHWC -> NCHW\");\n    perm->setFirstTranspose(Permutation{0, 3, 1, 2});\n\n    // Convert from BGR to RGB (optional)\n    ITensor* data{nullptr};\n    if (bgr2rgb) {\n        auto add_slice = [&](int c, const char* name) -> ITensor* {\n            auto dims = perm->getOutput(0)->getDimensions();\n            Dims4 start = {0, c, 0, 0}, stride = {1, 1, 1, 1};\n            Dims4 size = {dims.d[0], 1, dims.d[2], dims.d[3]};\n            auto* _slice = network->addSlice(*perm->getOutput(0), start, size, stride);\n            // check the layer before dereferencing it\n            assert(_slice && _slice->getNbOutputs() == 1);\n            _slice->setName(name);\n            return _slice->getOutput(0);\n        };\n        std::array<ITensor*, 3> channels = {add_slice(2, \"R\"), add_slice(1, \"G\"), add_slice(0, \"B\")};\n        auto* cat = network->addConcatenation(channels.data(), 3);\n        assert(cat);\n        cat->setName(\"RGB\");\n        cat->setAxis(1);\n        data = cat->getOutput(0);\n    } else {\n        data = perm->getOutput(0);\n    }\n\n  
  // Normalize\n    auto* trans = network->addScale(*data, ScaleMode::kCHANNEL, shift, scale, empty);\n    assert(trans);\n    trans->setName(\"mean & std\");\n#if TRT_VERSION >= 8000\n    trans->setChannelAxis(1);\n#endif\n    return trans;\n}\n\nstatic size_t getSize(DataType dt) {\n    switch (dt) {\n#if TRT_VERSION >= 8510\n        case DataType::kUINT8:\n#endif\n        case DataType::kINT8:\n            return sizeof(int8_t);\n        case DataType::kFLOAT:\n            return sizeof(float);\n        case DataType::kHALF:\n            return sizeof(int16_t);\n        case DataType::kINT32:\n            return sizeof(int32_t);\n        default: {\n            std::cerr << \"Unsupported data type\\n\";\n            std::abort();\n        }\n    }\n}\n"
  },
  {
    "path": "arcface/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(arcface)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\nif (CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n    message(\"embed_platform on\")\n    include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n    link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n    message(\"embed_platform off\")\n    include_directories(/usr/local/cuda/include)\n    link_directories(/usr/local/cuda/lib64)\nendif()\n\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\ncuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/prelu.cu)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(arcface-r50 ${PROJECT_SOURCE_DIR}/arcface-r50.cpp)\ntarget_link_libraries(arcface-r50 nvinfer)\ntarget_link_libraries(arcface-r50 cudart)\ntarget_link_libraries(arcface-r50 myplugins)\ntarget_link_libraries(arcface-r50 ${OpenCV_LIBS})\n\nadd_executable(arcface-mobilefacenet ${PROJECT_SOURCE_DIR}/arcface-mobilefacenet.cpp)\ntarget_link_libraries(arcface-mobilefacenet nvinfer)\ntarget_link_libraries(arcface-mobilefacenet cudart)\ntarget_link_libraries(arcface-mobilefacenet myplugins)\ntarget_link_libraries(arcface-mobilefacenet ${OpenCV_LIBS})\n\nadd_executable(arcface-r100 ${PROJECT_SOURCE_DIR}/arcface-r100.cpp)\ntarget_link_libraries(arcface-r100 nvinfer)\ntarget_link_libraries(arcface-r100 cudart)\ntarget_link_libraries(arcface-r100 myplugins)\ntarget_link_libraries(arcface-r100 ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "arcface/README.md",
    "content": "# arcface\n### TensortRT 8\n\nThe mxnet implementation is from [deepinsight/insightface.](https://github.com/deepinsight/insightface)\n\n**Updated Pretrained Weights:** ArcFace-R100 [Insight Face Google Drive](https://drive.google.com/file/d/1Hc5zUfBATaXUgcU2haUNa7dcaZSw95h2/view)\n\n---\n\n**Previous Pre-trained models:** The pretrained models are from [LResNet50E-IR,ArcFace@ms1m-refine-v1](https://github.com/deepinsight/insightface/wiki/Model-Zoo#32-lresnet50e-irarcfacems1m-refine-v1), [LResNet100E-IR,ArcFace@ms1m-refine-v2](https://github.com/deepinsight/insightface/wiki/Model-Zoo#31-lresnet100e-irarcfacems1m-refine-v2) and [MobileFaceNet,ArcFace@ms1m-refine-v1](https://github.com/deepinsight/insightface/wiki/Model-Zoo#34-mobilefacenetarcfacems1m-refine-v1)\n\n---\n\nThe two input images used in this project are joey0.ppm and joey1.ppm, download them from [Google Drive.](https://drive.google.com/drive/folders/1ctqpkRCRKyBZRCNwo9Uq4eUoMRLtFq1e). The input image is 112x112, and generated from `get_input()` in `insightface/deploy/face_model.py`, which is cropped and aligned face image.\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/83122953-f45f8d80-a106-11ea-84b0-4f6ff91b5924.jpg\">\n</p>\n\n## Config\n\n- FP16/FP32 can be selected by the macro `USE_FP16` in arcface-r50/r100/mobilefacenet.cpp\n- GPU id can be selected by the macro `DEVICE` in arcface-r50/r100/mobilefacenet.cpp\n\n## Run\n\n1.Generate .wts file from mxnet implementation of pretrained model. 
The following example describes how to generate arcface-r100.wts from the mxnet implementation of LResNet100E-IR,ArcFace@ms1m-refine-v2.\n```\ngit clone https://github.com/deepinsight/insightface\ncd insightface\ngit checkout 3866cd77a6896c934b51ed39e9651b791d78bb57\ncd deploy\n// copy tensorrtx/arcface/gen_wts.py to here(insightface/deploy)\n// download model-r100-ii.zip and unzip here(insightface/deploy)\npython gen_wts.py\n// a file 'arcface-r100.wts' will be generated.\n// the master branch of insightface should work, if not, you can checkout 94ad870abb3203d6f31b049b70dd080dc8f33fca\n// arcface-r50.wts/arcface-mobilefacenet.wts can be generated in a similar way from the mxnet implementation of the LResNet50E-IR,ArcFace@ms1m-refine-v1/MobileFaceNet,ArcFace@ms1m-refine-v1 pretrained model.\n\n```\n2.Put the .wts file into tensorrtx/arcface, then build and run\n\n```\ncd tensorrtx/arcface\n// download joey0.ppm and joey1.ppm, and put here(tensorrtx/arcface)\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./arcface-r100 -s    // serialize model to plan file i.e. 'arcface-r100.engine'\nsudo ./arcface-r100 -d    // deserialize plan file and run inference\n\nor\n\nsudo ./arcface-r50 -s   // serialize model to plan file i.e. 'arcface-r50.engine'\nsudo ./arcface-r50 -d   // deserialize plan file and run inference\n\n\nor\n\nsudo ./arcface-mobilefacenet -s   // serialize model to plan file i.e. 'arcface-mobilefacenet.engine'\nsudo ./arcface-mobilefacenet -d   // deserialize plan file and run inference\n```\n\n3.Check the output log for latency and the similarity score.\n\n## More Information\n\nSee the readme in [home page.](https://github.com/wang-xinyu/tensorrtx)\n"
  },
  {
    "path": "arcface/arcface-mobilefacenet.cpp",
    "content": "#include <fstream>\r\n#include <iostream>\r\n#include <map>\r\n#include <sstream>\r\n#include <vector>\r\n#include <chrono>\r\n#include <opencv2/opencv.hpp>\r\n#include <dirent.h>\r\n#include \"NvInfer.h\"\r\n#include \"cuda_runtime_api.h\"\r\n#include \"logging.h\"\r\n\r\n#define CHECK(status) \\\r\n    do\\\r\n    {\\\r\n        auto ret = (status);\\\r\n        if (ret != 0)\\\r\n        {\\\r\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\r\n            abort();\\\r\n        }\\\r\n    } while (0)\r\n\r\n//#define USE_FP16  // comment out this if want to use FP32\r\n#define DEVICE 0  // GPU id\r\n#define BATCH_SIZE 1  // currently, only support BATCH=1\r\n\r\nusing namespace nvinfer1;\r\n\r\n// stuff we know about the network and the input/output blobs\r\nstatic const int INPUT_H = 112;\r\nstatic const int INPUT_W = 112;\r\nstatic const int OUTPUT_SIZE = 128;\r\nconst char* INPUT_BLOB_NAME = \"data\";\r\nconst char* OUTPUT_BLOB_NAME = \"prob\";\r\nstatic Logger gLogger;\r\n\r\n// TensorRT weight files have a simple space delimited format:\r\n// [type] [size] <data x size in hex>\r\nstd::map<std::string, Weights> loadWeights(const std::string file) {\r\n    std::cout << \"Loading weights: \" << file << std::endl;\r\n    std::map<std::string, Weights> weightMap;\r\n\r\n    // Open weights file\r\n    std::ifstream input(file);\r\n    assert(input.is_open() && \"Unable to load weight file.\");\r\n\r\n    // Read number of weight blobs\r\n    int32_t count;\r\n    input >> count;\r\n    assert(count > 0 && \"Invalid weight map file.\");\r\n\r\n    while (count--)\r\n    {\r\n        Weights wt{DataType::kFLOAT, nullptr, 0};\r\n        uint32_t size;\r\n\r\n        // Read name and type of blob\r\n        std::string name;\r\n        input >> name >> std::dec >> size;\r\n        wt.type = DataType::kFLOAT;\r\n\r\n        // Load blob\r\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\r\n        
for (uint32_t x = 0, y = size; x < y; ++x)\r\n        {\r\n            input >> std::hex >> val[x];\r\n        }\r\n        wt.values = val;\r\n\r\n        wt.count = size;\r\n        weightMap[name] = wt;\r\n    }\r\n\r\n    return weightMap;\r\n}\r\n\r\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\r\n    float *gamma = (float*)weightMap[lname + \"_gamma\"].values;\r\n    float *beta = (float*)weightMap[lname + \"_beta\"].values;\r\n    float *mean = (float*)weightMap[lname + \"_moving_mean\"].values;\r\n    float *var = (float*)weightMap[lname + \"_moving_var\"].values;\r\n    int len = weightMap[lname + \"_moving_var\"].count;\r\n\r\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights scale{DataType::kFLOAT, scval, len};\r\n\r\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights shift{DataType::kFLOAT, shval, len};\r\n\r\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        pval[i] = 1.0;\r\n    }\r\n    Weights power{DataType::kFLOAT, pval, len};\r\n\r\n    weightMap[lname + \".scale\"] = scale;\r\n    weightMap[lname + \".shift\"] = shift;\r\n    weightMap[lname + \".power\"] = power;\r\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\r\n    assert(scale_1);\r\n    return scale_1;\r\n}\r\n\r\nILayer* addPRelu(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname) {\r\n\tfloat *gamma = (float*)weightMap[lname + \"_gamma\"].values;\r\n\tint len = weightMap[lname + \"_gamma\"].count;\r\n\r\n\tfloat 
*scval_1 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n\tfloat *scval_2 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n\tfor (int i = 0; i < len; i++) {\r\n\t\tscval_1[i] = -1.0;\r\n\t\tscval_2[i] = -gamma[i];\r\n\t}\r\n\tWeights scale_1{ DataType::kFLOAT, scval_1, len };\r\n\tWeights scale_2{ DataType::kFLOAT, scval_2, len };\r\n\r\n\tfloat *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n\tfor (int i = 0; i < len; i++) {\r\n\t\tshval[i] = 0.0;\r\n\t}\r\n\tWeights shift{ DataType::kFLOAT, shval, len };\r\n\r\n\tfloat *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n\tfor (int i = 0; i < len; i++) {\r\n\t\tpval[i] = 1.0;\r\n\t}\r\n\tWeights power{ DataType::kFLOAT, pval, len };\r\n\r\n\tauto relu1 = network->addActivation(input, ActivationType::kRELU);\r\n\tassert(relu1);\r\n\tIScaleLayer* scale1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale_1, power);\r\n\tassert(scale1);\r\n\tauto relu2 = network->addActivation(*scale1->getOutput(0), ActivationType::kRELU);\r\n\tassert(relu2);\r\n\tIScaleLayer* scale2 = network->addScale(*relu2->getOutput(0), ScaleMode::kCHANNEL, shift, scale_2, power);\r\n\tassert(scale2);\r\n\tIElementWiseLayer* ew1 = network->addElementWise(*relu1->getOutput(0), *scale2->getOutput(0), ElementWiseOperation::kSUM);\r\n\tassert(ew1);\r\n\treturn ew1;\r\n}\r\n\r\nILayer* conv_bn_relu(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, int oup, int k = 3, int p = 1, int s = 2, int groups=1) {\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, oup, DimsHW{k, k}, weightMap[lname + \"_conv2d_weight\"], emptywts);\r\n    assert(conv1);\r\n    conv1->setStrideNd(DimsHW{s, s});\r\n    conv1->setPaddingNd(DimsHW{p, p});\r\n    conv1->setNbGroups(groups);\r\n    auto bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"_batchnorm\", 
1e-3);\r\n    assert(bn1);\r\n    auto act1 = addPRelu(network, weightMap, *bn1->getOutput(0), lname + \"_relu\");\r\n    assert(act1);\r\n    return act1;\r\n}\r\n\r\nILayer* conv_bn(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, int oup, int k = 3, int p = 1, int s = 1, int groups=1) {\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, oup, DimsHW{k, k}, weightMap[lname + \"_conv2d_weight\"], emptywts);\r\n    assert(conv1);\r\n    conv1->setStrideNd(DimsHW{s, s});\r\n    conv1->setPaddingNd(DimsHW{p, p});\r\n    conv1->setNbGroups(groups);\r\n    auto bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"_batchnorm\", 1e-3);\r\n    assert(bn1);\r\n    return bn1;\r\n}\r\n\r\nILayer* DepthWise(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, int inp, int oup, int groups, int s) {\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, groups, DimsHW{1, 1}, weightMap[lname + \"_conv_sep_conv2d_weight\"], emptywts);\r\n    assert(conv1);\r\n    conv1->setStrideNd(DimsHW{1, 1});\r\n    conv1->setPaddingNd(DimsHW{0, 0});\r\n    conv1->setNbGroups(1);\r\n    auto bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"_conv_sep_batchnorm\", 1e-3);\r\n    assert(bn1);\r\n    auto act1 = addPRelu(network, weightMap, *bn1->getOutput(0), lname + \"_conv_sep_relu\");\r\n    assert(act1);\r\n\r\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*act1->getOutput(0), groups, DimsHW{3, 3}, weightMap[lname + \"_conv_dw_conv2d_weight\"], emptywts);\r\n    assert(conv2);\r\n    conv2->setStrideNd(DimsHW{s, s});\r\n    conv2->setPaddingNd(DimsHW{1, 1});\r\n    conv2->setNbGroups(groups);\r\n    auto bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + 
\"_conv_dw_batchnorm\", 1e-3);\r\n    assert(bn2);\r\n    auto act2 = addPRelu(network, weightMap, *bn2->getOutput(0), lname + \"_conv_dw_relu\");\r\n    assert(act2);\r\n\r\n    IConvolutionLayer* conv3 = network->addConvolutionNd(*act2->getOutput(0), oup, DimsHW{1, 1}, weightMap[lname + \"_conv_proj_conv2d_weight\"], emptywts);\r\n    assert(conv3);\r\n    conv3->setStrideNd(DimsHW{1, 1});\r\n    conv3->setPaddingNd(DimsHW{0, 0});\r\n    conv3->setNbGroups(1);\r\n    auto bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"_conv_proj_batchnorm\", 1e-3);\r\n    assert(bn3);\r\n    return bn3;\r\n}\r\n\r\n\r\nILayer* DWResidual(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, int inp, int oup, int groups, int s) {\r\n\r\n    auto dw1 = DepthWise(network, weightMap, input, lname, inp, oup, groups, s);\r\n    IElementWiseLayer* ew1;\r\n    ew1 = network->addElementWise(input, *dw1->getOutput(0), ElementWiseOperation::kSUM);\r\n    assert(ew1);\r\n    return ew1;\r\n}\r\n\r\n\r\n// Creat the engine using only the API and not any parser.\r\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\r\n    INetworkDefinition* network = builder->createNetworkV2(0U);\r\n\r\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\r\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\r\n    assert(data);\r\n\r\n    std::map<std::string, Weights> weightMap = loadWeights(\"../arcface-mobilefacenet.wts\");\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n\r\n    auto conv_1 = conv_bn_relu(network, weightMap, *data, \"conv_1\", 64, 3, 1, 2);\r\n    auto conv_2_dw = conv_bn_relu(network, weightMap, *conv_1->getOutput(0), \"conv_2_dw\", 64, 3, 1, 1, 64);\r\n    auto conv_23 = DepthWise(network, weightMap, *conv_2_dw->getOutput(0), \"dconv_23\", 64, 64, 128, 2);\r\n    auto 
res_3_block0 = DWResidual(network, weightMap, *conv_23->getOutput(0), \"res_3_block0\", 64, 64, 128, 1);\r\n    auto res_3_block1 = DWResidual(network, weightMap, *res_3_block0->getOutput(0), \"res_3_block1\", 64, 64, 128, 1);\r\n    auto res_3_block2 = DWResidual(network, weightMap, *res_3_block1->getOutput(0), \"res_3_block2\", 64, 64, 128, 1);\r\n    auto res_3_block3 = DWResidual(network, weightMap, *res_3_block2->getOutput(0), \"res_3_block3\", 64, 64, 128, 1);\r\n    auto conv_34 = DepthWise(network, weightMap, *res_3_block3->getOutput(0), \"dconv_34\", 64, 128, 256, 2);\r\n    auto res_4_block0 = DWResidual(network, weightMap, *conv_34->getOutput(0), \"res_4_block0\", 128, 128, 256, 1);\r\n    auto res_4_block1 = DWResidual(network, weightMap, *res_4_block0->getOutput(0), \"res_4_block1\", 128, 128, 256, 1);\r\n    auto res_4_block2 = DWResidual(network, weightMap, *res_4_block1->getOutput(0), \"res_4_block2\", 128, 128, 256, 1);\r\n    auto res_4_block3 = DWResidual(network, weightMap, *res_4_block2->getOutput(0), \"res_4_block3\", 128, 128, 256, 1);\r\n    auto res_4_block4 = DWResidual(network, weightMap, *res_4_block3->getOutput(0), \"res_4_block4\", 128, 128, 256, 1);\r\n    auto res_4_block5 = DWResidual(network, weightMap, *res_4_block4->getOutput(0), \"res_4_block5\", 128, 128, 256, 1);\r\n    auto conv_45 = DepthWise(network, weightMap, *res_4_block5->getOutput(0), \"dconv_45\", 128, 128, 512, 2);\r\n    auto res_5_block0 = DWResidual(network, weightMap, *conv_45->getOutput(0), \"res_5_block0\", 128, 128, 256, 1);\r\n    auto res_5_block1 = DWResidual(network, weightMap, *res_5_block0->getOutput(0), \"res_5_block1\", 128, 128, 256, 1);\r\n    auto conv_6_sep = conv_bn_relu(network, weightMap, *res_5_block1->getOutput(0), \"conv_6sep\", 512, 1, 0, 1);\r\n    auto conv_6dw7_7 = conv_bn(network, weightMap, *conv_6_sep->getOutput(0), \"conv_6dw7_7\", 512, 7, 0, 1, 512);\r\n    IFullyConnectedLayer* fc1 = 
network->addFullyConnected(*conv_6dw7_7->getOutput(0), 128, weightMap[\"fc1_weight\"], weightMap[\"pre_fc1_bias\"]);\r\n    assert(fc1);\r\n    auto bn1 = addBatchNorm2d(network, weightMap, *fc1->getOutput(0), \"fc1\", 2e-5);\r\n    assert(bn1);\r\n    bn1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\r\n    network->markOutput(*bn1->getOutput(0));\r\n\r\n    // Build engine\r\n    builder->setMaxBatchSize(maxBatchSize);\r\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\r\n#ifdef USE_FP16\r\n    config->setFlag(BuilderFlag::kFP16);\r\n#endif\r\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\r\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\r\n    std::cout << \"Build engine successfully!\" << std::endl;\r\n\r\n    // Don't need the network any more\r\n    network->destroy();\r\n\r\n    // Release host memory\r\n    for (auto& mem : weightMap)\r\n    {\r\n        free((void*) (mem.second.values));\r\n    }\r\n\r\n    return engine;\r\n}\r\n\r\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\r\n    // Create builder\r\n    IBuilder* builder = createInferBuilder(gLogger);\r\n    IBuilderConfig* config = builder->createBuilderConfig();\r\n\r\n    // Create model to populate the network, then set the outputs and create an engine\r\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\r\n    assert(engine != nullptr);\r\n\r\n    // Serialize the engine\r\n    (*modelStream) = engine->serialize();\r\n\r\n    // Close everything down\r\n    engine->destroy();\r\n    config->destroy();\r\n    builder->destroy();\r\n}\r\n\r\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\r\n    const ICudaEngine& engine = context.getEngine();\r\n\r\n    // Pointers to input and output device buffers to pass to engine.\r\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\r\n    assert(engine.getNbBindings() == 2);\r\n    
void* buffers[2];\r\n\r\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\r\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\r\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\r\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\r\n\r\n    // Create GPU buffers on device\r\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\r\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\r\n\r\n    // Create stream\r\n    cudaStream_t stream;\r\n    CHECK(cudaStreamCreate(&stream));\r\n\r\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\r\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\r\n    context.enqueue(batchSize, buffers, stream, nullptr);\r\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\r\n    cudaStreamSynchronize(stream);\r\n\r\n    // Release stream and buffers\r\n    cudaStreamDestroy(stream);\r\n    CHECK(cudaFree(buffers[inputIndex]));\r\n    CHECK(cudaFree(buffers[outputIndex]));\r\n}\r\n\r\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\r\n    DIR *p_dir = opendir(p_dir_name);\r\n    if (p_dir == nullptr) {\r\n        return -1;\r\n    }\r\n\r\n    struct dirent* p_file = nullptr;\r\n    while ((p_file = readdir(p_dir)) != nullptr) {\r\n        if (strcmp(p_file->d_name, \".\") != 0 &&\r\n                strcmp(p_file->d_name, \"..\") != 0) {\r\n            //std::string cur_file_name(p_dir_name);\r\n            //cur_file_name += \"/\";\r\n            //cur_file_name += p_file->d_name;\r\n            std::string cur_file_name(p_file->d_name);\r\n            file_names.push_back(cur_file_name);\r\n  
      }\r\n    }\r\n\r\n    closedir(p_dir);\r\n    return 0;\r\n}\r\n\r\nint main(int argc, char** argv) {\r\n    cudaSetDevice(DEVICE);\r\n    // create a model using the API directly and serialize it to a stream\r\n    char *trtModelStream{nullptr};\r\n    size_t size{0};\r\n\r\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\r\n        IHostMemory* modelStream{nullptr};\r\n        APIToModel(BATCH_SIZE, &modelStream);\r\n        assert(modelStream != nullptr);\r\n        std::ofstream p(\"arcface-mobilefacenet.engine\", std::ios::binary);\r\n        if (!p) {\r\n            std::cerr << \"could not open plan output file\" << std::endl;\r\n            return -1;\r\n        }\r\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\r\n        modelStream->destroy();\r\n        return 0;\r\n    } else if (argc == 2 && std::string(argv[1]) == \"-d\") {\r\n        std::ifstream file(\"arcface-mobilefacenet.engine\", std::ios::binary);\r\n        if (file.good()) {\r\n            file.seekg(0, file.end);\r\n            size = file.tellg();\r\n            file.seekg(0, file.beg);\r\n            trtModelStream = new char[size];\r\n            assert(trtModelStream);\r\n            file.read(trtModelStream, size);\r\n            file.close();\r\n        } else {\r\n            // without the plan file there is nothing to deserialize\r\n            std::cerr << \"could not open plan file 'arcface-mobilefacenet.engine', run with -s first\" << std::endl;\r\n            return -1;\r\n        }\r\n    } else {\r\n        std::cerr << \"arguments not right!\" << std::endl;\r\n        std::cerr << \"./arcface-mobilefacenet -s  // serialize model to plan file\" << std::endl;\r\n        std::cerr << \"./arcface-mobilefacenet -d  // deserialize plan file and run inference\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // prepare input data ---------------------------\r\n    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\r\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\r\n    //    data[i] = 1.0;\r\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\r\n    IRuntime* runtime = createInferRuntime(gLogger);\r\n    assert(runtime != nullptr);\r\n    
ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\r\n    assert(engine != nullptr);\r\n    IExecutionContext* context = engine->createExecutionContext();\r\n    assert(context != nullptr);\r\n    delete[] trtModelStream;\r\n\r\n    cv::Mat img = cv::imread(\"../joey0.ppm\");\r\n    for (int i = 0; i < INPUT_H * INPUT_W; i++) {\r\n        data[i] = ((float)img.at<cv::Vec3b>(i)[2] - 127.5) * 0.0078125;\r\n        data[i + INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[1] - 127.5) * 0.0078125;\r\n        data[i + 2 * INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[0] - 127.5) * 0.0078125;\r\n    }\r\n\r\n    // Run inference\r\n    auto start = std::chrono::system_clock::now();\r\n    doInference(*context, data, prob, BATCH_SIZE);\r\n    auto end = std::chrono::system_clock::now();\r\n    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\r\n\r\n    cv::Mat out(128, 1, CV_32FC1, prob);\r\n    cv::Mat out_norm;\r\n    cv::normalize(out, out_norm);\r\n\r\n    img = cv::imread(\"../joey1.ppm\");\r\n    for (int i = 0; i < INPUT_H * INPUT_W; i++) {\r\n        data[i] = ((float)img.at<cv::Vec3b>(i)[2] - 127.5) * 0.0078125;\r\n        data[i + INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[1] - 127.5) * 0.0078125;\r\n        data[i + 2 * INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[0] - 127.5) * 0.0078125;\r\n    }\r\n\r\n    // Run inference\r\n    start = std::chrono::system_clock::now();\r\n    doInference(*context, data, prob, BATCH_SIZE);\r\n    end = std::chrono::system_clock::now();\r\n    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\r\n\r\n    cv::Mat out1(1, 128, CV_32FC1, prob);\r\n    cv::Mat out_norm1;\r\n    cv::normalize(out1, out_norm1);\r\n\r\n    cv::Mat res = out_norm1 * out_norm;\r\n\r\n    std::cout << \"similarity score: \" << *(float*)res.data << std::endl;\r\n\r\n    // Destroy the 
engine\r\n    context->destroy();\r\n    engine->destroy();\r\n    runtime->destroy();\r\n\r\n    //Print histogram of the output distribution\r\n    //std::cout << \"\\nOutput:\\n\\n\";\r\n    //for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\r\n    //{\r\n    //    std::cout << p_out_norm[i] << \", \";\r\n    //    if (i % 10 == 0) std::cout << i / 10 << std::endl;\r\n    //}\r\n    //std::cout << std::endl;\r\n\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "arcface/arcface-r100.cpp",
    "content": "#include <fstream>\r\n#include <iostream>\r\n#include <map>\r\n#include <sstream>\r\n#include <vector>\r\n#include <chrono>\r\n#include <opencv2/opencv.hpp>\r\n#include <dirent.h>\r\n#include \"NvInfer.h\"\r\n#include \"cuda_runtime_api.h\"\r\n#include \"logging.h\"\r\n\r\n#define CHECK(status) \\\r\n    do\\\r\n    {\\\r\n        auto ret = (status);\\\r\n        if (ret != 0)\\\r\n        {\\\r\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\r\n            abort();\\\r\n        }\\\r\n    } while (0)\r\n\r\n//#define USE_FP16  // comment out this if want to use FP32\r\n#define DEVICE 0  // GPU id\r\n#define BATCH_SIZE 1  // currently, only support BATCH=1\r\n\r\nusing namespace nvinfer1;\r\n\r\n// stuff we know about the network and the input/output blobs\r\nstatic const int INPUT_H = 112;\r\nstatic const int INPUT_W = 112;\r\nstatic const int OUTPUT_SIZE = 512;\r\nconst char* INPUT_BLOB_NAME = \"data\";\r\nconst char* OUTPUT_BLOB_NAME = \"prob\";\r\nstatic Logger gLogger;\r\n\r\n// TensorRT weight files have a simple space delimited format:\r\n// [type] [size] <data x size in hex>\r\nstd::map<std::string, Weights> loadWeights(const std::string file) {\r\n    std::cout << \"Loading weights: \" << file << std::endl;\r\n    std::map<std::string, Weights> weightMap;\r\n\r\n    // Open weights file\r\n    std::ifstream input(file);\r\n    assert(input.is_open() && \"Unable to load weight file.\");\r\n\r\n    // Read number of weight blobs\r\n    int32_t count;\r\n    input >> count;\r\n    assert(count > 0 && \"Invalid weight map file.\");\r\n\r\n    while (count--)\r\n    {\r\n        Weights wt{DataType::kFLOAT, nullptr, 0};\r\n        uint32_t size;\r\n\r\n        // Read name and type of blob\r\n        std::string name;\r\n        input >> name >> std::dec >> size;\r\n        wt.type = DataType::kFLOAT;\r\n\r\n        // Load blob\r\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\r\n        
for (uint32_t x = 0, y = size; x < y; ++x)\r\n        {\r\n            input >> std::hex >> val[x];\r\n        }\r\n        wt.values = val;\r\n\r\n        wt.count = size;\r\n        weightMap[name] = wt;\r\n    }\r\n\r\n    return weightMap;\r\n}\r\n\r\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\r\n    float *gamma = (float*)weightMap[lname + \"_gamma\"].values;\r\n    float *beta = (float*)weightMap[lname + \"_beta\"].values;\r\n    float *mean = (float*)weightMap[lname + \"_moving_mean\"].values;\r\n    float *var = (float*)weightMap[lname + \"_moving_var\"].values;\r\n    int len = weightMap[lname + \"_moving_var\"].count;\r\n\r\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights scale{DataType::kFLOAT, scval, len};\r\n\r\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights shift{DataType::kFLOAT, shval, len};\r\n\r\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        pval[i] = 1.0;\r\n    }\r\n    Weights power{DataType::kFLOAT, pval, len};\r\n\r\n    weightMap[lname + \".scale\"] = scale;\r\n    weightMap[lname + \".shift\"] = shift;\r\n    weightMap[lname + \".power\"] = power;\r\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\r\n    assert(scale_1);\r\n    return scale_1;\r\n}\r\n\r\nILayer* addPRelu(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname) {\r\n\tfloat *gamma = (float*)weightMap[lname + \"_gamma\"].values;\r\n\tint len = weightMap[lname + \"_gamma\"].count;\r\n\r\n\tfloat 
*scval_1 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n\tfloat *scval_2 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n\tfor (int i = 0; i < len; i++) {\r\n\t\tscval_1[i] = -1.0;\r\n\t\tscval_2[i] = -gamma[i];\r\n\t}\r\n\tWeights scale_1{ DataType::kFLOAT, scval_1, len };\r\n\tWeights scale_2{ DataType::kFLOAT, scval_2, len };\r\n\r\n\tfloat *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n\tfor (int i = 0; i < len; i++) {\r\n\t\tshval[i] = 0.0;\r\n\t}\r\n\tWeights shift{ DataType::kFLOAT, shval, len };\r\n\r\n\tfloat *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n\tfor (int i = 0; i < len; i++) {\r\n\t\tpval[i] = 1.0;\r\n\t}\r\n\tWeights power{ DataType::kFLOAT, pval, len };\r\n\r\n\tauto relu1 = network->addActivation(input, ActivationType::kRELU);\r\n\tassert(relu1);\r\n\tIScaleLayer* scale1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale_1, power);\r\n\tassert(scale1);\r\n\tauto relu2 = network->addActivation(*scale1->getOutput(0), ActivationType::kRELU);\r\n\tassert(relu2);\r\n\tIScaleLayer* scale2 = network->addScale(*relu2->getOutput(0), ScaleMode::kCHANNEL, shift, scale_2, power);\r\n\tassert(scale2);\r\n\tIElementWiseLayer* ew1 = network->addElementWise(*relu1->getOutput(0), *scale2->getOutput(0), ElementWiseOperation::kSUM);\r\n\tassert(ew1);\r\n\treturn ew1;\r\n}\r\n\r\nILayer* resUnit(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int num_filters, int s, bool dim_match, std::string lname) {\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n    auto bn1 = addBatchNorm2d(network, weightMap, input, lname + \"_bn1\", 2e-5);\r\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*bn1->getOutput(0), num_filters, DimsHW{3, 3}, weightMap[lname + \"_conv1_weight\"], emptywts);\r\n    assert(conv1);\r\n    conv1->setPaddingNd(DimsHW{1, 1});\r\n    auto bn2 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"_bn2\", 
2e-5);\r\n    auto act1 = addPRelu(network, weightMap, *bn2->getOutput(0), lname + \"_relu1\");\r\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*act1->getOutput(0), num_filters, DimsHW{3, 3}, weightMap[lname + \"_conv2_weight\"], emptywts);\r\n    assert(conv2);\r\n    conv2->setStrideNd(DimsHW{s, s});\r\n    conv2->setPaddingNd(DimsHW{1, 1});\r\n    auto bn3 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"_bn3\", 2e-5);\r\n\r\n    IElementWiseLayer* ew1;\r\n    if (dim_match) {\r\n        ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\r\n    } else {\r\n        IConvolutionLayer* conv1sc = network->addConvolutionNd(input, num_filters, DimsHW{1, 1}, weightMap[lname + \"_conv1sc_weight\"], emptywts);\r\n        assert(conv1sc);\r\n        conv1sc->setStrideNd(DimsHW{s, s});\r\n        auto bn1sc = addBatchNorm2d(network, weightMap, *conv1sc->getOutput(0), lname + \"_sc\", 2e-5);\r\n        ew1 = network->addElementWise(*bn1sc->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\r\n    }\r\n    assert(ew1);\r\n    return ew1;\r\n}\r\n\r\n// Create the engine using only the API, without any parser.\r\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\r\n    INetworkDefinition* network = builder->createNetworkV2(0U);\r\n\r\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\r\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\r\n    assert(data);\r\n\r\n    std::map<std::string, Weights> weightMap = loadWeights(\"../arcface-r100.wts\");\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n\r\n    IConvolutionLayer* conv0 = network->addConvolutionNd(*data, 64, DimsHW{3, 3}, weightMap[\"conv0_weight\"], emptywts);\r\n    assert(conv0);\r\n    conv0->setPaddingNd(DimsHW{1, 1});\r\n    auto bn0 = addBatchNorm2d(network, weightMap, *conv0->getOutput(0), 
\"bn0\", 2e-5);\r\n    auto relu0 = addPRelu(network, weightMap, *bn0->getOutput(0), \"relu0\");\r\n\r\n    auto s1u1 = resUnit(network, weightMap, *relu0->getOutput(0), 64, 2, false, \"stage1_unit1\");\r\n    auto s1u2 = resUnit(network, weightMap, *s1u1->getOutput(0), 64, 1, true, \"stage1_unit2\");\r\n    auto s1u3 = resUnit(network, weightMap, *s1u2->getOutput(0), 64, 1, true, \"stage1_unit3\");\r\n\r\n    auto s2u1 = resUnit(network, weightMap, *s1u3->getOutput(0), 128, 2, false, \"stage2_unit1\");\r\n    auto s2u2 = resUnit(network, weightMap, *s2u1->getOutput(0), 128, 1, true, \"stage2_unit2\");\r\n    auto s2u3 = resUnit(network, weightMap, *s2u2->getOutput(0), 128, 1, true, \"stage2_unit3\");\r\n    auto s2u4 = resUnit(network, weightMap, *s2u3->getOutput(0), 128, 1, true, \"stage2_unit4\");\r\n\r\n\r\n    auto s2u5 = resUnit(network, weightMap, *s2u4->getOutput(0), 128, 1, true, \"stage2_unit5\");\r\n    auto s2u6 = resUnit(network, weightMap, *s2u5->getOutput(0), 128, 1, true, \"stage2_unit6\");\r\n    auto s2u7 = resUnit(network, weightMap, *s2u6->getOutput(0), 128, 1, true, \"stage2_unit7\");\r\n    auto s2u8 = resUnit(network, weightMap, *s2u7->getOutput(0), 128, 1, true, \"stage2_unit8\");\r\n\r\n    auto s2u9 = resUnit(network, weightMap, *s2u8->getOutput(0), 128, 1, true, \"stage2_unit9\");\r\n    auto s2u10 = resUnit(network, weightMap, *s2u9->getOutput(0), 128, 1, true, \"stage2_unit10\");\r\n    auto s2u11 = resUnit(network, weightMap, *s2u10->getOutput(0), 128, 1, true, \"stage2_unit11\");\r\n    auto s2u12 = resUnit(network, weightMap, *s2u11->getOutput(0), 128, 1, true, \"stage2_unit12\");\r\n    auto s2u13 = resUnit(network, weightMap, *s2u12->getOutput(0), 128, 1, true, \"stage2_unit13\");\r\n\r\n    auto s3u1 = resUnit(network, weightMap, *s2u13->getOutput(0), 256, 2, false, \"stage3_unit1\");\r\n    auto s3u2 = resUnit(network, weightMap, *s3u1->getOutput(0), 256, 1, true, \"stage3_unit2\");\r\n    auto s3u3 = resUnit(network, weightMap, 
*s3u2->getOutput(0), 256, 1, true, \"stage3_unit3\");\r\n    auto s3u4 = resUnit(network, weightMap, *s3u3->getOutput(0), 256, 1, true, \"stage3_unit4\");\r\n    auto s3u5 = resUnit(network, weightMap, *s3u4->getOutput(0), 256, 1, true, \"stage3_unit5\");\r\n    auto s3u6 = resUnit(network, weightMap, *s3u5->getOutput(0), 256, 1, true, \"stage3_unit6\");\r\n    auto s3u7 = resUnit(network, weightMap, *s3u6->getOutput(0), 256, 1, true, \"stage3_unit7\");\r\n    auto s3u8 = resUnit(network, weightMap, *s3u7->getOutput(0), 256, 1, true, \"stage3_unit8\");\r\n    auto s3u9 = resUnit(network, weightMap, *s3u8->getOutput(0), 256, 1, true, \"stage3_unit9\");\r\n    auto s3u10 = resUnit(network, weightMap, *s3u9->getOutput(0), 256, 1, true, \"stage3_unit10\");\r\n    auto s3u11 = resUnit(network, weightMap, *s3u10->getOutput(0), 256, 1, true, \"stage3_unit11\");\r\n    auto s3u12 = resUnit(network, weightMap, *s3u11->getOutput(0), 256, 1, true, \"stage3_unit12\");\r\n    auto s3u13 = resUnit(network, weightMap, *s3u12->getOutput(0), 256, 1, true, \"stage3_unit13\");\r\n    auto s3u14 = resUnit(network, weightMap, *s3u13->getOutput(0), 256, 1, true, \"stage3_unit14\");\r\n\r\n    auto s3u15 = resUnit(network, weightMap, *s3u14->getOutput(0), 256, 1, true, \"stage3_unit15\");\r\n    auto s3u16 = resUnit(network, weightMap, *s3u15->getOutput(0), 256, 1, true, \"stage3_unit16\");\r\n    auto s3u17 = resUnit(network, weightMap, *s3u16->getOutput(0), 256, 1, true, \"stage3_unit17\");\r\n    auto s3u18 = resUnit(network, weightMap, *s3u17->getOutput(0), 256, 1, true, \"stage3_unit18\");\r\n    auto s3u19 = resUnit(network, weightMap, *s3u18->getOutput(0), 256, 1, true, \"stage3_unit19\");\r\n    auto s3u20 = resUnit(network, weightMap, *s3u19->getOutput(0), 256, 1, true, \"stage3_unit20\");\r\n    auto s3u21 = resUnit(network, weightMap, *s3u20->getOutput(0), 256, 1, true, \"stage3_unit21\");\r\n    auto s3u22 = resUnit(network, weightMap, *s3u21->getOutput(0), 256, 1, true, 
\"stage3_unit22\");\r\n    auto s3u23 = resUnit(network, weightMap, *s3u22->getOutput(0), 256, 1, true, \"stage3_unit23\");\r\n    auto s3u24 = resUnit(network, weightMap, *s3u23->getOutput(0), 256, 1, true, \"stage3_unit24\");\r\n    auto s3u25 = resUnit(network, weightMap, *s3u24->getOutput(0), 256, 1, true, \"stage3_unit25\");\r\n    auto s3u26 = resUnit(network, weightMap, *s3u25->getOutput(0), 256, 1, true, \"stage3_unit26\");\r\n    auto s3u27 = resUnit(network, weightMap, *s3u26->getOutput(0), 256, 1, true, \"stage3_unit27\");\r\n    auto s3u28 = resUnit(network, weightMap, *s3u27->getOutput(0), 256, 1, true, \"stage3_unit28\");\r\n    auto s3u29 = resUnit(network, weightMap, *s3u28->getOutput(0), 256, 1, true, \"stage3_unit29\");\r\n    auto s3u30 = resUnit(network, weightMap, *s3u29->getOutput(0), 256, 1, true, \"stage3_unit30\");\r\n\r\n    auto s4u1 = resUnit(network, weightMap, *s3u30->getOutput(0), 512, 2, false, \"stage4_unit1\");\r\n    auto s4u2 = resUnit(network, weightMap, *s4u1->getOutput(0), 512, 1, true, \"stage4_unit2\");\r\n    auto s4u3 = resUnit(network, weightMap, *s4u2->getOutput(0), 512, 1, true, \"stage4_unit3\");\r\n\r\n    auto bn1 = addBatchNorm2d(network, weightMap, *s4u3->getOutput(0), \"bn1\", 2e-5);\r\n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*bn1->getOutput(0), 512, weightMap[\"pre_fc1_weight\"], weightMap[\"pre_fc1_bias\"]);\r\n    assert(fc1);\r\n    auto bn2 = addBatchNorm2d(network, weightMap, *fc1->getOutput(0), \"fc1\", 2e-5);\r\n\r\n    bn2->getOutput(0)->setName(OUTPUT_BLOB_NAME);\r\n    network->markOutput(*bn2->getOutput(0));\r\n\r\n    // Build engine\r\n    builder->setMaxBatchSize(maxBatchSize);\r\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\r\n#ifdef USE_FP16\r\n    config->setFlag(BuilderFlag::kFP16);\r\n#endif\r\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\r\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\r\n    
std::cout << \"Build engine successfully!\" << std::endl;\r\n\r\n    // Don't need the network any more\r\n    network->destroy();\r\n\r\n    // Release host memory\r\n    for (auto& mem : weightMap)\r\n    {\r\n        free((void*) (mem.second.values));\r\n    }\r\n\r\n    return engine;\r\n}\r\n\r\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\r\n    // Create builder\r\n    IBuilder* builder = createInferBuilder(gLogger);\r\n    IBuilderConfig* config = builder->createBuilderConfig();\r\n\r\n    // Create model to populate the network, then set the outputs and create an engine\r\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\r\n    assert(engine != nullptr);\r\n\r\n    // Serialize the engine\r\n    (*modelStream) = engine->serialize();\r\n\r\n    // Close everything down, including the builder config, which was previously leaked\r\n    engine->destroy();\r\n    config->destroy();\r\n    builder->destroy();\r\n}\r\n\r\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\r\n    const ICudaEngine& engine = context.getEngine();\r\n\r\n    // Pointers to input and output device buffers to pass to engine.\r\n    // The engine requires exactly ICudaEngine::getNbBindings() buffers.\r\n    assert(engine.getNbBindings() == 2);\r\n    void* buffers[2];\r\n\r\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\r\n    // Note that indices are guaranteed to be less than ICudaEngine::getNbBindings()\r\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\r\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\r\n\r\n    // Create GPU buffers on device\r\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\r\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\r\n\r\n    // Create stream\r\n    cudaStream_t stream;\r\n    CHECK(cudaStreamCreate(&stream));\r\n\r\n    // DMA input batch data to device, 
infer on the batch asynchronously, and DMA output back to host\r\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\r\n    context.enqueue(batchSize, buffers, stream, nullptr);\r\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\r\n    // Errors from the asynchronous work above surface here, so check the result\r\n    CHECK(cudaStreamSynchronize(stream));\r\n\r\n    // Release stream and buffers\r\n    CHECK(cudaStreamDestroy(stream));\r\n    CHECK(cudaFree(buffers[inputIndex]));\r\n    CHECK(cudaFree(buffers[outputIndex]));\r\n}\r\n\r\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\r\n    DIR *p_dir = opendir(p_dir_name);\r\n    if (p_dir == nullptr) {\r\n        return -1;\r\n    }\r\n\r\n    struct dirent* p_file = nullptr;\r\n    while ((p_file = readdir(p_dir)) != nullptr) {\r\n        if (strcmp(p_file->d_name, \".\") != 0 &&\r\n                strcmp(p_file->d_name, \"..\") != 0) {\r\n            //std::string cur_file_name(p_dir_name);\r\n            //cur_file_name += \"/\";\r\n            //cur_file_name += p_file->d_name;\r\n            std::string cur_file_name(p_file->d_name);\r\n            file_names.push_back(cur_file_name);\r\n        }\r\n    }\r\n\r\n    closedir(p_dir);\r\n    return 0;\r\n}\r\n\r\nint main(int argc, char** argv) {\r\n    cudaSetDevice(DEVICE);\r\n    // create a model using the API directly and serialize it to a stream\r\n    char *trtModelStream{nullptr};\r\n    size_t size{0};\r\n\r\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\r\n        IHostMemory* modelStream{nullptr};\r\n        APIToModel(256, &modelStream);\r\n        assert(modelStream != nullptr);\r\n        std::ofstream p(\"arcface-r100.engine\", std::ios::binary);\r\n        if (!p) {\r\n            std::cerr << \"could not open plan output file\" << std::endl;\r\n            return -1;\r\n        }\r\n        p.write(reinterpret_cast<const 
char*>(modelStream->data()), modelStream->size());\r\n        modelStream->destroy();\r\n        return 0;\r\n    } else if (argc == 2 && std::string(argv[1]) == \"-d\") {\r\n        std::ifstream file(\"arcface-r100.engine\", std::ios::binary);\r\n        if (file.good()) {\r\n            file.seekg(0, file.end);\r\n            size = file.tellg();\r\n            file.seekg(0, file.beg);\r\n            trtModelStream = new char[size];\r\n            assert(trtModelStream);\r\n            file.read(trtModelStream, size);\r\n            file.close();\r\n        }\r\n    } else {\r\n        std::cerr << \"arguments not right!\" << std::endl;\r\n        std::cerr << \"./arcface-r100 -s  // serialize model to plan file\" << std::endl;\r\n        std::cerr << \"./arcface-r100 -d  // deserialize plan file and run inference\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // prepare input data ---------------------------\r\n    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\r\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\r\n    //    data[i] = 1.0;\r\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\r\n    IRuntime* runtime = createInferRuntime(gLogger);\r\n    assert(runtime != nullptr);\r\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\r\n    assert(engine != nullptr);\r\n    IExecutionContext* context = engine->createExecutionContext();\r\n    assert(context != nullptr);\r\n    delete[] trtModelStream;\r\n\r\n    cv::Mat img = cv::imread(\"../joey0.ppm\");\r\n    for (int i = 0; i < INPUT_H * INPUT_W; i++) {\r\n        data[i] = ((float)img.at<cv::Vec3b>(i)[2] - 127.5) * 0.0078125;\r\n        data[i + INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[1] - 127.5) * 0.0078125;\r\n        data[i + 2 * INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[0] - 127.5) * 0.0078125;\r\n    }\r\n\r\n    // Run inference\r\n    auto start = std::chrono::system_clock::now();\r\n    doInference(*context, data, prob, 
BATCH_SIZE);\r\n    auto end = std::chrono::system_clock::now();\r\n    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\r\n\r\n    cv::Mat out(512, 1, CV_32FC1, prob);\r\n    cv::Mat out_norm;\r\n    cv::normalize(out, out_norm);\r\n\r\n    img = cv::imread(\"../joey1.ppm\");\r\n    for (int i = 0; i < INPUT_H * INPUT_W; i++) {\r\n        data[i] = ((float)img.at<cv::Vec3b>(i)[2] - 127.5) * 0.0078125;\r\n        data[i + INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[1] - 127.5) * 0.0078125;\r\n        data[i + 2 * INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[0] - 127.5) * 0.0078125;\r\n    }\r\n\r\n    // Run inference\r\n    start = std::chrono::system_clock::now();\r\n    doInference(*context, data, prob, BATCH_SIZE);\r\n    end = std::chrono::system_clock::now();\r\n    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\r\n\r\n    cv::Mat out1(1, 512, CV_32FC1, prob);\r\n    cv::Mat out_norm1;\r\n    cv::normalize(out1, out_norm1);\r\n\r\n    cv::Mat res = out_norm1 * out_norm;\r\n\r\n    std::cout << \"similarity score: \" << *(float*)res.data << std::endl;\r\n\r\n    // Destroy the engine\r\n    context->destroy();\r\n    engine->destroy();\r\n    runtime->destroy();\r\n\r\n    //Print histogram of the output distribution\r\n    //std::cout << \"\\nOutput:\\n\\n\";\r\n    //for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\r\n    //{\r\n    //    std::cout << p_out_norm[i] << \", \";\r\n    //    if (i % 10 == 0) std::cout << i / 10 << std::endl;\r\n    //}\r\n    //std::cout << std::endl;\r\n\r\n    return 0;\r\n}"
  },
  {
    "path": "arcface/arcface-r50.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <opencv2/opencv.hpp>\n#include <dirent.h>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n//#define USE_FP16  // comment out this if want to use FP32\n#define DEVICE 0  // GPU id\n#define BATCH_SIZE 1  // currently, only support BATCH=1\n\nusing namespace nvinfer1;\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 112;\nstatic const int INPUT_W = 112;\nstatic const int OUTPUT_SIZE = 512;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nstatic Logger gLogger;\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = 
val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \"_gamma\"].values;\n    float *beta = (float*)weightMap[lname + \"_beta\"].values;\n    float *mean = (float*)weightMap[lname + \"_moving_mean\"].values;\n    float *var = (float*)weightMap[lname + \"_moving_var\"].values;\n    int len = weightMap[lname + \"_moving_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* addPRelu(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname) {\n\tfloat *gamma = (float*)weightMap[lname + \"_gamma\"].values;\n\tint len = weightMap[lname + \"_gamma\"].count;\n\n\tfloat *scval_1 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n\tfloat *scval_2 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n\tfor (int i = 0; i < len; i++) {\n\t\tscval_1[i] = -1.0;\n\t\tscval_2[i] = 
-gamma[i];\n\t}\n\tWeights scale_1{ DataType::kFLOAT, scval_1, len };\n\tWeights scale_2{ DataType::kFLOAT, scval_2, len };\n\n\tfloat *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n\tfor (int i = 0; i < len; i++) {\n\t\tshval[i] = 0.0;\n\t}\n\tWeights shift{ DataType::kFLOAT, shval, len };\n\n\tfloat *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n\tfor (int i = 0; i < len; i++) {\n\t\tpval[i] = 1.0;\n\t}\n\tWeights power{ DataType::kFLOAT, pval, len };\n\n\tauto relu1 = network->addActivation(input, ActivationType::kRELU);\n\tassert(relu1);\n\tIScaleLayer* scale1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale_1, power);\n\tassert(scale1);\n\tauto relu2 = network->addActivation(*scale1->getOutput(0), ActivationType::kRELU);\n\tassert(relu2);\n\tIScaleLayer* scale2 = network->addScale(*relu2->getOutput(0), ScaleMode::kCHANNEL, shift, scale_2, power);\n\tassert(scale2);\n\tIElementWiseLayer* ew1 = network->addElementWise(*relu1->getOutput(0), *scale2->getOutput(0), ElementWiseOperation::kSUM);\n\tassert(ew1);\n\treturn ew1;\n}\n\nILayer* resUnit(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int num_filters, int s, bool dim_match, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    auto bn1 = addBatchNorm2d(network, weightMap, input, lname + \"_bn1\", 2e-5);\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*bn1->getOutput(0), num_filters, DimsHW{3, 3}, weightMap[lname + \"_conv1_weight\"], emptywts);\n    assert(conv1);\n    conv1->setPaddingNd(DimsHW{1, 1});\n    auto bn2 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"_bn2\", 2e-5);\n    auto act1 = addPRelu(network, weightMap, *bn2->getOutput(0), lname + \"_relu1\");\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*act1->getOutput(0), num_filters, DimsHW{3, 3}, weightMap[lname + \"_conv2_weight\"], emptywts);\n    assert(conv2);\n    
conv2->setStrideNd(DimsHW{s, s});\n    conv2->setPaddingNd(DimsHW{1, 1});\n    auto bn3 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"_bn3\", 2e-5);\n\n    IElementWiseLayer* ew1;\n    if (dim_match) {\n        ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    } else {\n        IConvolutionLayer* conv1sc = network->addConvolutionNd(input, num_filters, DimsHW{1, 1}, weightMap[lname + \"_conv1sc_weight\"], emptywts);\n        assert(conv1sc);\n        conv1sc->setStrideNd(DimsHW{s, s});\n        auto bn1sc = addBatchNorm2d(network, weightMap, *conv1sc->getOutput(0), lname + \"_sc\", 2e-5);\n        ew1 = network->addElementWise(*bn1sc->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    assert(ew1);\n    return ew1;\n}\n\n// Create the engine using only the API, without any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../arcface-r50.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv0 = network->addConvolutionNd(*data, 64, DimsHW{3, 3}, weightMap[\"conv0_weight\"], emptywts);\n    assert(conv0);\n    conv0->setPaddingNd(DimsHW{1, 1});\n    auto bn0 = addBatchNorm2d(network, weightMap, *conv0->getOutput(0), \"bn0\", 2e-5);\n    auto relu0 = addPRelu(network, weightMap, *bn0->getOutput(0), \"relu0\");\n\n    auto s1u1 = resUnit(network, weightMap, *relu0->getOutput(0), 64, 2, false, \"stage1_unit1\");\n    auto s1u2 = resUnit(network, weightMap, *s1u1->getOutput(0), 64, 1, true, \"stage1_unit2\");\n    auto s1u3 = resUnit(network, weightMap, 
*s1u2->getOutput(0), 64, 1, true, \"stage1_unit3\");\n\n    auto s2u1 = resUnit(network, weightMap, *s1u3->getOutput(0), 128, 2, false, \"stage2_unit1\");\n    auto s2u2 = resUnit(network, weightMap, *s2u1->getOutput(0), 128, 1, true, \"stage2_unit2\");\n    auto s2u3 = resUnit(network, weightMap, *s2u2->getOutput(0), 128, 1, true, \"stage2_unit3\");\n    auto s2u4 = resUnit(network, weightMap, *s2u3->getOutput(0), 128, 1, true, \"stage2_unit4\");\n\n    auto s3u1 = resUnit(network, weightMap, *s2u4->getOutput(0), 256, 2, false, \"stage3_unit1\");\n    auto s3u2 = resUnit(network, weightMap, *s3u1->getOutput(0), 256, 1, true, \"stage3_unit2\");\n    auto s3u3 = resUnit(network, weightMap, *s3u2->getOutput(0), 256, 1, true, \"stage3_unit3\");\n    auto s3u4 = resUnit(network, weightMap, *s3u3->getOutput(0), 256, 1, true, \"stage3_unit4\");\n    auto s3u5 = resUnit(network, weightMap, *s3u4->getOutput(0), 256, 1, true, \"stage3_unit5\");\n    auto s3u6 = resUnit(network, weightMap, *s3u5->getOutput(0), 256, 1, true, \"stage3_unit6\");\n    auto s3u7 = resUnit(network, weightMap, *s3u6->getOutput(0), 256, 1, true, \"stage3_unit7\");\n    auto s3u8 = resUnit(network, weightMap, *s3u7->getOutput(0), 256, 1, true, \"stage3_unit8\");\n    auto s3u9 = resUnit(network, weightMap, *s3u8->getOutput(0), 256, 1, true, \"stage3_unit9\");\n    auto s3u10 = resUnit(network, weightMap, *s3u9->getOutput(0), 256, 1, true, \"stage3_unit10\");\n    auto s3u11 = resUnit(network, weightMap, *s3u10->getOutput(0), 256, 1, true, \"stage3_unit11\");\n    auto s3u12 = resUnit(network, weightMap, *s3u11->getOutput(0), 256, 1, true, \"stage3_unit12\");\n    auto s3u13 = resUnit(network, weightMap, *s3u12->getOutput(0), 256, 1, true, \"stage3_unit13\");\n    auto s3u14 = resUnit(network, weightMap, *s3u13->getOutput(0), 256, 1, true, \"stage3_unit14\");\n\n    auto s4u1 = resUnit(network, weightMap, *s3u14->getOutput(0), 512, 2, false, \"stage4_unit1\");\n    auto s4u2 = resUnit(network, 
weightMap, *s4u1->getOutput(0), 512, 1, true, \"stage4_unit2\");\n    auto s4u3 = resUnit(network, weightMap, *s4u2->getOutput(0), 512, 1, true, \"stage4_unit3\");\n\n    auto bn1 = addBatchNorm2d(network, weightMap, *s4u3->getOutput(0), \"bn1\", 2e-5);\n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*bn1->getOutput(0), 512, weightMap[\"pre_fc1_weight\"], weightMap[\"pre_fc1_bias\"]);\n    assert(fc1);\n    auto bn2 = addBatchNorm2d(network, weightMap, *fc1->getOutput(0), \"fc1\", 2e-5);\n\n    bn2->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*bn2->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#ifdef USE_FP16\n    config->setFlag(BuilderFlag::kFP16);\n#endif\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and 
output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n                strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string 
cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(BATCH_SIZE, &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(\"arcface-r50.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    } else if (argc == 2 && std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"arcface-r50.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./arcface-r50 -s  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./arcface-r50 -d  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // prepare input data ---------------------------\n    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n    //    data[i] = 1.0;\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = 
runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    cv::Mat img = cv::imread(\"../joey0.ppm\");\n    for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n        data[i] = ((float)img.at<cv::Vec3b>(i)[2] - 127.5) * 0.0078125;\n        data[i + INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[1] - 127.5) * 0.0078125;\n        data[i + 2 * INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[0] - 127.5) * 0.0078125;\n    }\n\n    // Run inference\n    auto start = std::chrono::system_clock::now();\n    doInference(*context, data, prob, BATCH_SIZE);\n    auto end = std::chrono::system_clock::now();\n    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n    cv::Mat out(512, 1, CV_32FC1, prob);\n    cv::Mat out_norm;\n    cv::normalize(out, out_norm);\n\n    img = cv::imread(\"../joey1.ppm\");\n    for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n        data[i] = ((float)img.at<cv::Vec3b>(i)[2] - 127.5) * 0.0078125;\n        data[i + INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[1] - 127.5) * 0.0078125;\n        data[i + 2 * INPUT_H * INPUT_W] = ((float)img.at<cv::Vec3b>(i)[0] - 127.5) * 0.0078125;\n    }\n\n    // Run inference\n    start = std::chrono::system_clock::now();\n    doInference(*context, data, prob, BATCH_SIZE);\n    end = std::chrono::system_clock::now();\n    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n    cv::Mat out1(1, 512, CV_32FC1, prob);\n    cv::Mat out_norm1;\n    cv::normalize(out1, out_norm1);\n\n    cv::Mat res = out_norm1 * out_norm;\n\n    std::cout << \"similarity score: \" << *(float*)res.data << std::endl;\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    //Print histogram of the 
output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\n    //{\n    //    std::cout << p_out_norm[i] << \", \";\n    //    if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "arcface/gen_wts.py",
    "content": "import struct\nimport sys\nimport argparse\nimport face_model\nimport cv2\nimport numpy as np\n\nparser = argparse.ArgumentParser(description='face model test')\n# general\nparser.add_argument('--image-size', default='112,112', help='')\nparser.add_argument('--model', default='model-r100-ii/model,0', help='path to load model.')\nparser.add_argument('--ga-model', default='', help='path to load model.')\nparser.add_argument('--gpu', default=0, type=int, help='gpu id')\nparser.add_argument('--det', default=0, type=int, help='mtcnn option, 1 means using R+O, 0 means detect from begining')\nparser.add_argument('--flip', default=0, type=int, help='whether do lr flip aug')\nparser.add_argument('--threshold', default=1.24, type=float, help='ver dist threshold')\nargs = parser.parse_args()\n\nmodel = face_model.FaceModel(args)\n\nf = open('arcface-r100.wts', 'w')\nf.write('{}\\n'.format(len(model.model.get_params()[0].keys()) + len(model.model.get_params()[1].keys())))\nfor k, v in model.model.get_params()[0].items():\n    vr = v.reshape(-1).asnumpy()\n    f.write('{} {} '.format(k, len(vr)))\n    for vv in vr:\n        f.write(' ')\n        f.write(struct.pack('>f',float(vv)).hex())\n    f.write('\\n')\nfor k, v in model.model.get_params()[1].items():\n    vr = v.reshape(-1).asnumpy()\n    f.write('{} {} '.format(k, len(vr)))\n    for vv in vr:\n        f.write(' ')\n        f.write(struct.pack('>f',float(vv)).hex())\n    f.write('\\n')\n\n"
  },
  {
    "path": "arcface/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#else\n#define TRT_NOEXCEPT\n#endif\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on 
success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! 
\\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  
Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! 
- Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! 
\\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! 
\\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   
TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! \\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! \\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! 
\\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "arcface/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H"
  },
  {
    "path": "arcface/prelu.cu",
    "content": "#include <cmath>\n#include <stdio.h>\n#include <cassert>\n#include <iostream>\n#include \"prelu.h\"\n\nnamespace nvinfer1\n{\n    PReluPlugin::PReluPlugin(const std::vector<float>& gamma) : gamma_(gamma)\n    {\n    }\n\n    PReluPlugin::~PReluPlugin()\n    {\n    }\n\n    // create the plugin at runtime from a byte stream\n    PReluPlugin::PReluPlugin(const void* data, size_t length)\n    {\n        char *p = (char*)data;\n        input_size_ = reinterpret_cast<const int*>(p)[0];\n        p += sizeof(int);\n        gamma_.assign((float*)p, (float*)p + (length - sizeof(int)) / sizeof(float));\n    }\n\n    void PReluPlugin::serialize(void* buffer) const TRT_NOEXCEPT \n    {\n        *reinterpret_cast<int*>(buffer) = input_size_;\n        char *p = reinterpret_cast<char*>(buffer);\n        p += sizeof(int);\n        memcpy(p, gamma_.data(), gamma_.size() * sizeof(float));\n    }\n\n    size_t PReluPlugin::getSerializationSize() const TRT_NOEXCEPT\n    {  \n        return sizeof(input_size_) + gamma_.size() * sizeof(float);\n    }\n\n    int PReluPlugin::initialize() TRT_NOEXCEPT\n    { \n        return 0;\n    }\n\n    Dims PReluPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT\n    {\n        assert(nbInputDims == 1);\n        assert(index == 0);\n        input_size_ = inputs[0].d[0] * inputs[0].d[1] * inputs[0].d[2];\n        // Output dimensions\n        return Dims3(inputs[0].d[0], inputs[0].d[1], inputs[0].d[2]);\n    }\n\n    // Set plugin namespace\n    void PReluPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT\n    {\n        mPluginNamespace = pluginNamespace;\n    }\n\n    const char* PReluPlugin::getPluginNamespace() const TRT_NOEXCEPT\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType PReluPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT\n 
   {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool PReluPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool PReluPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    void PReluPlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT\n    {\n    }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void PReluPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void PReluPlugin::detachFromContext() TRT_NOEXCEPT {}\n\n    const char* PReluPlugin::getPluginType() const TRT_NOEXCEPT\n    {\n        return \"PRelu_TRT\";\n    }\n\n    const char* PReluPlugin::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    void PReluPlugin::destroy() TRT_NOEXCEPT\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    IPluginV2IOExt* PReluPlugin::clone() const TRT_NOEXCEPT\n    {\n        PReluPlugin *p = new PReluPlugin(gamma_);\n        p->input_size_ = input_size_;\n        p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __global__ void prelu_kernel(const float *input, float *output, int num_elem, int input_size, int fm_size, const float* gamma) {\n\n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= num_elem) return;\n\n        if (input[idx] >= 0.0f) {\n            output[idx] = input[idx];\n            return;\n        }\n        int c = (idx % input_size) / 
fm_size;\n        output[idx] = input[idx] * gamma[c];\n    }\n\n    void PReluPlugin::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize) {\n        int block_size = thread_count_;\n        int grid_size = (input_size_ * batchSize + block_size - 1) / block_size;\n        void *dev_gamma = nullptr;\n        // Keep the CUDA calls outside assert(): with NDEBUG defined, assert() drops its argument and the calls would never run.\n        cudaError_t err = cudaMalloc(&dev_gamma, sizeof(float) * gamma_.size());\n        assert(err == cudaSuccess);\n        err = cudaMemcpy(dev_gamma, gamma_.data(), sizeof(float) * gamma_.size(), cudaMemcpyHostToDevice);\n        assert(err == cudaSuccess);\n        prelu_kernel<<<grid_size, block_size>>>(inputs[0], output, input_size_ * batchSize, input_size_, input_size_ / gamma_.size(), (const float*)dev_gamma);\n        err = cudaFree(dev_gamma);\n        assert(err == cudaSuccess);\n        (void)err;  // silence the unused-variable warning in release builds\n    }\n\n    int PReluPlugin::enqueue(int batchSize, const void*const * inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT\n    {\n        //assert(batchSize == 1);\n        //GPU\n        //CUDA_CHECK(cudaStreamSynchronize(stream));\n        forwardGpu((const float *const *)inputs, (float*)outputs[0], stream, batchSize);\n        return 0;\n    }\n\n    PluginFieldCollection PReluPluginCreator::mFC{};\n    std::vector<PluginField> PReluPluginCreator::mPluginAttributes;\n\n    PReluPluginCreator::PReluPluginCreator()\n    {\n        mPluginAttributes.emplace_back(PluginField(\"gamma\", nullptr, PluginFieldType::kFLOAT32, 1));\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* PReluPluginCreator::getPluginName() const TRT_NOEXCEPT\n    {\n            return \"PRelu_TRT\";\n    }\n\n    const char* PReluPluginCreator::getPluginVersion() const TRT_NOEXCEPT\n    {\n            return \"1\";\n    }\n\n    const PluginFieldCollection* PReluPluginCreator::getFieldNames() TRT_NOEXCEPT\n    {\n            return &mFC;\n    }\n\n    IPluginV2IOExt* PReluPluginCreator::createPlugin(const char* name, const 
PluginFieldCollection* fc) TRT_NOEXCEPT\n    {\n        std::vector<float> gamma;\n        const PluginField* fields = fc->fields;\n        for (int i = 0; i < fc->nbFields; ++i) {\n            const char* attrName = fields[i].name;\n            if (!strcmp(attrName, \"gamma\")) {\n                assert(fields[i].type == PluginFieldType::kFLOAT32);\n                int size = fields[i].length;\n                gamma.reserve(size);\n                const auto* w = static_cast<const float*>(fields[i].data);\n                for (int j = 0; j < size; j++)\n                {\n                    gamma.push_back(*w);\n                    w++;\n                }\n            }\n        }\n\n        PReluPlugin* obj = new PReluPlugin(gamma);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* PReluPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call PReluPlugin::destroy()\n        PReluPlugin* obj = new PReluPlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n}\n\n"
  },
  {
    "path": "arcface/prelu.h",
    "content": "#ifndef _PRELU_PLUGIN_H\n#define _PRELU_PLUGIN_H\n\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\nnamespace nvinfer1\n{\n    class PReluPlugin: public IPluginV2IOExt\n    {\n        public:\n            PReluPlugin(const std::vector<float>& gamma);\n            PReluPlugin(const void* data, size_t length);\n\n            ~PReluPlugin();\n\n            int getNbOutputs() const TRT_NOEXCEPT override\n            {\n                return 1;\n            }\n\n            Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n            int initialize() TRT_NOEXCEPT override;\n\n            virtual void terminate() TRT_NOEXCEPT override {};\n\n            virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0;}\n\n            virtual int enqueue(int batchSize, const void*const * inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;\n\n            virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n            virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {\n                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n            }\n\n            const char* getPluginType() const TRT_NOEXCEPT override;\n\n            const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n            void destroy() TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n            void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n            const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) 
const TRT_NOEXCEPT override;\n\n            bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;\n\n            bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n            void attachToContext(\n                    cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n            void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT override;\n\n            void detachFromContext() TRT_NOEXCEPT override;\n\n            int input_size_;\n        private:\n            void forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize = 1);\n            int thread_count_ = 256;\n            std::vector<float> gamma_;\n            const char* mPluginNamespace;\n    };\n\n    class PReluPluginCreator : public IPluginCreator\n    {\n        public:\n            PReluPluginCreator();\n\n            ~PReluPluginCreator() override = default;\n\n            const char* getPluginName() const TRT_NOEXCEPT override;\n\n            const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n            const PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT override;\n\n            void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override\n            {\n                mNamespace = libNamespace;\n            }\n\n            const char* getPluginNamespace() const TRT_NOEXCEPT override\n            {\n                return mNamespace.c_str();\n            }\n\n        private:\n            std::string mNamespace;\n            static PluginFieldCollection 
mFC;\n            static std::vector<PluginField> mPluginAttributes;\n    };\n}  // namespace nvinfer1\n#endif  // _PRELU_PLUGIN_H\n"
  },
  {
    "path": "centernet/README.md",
    "content": "# CenterNet\n\nThis is the trt implementation of detection model [ctdet_coco_dla_2x](https://drive.google.com/open?id=1pl_-ael8wERdUREEnaIfqOV_VF2bEVRT) from [xingyizhou/CenterNet](https://github.com/xingyizhou/CenterNet) official work. \n\n## How to Run\n\n1. Follow [NVIDIA/TensorRT](https://github.com/NVIDIA/TensorRT) tutorial to build TensorRT7\n\n2. Copy folder `dcnv2Plugin` to `TensorRT/plugin` and edit `InferPlugin.cpp` and `CMakeLists.txt`\n\n3. Rebuild to install custom plugin\n\n4. Use `tensorrt-7.2.3.4-cp36-none-linux_x86_64.whl` in TensorRT OSS to update your python-tensorrt\n\n5. Run `python centernet.py -m ${PTH_PATH} -s` to create trt engine \n\n## Sample\n\n```\n// Download ctdet_coco_dla_2x.pth and transfer it into trt engine first\n// Download the test img from https://raw.githubusercontent.com/tensorflow/models/master/research/deeplab/g3doc/img/image2.jpg or choose your own one\ncd sample\npython test.py ${ENGINE_PATH} ${IMG_PATH}\n```\n![trt_out](https://user-images.githubusercontent.com/47047345/119128637-7a878900-ba68-11eb-91ff-5dcc10f01b77.jpg)\n\n## TODO\n\nIntegrate the post process with trt engine to make it more easier to use."
  },
  {
    "path": "centernet/centernet.py",
    "content": "import numpy as np\n\nimport tensorrt as trt\nimport torch\n\nfrom sample import common\nimport argparse\nimport time\n\n# You can set the logger severity higher to suppress messages (or lower to display more messages).\nTRT_LOGGER = trt.Logger(trt.Logger.WARNING)\n\ntrt.init_libnvinfer_plugins(TRT_LOGGER, '')\nPLUGIN_CREATORS = trt.get_plugin_registry().plugin_creator_list\n\nfor plugin_creator in PLUGIN_CREATORS:\n    if plugin_creator.name == 'DCNv2_TRT':\n        dcnCreator = plugin_creator\n\n\nclass ModelData(object):\n    INPUT_NAME = \"data\"\n    INPUT_SHAPE = (3, 512, 512)\n    OUTPUT_NAME = \"prob\"\n    DTYPE = trt.float16\n\n\nclass Centernet_dla34(object):\n    def __init__(self, weights) -> None:\n        super().__init__()\n        self.weights = weights\n        self.levels = [1, 1, 1, 2, 2, 1]\n        self.channels = [16, 32, 64, 128, 256, 512]\n        self.down_ratio = 4\n        self.last_level = 5\n        self.engine = self.build_engine()\n\n    def add_batchnorm_2d(self, input_tensor, parent):\n        gamma = self.weights[parent + '.weight'].numpy()\n        beta = self.weights[parent + '.bias'].numpy()\n        mean = self.weights[parent + '.running_mean'].numpy()\n        var = self.weights[parent + '.running_var'].numpy()\n        eps = 1e-5\n\n        scale = gamma / np.sqrt(var + eps)\n        shift = beta - mean * gamma / np.sqrt(var + eps)\n        power = np.ones_like(scale)\n\n        return self.network.add_scale(input=input_tensor.get_output(0), mode=trt.ScaleMode.CHANNEL, shift=shift, scale=scale, power=power)\n\n    def add_basic_block(self, input_tensor, out_channels, residual=None, stride=1, dilation=1, parent=''):\n        conv1_w = self.weights[parent + '.conv1.weight'].numpy()\n        conv1 = self.network.add_convolution(input=input_tensor.get_output(\n            0), num_output_maps=out_channels, kernel_shape=(3, 3), kernel=conv1_w)\n        conv1.stride = (stride, stride)\n        conv1.padding = 
(dilation, dilation)\n        conv1.dilation = (dilation, dilation)\n\n        bn1 = self.add_batchnorm_2d(conv1, parent + '.bn1')\n        ac1 = self.network.add_activation(\n            input=bn1.get_output(0), type=trt.ActivationType.RELU)\n\n        conv2_w = self.weights[parent + '.conv2.weight'].numpy()\n        conv2 = self.network.add_convolution(input=ac1.get_output(\n            0), num_output_maps=out_channels, kernel_shape=(3, 3), kernel=conv2_w)\n        conv2.padding = (dilation, dilation)\n        conv2.dilation = (dilation, dilation)\n\n        out = self.add_batchnorm_2d(conv2, parent + '.bn2')\n\n        if residual is None:\n            out = self.network.add_elementwise(input_tensor.get_output(\n                0), out.get_output(0), trt.ElementWiseOperation.SUM)\n        else:\n            out = self.network.add_elementwise(residual.get_output(\n                0), out.get_output(0), trt.ElementWiseOperation.SUM)\n        return self.network.add_activation(input=out.get_output(0), type=trt.ActivationType.RELU)\n\n    def add_level(self, input_tensor, out_channels, stride=1, dilation=1, parent=''):\n        conv1_w = self.weights[parent + '.0.weight'].numpy()\n        conv1 = self.network.add_convolution(input=input_tensor.get_output(\n            0), num_output_maps=out_channels, kernel_shape=(3, 3), kernel=conv1_w)\n        conv1.stride = (stride, stride)\n        conv1.padding = (dilation, dilation)\n        conv1.dilation = (dilation, dilation)\n\n        bn1 = self.add_batchnorm_2d(conv1, parent + '.1')\n        ac1 = self.network.add_activation(\n            input=bn1.get_output(0), type=trt.ActivationType.RELU)\n        return ac1\n\n    def add_root(self, input_tensors: list, out_channels, kernel_size=1, residual=False, parent=''):\n        ct = self.network.add_concatenation(\n            [x.get_output(0) for x in input_tensors])\n\n        conv_w = self.weights[parent + '.conv.weight'].numpy()\n        conv = 
self.network.add_convolution(input=ct.get_output(\n            0), num_output_maps=out_channels, kernel_shape=(1, 1), kernel=conv_w)\n        conv.padding = ((kernel_size - 1) // 2, (kernel_size - 1) // 2)\n\n        bn1 = self.add_batchnorm_2d(conv, parent + '.bn')\n        out = self.network.add_activation(\n            input=bn1.get_output(0), type=trt.ActivationType.RELU)\n\n        if residual:\n            out = self.network.add_elementwise(input_tensors[0].get_output(\n                0), out.get_output(0), trt.ElementWiseOperation.SUM)\n\n        return self.network.add_activation(input=out.get_output(0), type=trt.ActivationType.RELU)\n\n    def add_tree(self, input_tensor, level, out_channels, residual=None, children=None, stride=1, level_root=False, parent=''):\n        children = [] if children is None else children\n        if stride > 1:\n            bottom = self.network.add_pooling(input_tensor.get_output(\n                0), trt.PoolingType.MAX, (stride, stride))\n            bottom.stride = (stride, stride)\n        else:\n            bottom = input_tensor\n\n        if input_tensor.get_output(0).shape[0] != out_channels:\n            project_conv1_w = self.weights[parent +\n                                           '.project.0.weight'].numpy()\n            project_conv1 = self.network.add_convolution(input=bottom.get_output(\n                0), num_output_maps=out_channels, kernel_shape=(1, 1), kernel=project_conv1_w)\n            residual = self.add_batchnorm_2d(\n                project_conv1, parent + '.project.1')\n        else:\n            residual = bottom\n\n        if level_root:\n            children.append(bottom)\n\n        if level == 1:\n            tree1 = self.add_basic_block(\n                input_tensor, out_channels, residual, stride, parent=parent+'.tree1')\n            tree2 = self.add_basic_block(\n                tree1, out_channels, parent=parent+'.tree2')\n            return self.add_root([tree2, tree1]+children, 
out_channels, parent=parent+'.root')\n        else:\n            tree1 = self.add_tree(input_tensor, level-1, out_channels,\n                                  residual, stride=stride, parent=parent+'.tree1')\n            children.append(tree1)\n            return self.add_tree(tree1, level-1, out_channels, children=children, parent=parent+'.tree2')\n\n    def add_base(self, input_tensor, parent):\n        base_conv1_w = self.weights[parent+'.base_layer.0.weight'].numpy()\n        base_conv1 = self.network.add_convolution(\n            input=input_tensor, num_output_maps=self.channels[0], kernel_shape=(7, 7), kernel=base_conv1_w)\n        base_conv1.padding = (3, 3)\n\n        base_bn1 = self.add_batchnorm_2d(base_conv1, parent+'.base_layer.1')\n        base_ac1 = self.network.add_activation(\n            input=base_bn1.get_output(0), type=trt.ActivationType.RELU)\n\n        level0 = self.add_level(\n            base_ac1, self.channels[0],    parent=parent+'.level0')\n        level1 = self.add_level(\n            level0,   self.channels[1], 2, parent=parent+'.level1')\n\n        level2 = self.add_tree(\n            level1, self.levels[2], self.channels[2], stride=2, level_root=False, parent=parent+'.level2')\n        level3 = self.add_tree(\n            level2, self.levels[3], self.channels[3], stride=2, level_root=True, parent=parent+'.level3')\n        level4 = self.add_tree(\n            level3, self.levels[4], self.channels[4], stride=2, level_root=True, parent=parent+'.level4')\n        level5 = self.add_tree(\n            level4, self.levels[5], self.channels[5], stride=2, level_root=True, parent=parent+'.level5')\n\n        return [level0, level1, level2, level3, level4, level5]\n\n    def add_deform_conv(self, input_tensor, out_channels, kernel=3, stride=1, padding=1, dilation=1, deformable_group=1, parent=''):\n        conv_offset_mask_w = self.weights[parent +\n                                          '.conv.conv_offset_mask.weight'].numpy()\n        
conv_offset_mask_b = self.weights[parent +\n                                          '.conv.conv_offset_mask.bias'].numpy()\n        conv_offset_mask = self.network.add_convolution(input=input_tensor.get_output(0),\n                                                        num_output_maps=deformable_group*3*kernel*kernel,\n                                                        kernel_shape=(\n                                                            kernel, kernel),\n                                                        kernel=conv_offset_mask_w,\n                                                        bias=conv_offset_mask_b)\n        conv_offset_mask.stride = (stride, stride)\n        conv_offset_mask.padding = (padding, padding)\n\n        out_channels = trt.PluginField(\"out_channels\", np.array(\n            [out_channels], dtype=np.int32), trt.PluginFieldType.INT32)\n        kernel = trt.PluginField(\"kernel\", np.array(\n            [kernel], dtype=np.int32), trt.PluginFieldType.INT32)\n        deformable_group = trt.PluginField(\"deformable_group\", np.array(\n            [deformable_group], dtype=np.int32), trt.PluginFieldType.INT32)\n        dilation = trt.PluginField(\"dilation\", np.array(\n            [dilation], dtype=np.int32), trt.PluginFieldType.INT32)\n        padding = trt.PluginField(\"padding\", np.array(\n            [padding], dtype=np.int32), trt.PluginFieldType.INT32)\n        stride = trt.PluginField(\"stride\", np.array(\n            [stride], dtype=np.int32), trt.PluginFieldType.INT32)\n        weight = trt.PluginField(\n            \"weight\", self.weights[parent + '.conv.weight'].numpy(), trt.PluginFieldType.FLOAT32)\n        bias = trt.PluginField(\n            \"bias\", self.weights[parent + '.conv.bias'].numpy(), trt.PluginFieldType.FLOAT32)\n        field_collection = trt.PluginFieldCollection(\n            [out_channels, kernel, deformable_group, dilation, padding, stride, weight, bias])\n        DCN = 
dcnCreator.create_plugin(\n            name='DCNv2_TRT', field_collection=field_collection)\n\n        sigmoid_conv_offset_mask = self.network.add_activation(\n            input=conv_offset_mask.get_output(0), type=trt.ActivationType.SIGMOID)\n\n        dcn = self.network.add_plugin_v2(inputs=[input_tensor.get_output(\n            0), conv_offset_mask.get_output(0), sigmoid_conv_offset_mask.get_output(0)], plugin=DCN)\n        bn = self.add_batchnorm_2d(dcn, parent+'.actf.0')\n        return self.network.add_activation(input=bn.get_output(0), type=trt.ActivationType.RELU)\n\n    def add_ida_up(self, input_tensors, out_channels, up_f, startp, parent):\n        for i in range(startp + 1, len(input_tensors)):\n            proj = self.add_deform_conv(\n                input_tensors[i], out_channels, parent=parent+'.proj_%d' % (i-startp))\n            f = up_f[i-startp]\n            up_w = self.weights[parent + '.up_%d.weight' % (i-startp)].numpy()\n            up = self.network.add_deconvolution(\n                proj.get_output(0), out_channels, (f*2, f*2), up_w)\n            up.stride = (f, f)\n            up.padding = (f//2, f//2)\n            up.num_groups = out_channels\n            node = self.network.add_elementwise(\n                input_tensors[i-1].get_output(0), up.get_output(0), trt.ElementWiseOperation.SUM)\n            input_tensors[i] = self.add_deform_conv(\n                node, out_channels, parent=parent+'.node_%d' % (i-startp))\n        return input_tensors\n\n    def add_dla_up(self, input_tensors, first_level, parent):\n        channels = self.channels[first_level:]\n        scales = [2 ** i for i in range(len(self.channels[first_level:]))]\n        scales = np.array(scales, dtype=int)\n        out = [input_tensors[-1]]\n        for i in range(len(channels) - 1):\n            j = -i - 2\n            input_tensors = self.add_ida_up(\n                input_tensors, channels[j], scales[j:] // scales[j], len(input_tensors) - i - 2, parent+'.ida_%d' % 
i)\n            out.insert(0, input_tensors[-1])\n            scales[j + 1:] = scales[j]\n            channels[j + 1:] = [channels[j] for _ in channels[j + 1:]]\n        return out\n\n    def add_head(self, input_tensor, out_channels, head, head_conv=256, final_kernal=1):\n        conv1_w = self.weights[head+'.0.weight'].numpy()\n        conv1_b = self.weights[head+'.0.bias'].numpy()\n        conv1 = self.network.add_convolution(\n            input_tensor.get_output(0), head_conv, (3, 3), conv1_w, conv1_b)\n        conv1.padding = (1, 1)\n        ac1 = self.network.add_activation(\n            input=conv1.get_output(0), type=trt.ActivationType.RELU)\n        conv2_w = self.weights[head + '.2.weight'].numpy()\n        conv2_b = self.weights[head+'.2.bias'].numpy()\n        conv2 = self.network.add_convolution(ac1.get_output(\n            0), out_channels, (final_kernal, final_kernal), conv2_w, conv2_b)\n        return conv2\n\n    def populate_network(self):\n        # Configure the network layers based on the self.weights provided.\n        input_tensor = self.network.add_input(\n            name=ModelData.INPUT_NAME, dtype=ModelData.DTYPE, shape=ModelData.INPUT_SHAPE)\n\n        y = self.add_base(input_tensor, 'module.base')\n\n        first_level = int(np.log2(self.down_ratio))\n        last_level = self.last_level\n        dla_up = self.add_dla_up(y, first_level, 'module.dla_up')\n        ida_up = self.add_ida_up(dla_up[:last_level-first_level], self.channels[first_level], [\n                                 2 ** i for i in range(last_level - first_level)], 0, 'module.ida_up')\n\n        hm = self.add_head(ida_up[-1], 80, 'module.hm')\n        wh = self.add_head(ida_up[-1], 2, 'module.wh')\n        reg = self.add_head(ida_up[-1], 2, 'module.reg')\n\n        hm.get_output(0).name = 'hm'\n        wh.get_output(0).name = 'wh'\n        reg.get_output(0).name = 'reg'\n        self.network.mark_output(tensor=hm.get_output(0))\n        
self.network.mark_output(tensor=wh.get_output(0))\n        self.network.mark_output(tensor=reg.get_output(0))\n\n    def build_engine(self):\n        # For more information on TRT basics, refer to the introductory samples.\n        with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:\n            self.network = network\n            builder.max_workspace_size = common.GiB(1)\n            builder.max_batch_size = 1\n            # Populate the network using self.weights from the PyTorch model.\n            self.populate_network()\n            # Build and return an engine.\n            return builder.build_cuda_engine(self.network)\n\n\ndef load_random_test_case(pagelocked_buffer):\n    # Generate a random tensor (Gaussian noise) to use as the test input.\n    img = np.random.randn(1, 3, 512, 512).astype(np.float32)\n    # Copy to the pagelocked input buffer\n    np.copyto(pagelocked_buffer, img.ravel())\n    return img\n\n\ndef main(args):\n    # Get the PyTorch weights\n    weights = torch.load(args.model, map_location={\n                         'cuda:0': 'cpu'})['state_dict']\n    # Do inference with TensorRT.\n    with Centernet_dla34(weights).engine as engine:\n        if args.save_engine:\n            with open('centernet.engine', \"wb\") as f:\n                f.write(engine.serialize())\n        inputs, outputs, bindings, stream = common.allocate_buffers(engine)\n        with engine.create_execution_context() as context:\n            img = load_random_test_case(pagelocked_buffer=inputs[0].host)\n            # For more information on performing inference, refer to the introductory samples.\n            # The common.do_inference function returns a list of outputs - here there are three: hm, wh and reg.\n            t = time.time()\n            [hm, wh, reg] = common.do_inference(\n                context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream, batch_size=1)\n            t = time.time() - t\n            print('output:   hm:%f, wh:%f, 
reg:%f' %\n                  (hm.mean(), wh.mean(), reg.mean()))\n            print('inference time: %fs' % t)\n\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser(description='CenterNet dla34 ctdet')\n    parser.add_argument('--model',  '-m', type=str,\n                        default='./ctdet_coco_dla_2x.pth', help='path to the PyTorch .pth checkpoint')\n    parser.add_argument('--save_engine', '-s',\n                        action='store_true', help='save the TensorRT engine to centernet.engine')\n    args = parser.parse_args()\n    main(args)\n"
  },
  {
    "path": "centernet/dcnv2Plugin/CMakeLists.txt",
    "content": "#\n# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n#\nfile(GLOB SRCS *.cpp)\nset(PLUGIN_SOURCES ${PLUGIN_SOURCES} ${SRCS})\nset(PLUGIN_SOURCES ${PLUGIN_SOURCES} PARENT_SCOPE)\nfile(GLOB CU_SRCS *.cu)\nset(PLUGIN_CU_SOURCES ${PLUGIN_CU_SOURCES} ${CU_SRCS})\nset(PLUGIN_CU_SOURCES ${PLUGIN_CU_SOURCES} PARENT_SCOPE)"
  },
  {
    "path": "centernet/dcnv2Plugin/dcn_v2_im2col_cuda.cu",
    "content": "#include \"dcn_v2_im2col_cuda.h\"\n#include <cstdio>\n#include <algorithm>\n#include <cstring>\n\n#define CUDA_KERNEL_LOOP(i, n)                          \\\n  for (int i = blockIdx.x * blockDim.x + threadIdx.x;   \\\n      i < (n);                                          \\\n      i += blockDim.x * gridDim.x)\n\nconst int CUDA_NUM_THREADS = 512;\n//inline int GET_BLOCKS(const int N)\n//{\n//  return (N + CUDA_NUM_THREADS - 1) / CUDA_NUM_THREADS;\n//}\ndim3 GET_BLOCKS(uint n)\n{\n    uint k = (n - 1) /CUDA_NUM_THREADS + 1;\n    uint x = k ;\n    uint y = 1 ;\n    if (x > 65535 )\n    {\n        x = ceil(sqrt(x));\n        y = (n - 1 )/(x*CUDA_NUM_THREADS) + 1;\n    }\n    dim3 d = {x,y,1} ;\n    return d;\n}\n\n__device__ float dmcn_im2col_bilinear(const float *bottom_data, const int data_width,\n                                      const int height, const int width, float h, float w)\n{\n  int h_low = floor(h);\n  int w_low = floor(w);\n  int h_high = h_low + 1;\n  int w_high = w_low + 1;\n\n  float lh = h - h_low;\n  float lw = w - w_low;\n  float hh = 1 - lh, hw = 1 - lw;\n\n  float v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n    v1 = bottom_data[h_low * data_width + w_low];\n  float v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n    v2 = bottom_data[h_low * data_width + w_high];\n  float v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n    v3 = bottom_data[h_high * data_width + w_low];\n  float v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n    v4 = bottom_data[h_high * data_width + w_high];\n\n  float w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n\n  float val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  return val;\n}\n\n__device__ float dmcn_get_gradient_weight(float argmax_h, float argmax_w,\n                                          const int h, const int w, const int height, const int width)\n{\n  if (argmax_h <= -1 || argmax_h >= height || argmax_w <= -1 || argmax_w >= width)\n  {\n    //empty\n    return 
0;\n  }\n\n  int argmax_h_low = floor(argmax_h);\n  int argmax_w_low = floor(argmax_w);\n  int argmax_h_high = argmax_h_low + 1;\n  int argmax_w_high = argmax_w_low + 1;\n\n  float weight = 0;\n  if (h == argmax_h_low && w == argmax_w_low)\n    weight = (h + 1 - argmax_h) * (w + 1 - argmax_w);\n  if (h == argmax_h_low && w == argmax_w_high)\n    weight = (h + 1 - argmax_h) * (argmax_w + 1 - w);\n  if (h == argmax_h_high && w == argmax_w_low)\n    weight = (argmax_h + 1 - h) * (w + 1 - argmax_w);\n  if (h == argmax_h_high && w == argmax_w_high)\n    weight = (argmax_h + 1 - h) * (argmax_w + 1 - w);\n  return weight;\n}\n\n__device__ float dmcn_get_coordinate_weight(float argmax_h, float argmax_w,\n                                            const int height, const int width, const float *im_data,\n                                            const int data_width, const int bp_dir)\n{\n  if (argmax_h <= -1 || argmax_h >= height || argmax_w <= -1 || argmax_w >= width)\n  {\n    //empty\n    return 0;\n  }\n\n  int argmax_h_low = floor(argmax_h);\n  int argmax_w_low = floor(argmax_w);\n  int argmax_h_high = argmax_h_low + 1;\n  int argmax_w_high = argmax_w_low + 1;\n\n  float weight = 0;\n\n  if (bp_dir == 0)\n  {\n    if (argmax_h_low >= 0 && argmax_w_low >= 0)\n      weight += -1 * (argmax_w_low + 1 - argmax_w) * im_data[argmax_h_low * data_width + argmax_w_low];\n    if (argmax_h_low >= 0 && argmax_w_high <= width - 1)\n      weight += -1 * (argmax_w - argmax_w_low) * im_data[argmax_h_low * data_width + argmax_w_high];\n    if (argmax_h_high <= height - 1 && argmax_w_low >= 0)\n      weight += (argmax_w_low + 1 - argmax_w) * im_data[argmax_h_high * data_width + argmax_w_low];\n    if (argmax_h_high <= height - 1 && argmax_w_high <= width - 1)\n      weight += (argmax_w - argmax_w_low) * im_data[argmax_h_high * data_width + argmax_w_high];\n  }\n  else if (bp_dir == 1)\n  {\n    if (argmax_h_low >= 0 && argmax_w_low >= 0)\n      weight += -1 * (argmax_h_low + 1 - 
argmax_h) * im_data[argmax_h_low * data_width + argmax_w_low];\n    if (argmax_h_low >= 0 && argmax_w_high <= width - 1)\n      weight += (argmax_h_low + 1 - argmax_h) * im_data[argmax_h_low * data_width + argmax_w_high];\n    if (argmax_h_high <= height - 1 && argmax_w_low >= 0)\n      weight += -1 * (argmax_h - argmax_h_low) * im_data[argmax_h_high * data_width + argmax_w_low];\n    if (argmax_h_high <= height - 1 && argmax_w_high <= width - 1)\n      weight += (argmax_h - argmax_h_low) * im_data[argmax_h_high * data_width + argmax_w_high];\n  }\n\n  return weight;\n}\n\n__global__ void modulated_deformable_im2col_gpu_kernel(const int n,\n                                                       const float *data_im, const float *data_offset, const float *data_mask,\n                                                       const int height, const int width, const int kernel_h, const int kernel_w,\n                                                       const int pad_h, const int pad_w,\n                                                       const int stride_h, const int stride_w,\n                                                       const int dilation_h, const int dilation_w,\n                                                       const int channel_per_deformable_group,\n                                                       const int batch_size, const int num_channels, const int deformable_group,\n                                                       const int height_col, const int width_col,\n                                                       float *data_col)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    // index index of output matrix\n    const int w_col = index % width_col;\n    const int h_col = (index / width_col) % height_col;\n    const int b_col = (index / width_col / height_col) % batch_size;\n    const int c_im = (index / width_col / height_col) / batch_size;\n    const int c_col = c_im * kernel_h * kernel_w;\n\n    // compute deformable group index\n   
 const int deformable_group_index = c_im / channel_per_deformable_group;\n\n    const int h_in = h_col * stride_h - pad_h;\n    const int w_in = w_col * stride_w - pad_w;\n\n    float *data_col_ptr = data_col + ((c_col * batch_size + b_col) * height_col + h_col) * width_col + w_col;\n    //const float* data_im_ptr = data_im + ((b_col * num_channels + c_im) * height + h_in) * width + w_in;\n    const float *data_im_ptr = data_im + (b_col * num_channels + c_im) * height * width;\n    const float *data_offset_ptr = data_offset + (b_col * deformable_group + deformable_group_index) * 2 * kernel_h * kernel_w * height_col * width_col;\n\n    const float *data_mask_ptr = data_mask + (b_col * deformable_group + deformable_group_index) * kernel_h * kernel_w * height_col * width_col;\n\n    for (int i = 0; i < kernel_h; ++i)\n    {\n      for (int j = 0; j < kernel_w; ++j)\n      {\n        const int data_offset_h_ptr = ((2 * (i * kernel_w + j)) * height_col + h_col) * width_col + w_col;\n        const int data_offset_w_ptr = ((2 * (i * kernel_w + j) + 1) * height_col + h_col) * width_col + w_col;\n        const int data_mask_hw_ptr = ((i * kernel_w + j) * height_col + h_col) * width_col + w_col;\n        const float offset_h = data_offset_ptr[data_offset_h_ptr];\n        const float offset_w = data_offset_ptr[data_offset_w_ptr];\n        const float mask = data_mask_ptr[data_mask_hw_ptr];\n        float val = static_cast<float>(0);\n        const float h_im = h_in + i * dilation_h + offset_h;\n        const float w_im = w_in + j * dilation_w + offset_w;\n        //if (h_im >= 0 && w_im >= 0 && h_im < height && w_im < width) {\n        if (h_im > -1 && w_im > -1 && h_im < height && w_im < width)\n        {\n          //const float map_h = i * dilation_h + offset_h;\n          //const float map_w = j * dilation_w + offset_w;\n          //const int cur_height = height - h_in;\n          //const int cur_width = width - w_in;\n          //val = dmcn_im2col_bilinear(data_im_ptr, 
width, cur_height, cur_width, map_h, map_w);\n          val = dmcn_im2col_bilinear(data_im_ptr, width, height, width, h_im, w_im);\n        }\n        *data_col_ptr = val * mask;\n        data_col_ptr += batch_size * height_col * width_col;\n        //data_col_ptr += height_col * width_col;\n      }\n    }\n  }\n}\n\n__global__ void modulated_deformable_col2im_gpu_kernel(const int n,\n                                                       const float *data_col, const float *data_offset, const float *data_mask,\n                                                       const int channels, const int height, const int width,\n                                                       const int kernel_h, const int kernel_w,\n                                                       const int pad_h, const int pad_w,\n                                                       const int stride_h, const int stride_w,\n                                                       const int dilation_h, const int dilation_w,\n                                                       const int channel_per_deformable_group,\n                                                       const int batch_size, const int deformable_group,\n                                                       const int height_col, const int width_col,\n                                                       float *grad_im)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    const int j = (index / width_col / height_col / batch_size) % kernel_w;\n    const int i = (index / width_col / height_col / batch_size / kernel_w) % kernel_h;\n    const int c = index / width_col / height_col / batch_size / kernel_w / kernel_h;\n    // compute the start and end of the output\n\n    const int deformable_group_index = c / channel_per_deformable_group;\n\n    int w_out = index % width_col;\n    int h_out = (index / width_col) % height_col;\n    int b = (index / width_col / height_col) % batch_size;\n    int w_in = w_out * stride_w - pad_w;\n    int h_in 
= h_out * stride_h - pad_h;\n\n    const float *data_offset_ptr = data_offset + (b * deformable_group + deformable_group_index) * 2 * kernel_h * kernel_w * height_col * width_col;\n    const float *data_mask_ptr = data_mask + (b * deformable_group + deformable_group_index) * kernel_h * kernel_w * height_col * width_col;\n    const int data_offset_h_ptr = ((2 * (i * kernel_w + j)) * height_col + h_out) * width_col + w_out;\n    const int data_offset_w_ptr = ((2 * (i * kernel_w + j) + 1) * height_col + h_out) * width_col + w_out;\n    const int data_mask_hw_ptr = ((i * kernel_w + j) * height_col + h_out) * width_col + w_out;\n    const float offset_h = data_offset_ptr[data_offset_h_ptr];\n    const float offset_w = data_offset_ptr[data_offset_w_ptr];\n    const float mask = data_mask_ptr[data_mask_hw_ptr];\n    const float cur_inv_h_data = h_in + i * dilation_h + offset_h;\n    const float cur_inv_w_data = w_in + j * dilation_w + offset_w;\n\n    const float cur_top_grad = data_col[index] * mask;\n    const int cur_h = (int)cur_inv_h_data;\n    const int cur_w = (int)cur_inv_w_data;\n    for (int dy = -2; dy <= 2; dy++)\n    {\n      for (int dx = -2; dx <= 2; dx++)\n      {\n        if (cur_h + dy >= 0 && cur_h + dy < height &&\n            cur_w + dx >= 0 && cur_w + dx < width &&\n            abs(cur_inv_h_data - (cur_h + dy)) < 1 &&\n            abs(cur_inv_w_data - (cur_w + dx)) < 1)\n        {\n          int cur_bottom_grad_pos = ((b * channels + c) * height + cur_h + dy) * width + cur_w + dx;\n          float weight = dmcn_get_gradient_weight(cur_inv_h_data, cur_inv_w_data, cur_h + dy, cur_w + dx, height, width);\n          atomicAdd(grad_im + cur_bottom_grad_pos, weight * cur_top_grad);\n        }\n      }\n    }\n  }\n}\n\n__global__ void modulated_deformable_col2im_coord_gpu_kernel(const int n,\n                                                             const float *data_col, const float *data_im,\n                                                           
  const float *data_offset, const float *data_mask,\n                                                             const int channels, const int height, const int width,\n                                                             const int kernel_h, const int kernel_w,\n                                                             const int pad_h, const int pad_w,\n                                                             const int stride_h, const int stride_w,\n                                                             const int dilation_h, const int dilation_w,\n                                                             const int channel_per_deformable_group,\n                                                             const int batch_size, const int offset_channels, const int deformable_group,\n                                                             const int height_col, const int width_col,\n                                                             float *grad_offset, float *grad_mask)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    float val = 0, mval = 0;\n    int w = index % width_col;\n    int h = (index / width_col) % height_col;\n    int c = (index / width_col / height_col) % offset_channels;\n    int b = (index / width_col / height_col) / offset_channels;\n    // compute the start and end of the output\n\n    const int deformable_group_index = c / (2 * kernel_h * kernel_w);\n    const int col_step = kernel_h * kernel_w;\n    int cnt = 0;\n    const float *data_col_ptr = data_col + deformable_group_index * channel_per_deformable_group * batch_size * width_col * height_col;\n    const float *data_im_ptr = data_im + (b * deformable_group + deformable_group_index) * channel_per_deformable_group / kernel_h / kernel_w * height * width;\n    const float *data_offset_ptr = data_offset + (b * deformable_group + deformable_group_index) * 2 * kernel_h * kernel_w * height_col * width_col;\n    const float *data_mask_ptr = data_mask + (b * deformable_group 
+ deformable_group_index) * kernel_h * kernel_w * height_col * width_col;\n\n    const int offset_c = c - deformable_group_index * 2 * kernel_h * kernel_w;\n\n    for (int col_c = (offset_c / 2); col_c < channel_per_deformable_group; col_c += col_step)\n    {\n      const int col_pos = (((col_c * batch_size + b) * height_col) + h) * width_col + w;\n      const int bp_dir = offset_c % 2;\n\n      int j = (col_pos / width_col / height_col / batch_size) % kernel_w;\n      int i = (col_pos / width_col / height_col / batch_size / kernel_w) % kernel_h;\n      int w_out = col_pos % width_col;\n      int h_out = (col_pos / width_col) % height_col;\n      int w_in = w_out * stride_w - pad_w;\n      int h_in = h_out * stride_h - pad_h;\n      const int data_offset_h_ptr = (((2 * (i * kernel_w + j)) * height_col + h_out) * width_col + w_out);\n      const int data_offset_w_ptr = (((2 * (i * kernel_w + j) + 1) * height_col + h_out) * width_col + w_out);\n      const int data_mask_hw_ptr = (((i * kernel_w + j) * height_col + h_out) * width_col + w_out);\n      const float offset_h = data_offset_ptr[data_offset_h_ptr];\n      const float offset_w = data_offset_ptr[data_offset_w_ptr];\n      const float mask = data_mask_ptr[data_mask_hw_ptr];\n      float inv_h = h_in + i * dilation_h + offset_h;\n      float inv_w = w_in + j * dilation_w + offset_w;\n      if (inv_h <= -1 || inv_w <= -1 || inv_h >= height || inv_w >= width)\n      {\n        inv_h = inv_w = -2;\n      }\n      else\n      {\n        mval += data_col_ptr[col_pos] * dmcn_im2col_bilinear(data_im_ptr + cnt * height * width, width, height, width, inv_h, inv_w);\n      }\n      const float weight = dmcn_get_coordinate_weight(\n          inv_h, inv_w,\n          height, width, data_im_ptr + cnt * height * width, width, bp_dir);\n      val += weight * data_col_ptr[col_pos] * mask;\n      cnt += 1;\n    }\n    // KERNEL_ASSIGN(grad_offset[index], offset_req, val);\n    grad_offset[index] = val;\n    if (offset_c % 2 == 
0)\n      // KERNEL_ASSIGN(grad_mask[(((b * deformable_group + deformable_group_index) * kernel_h * kernel_w + offset_c / 2) * height_col + h) * width_col + w], mask_req, mval);\n      grad_mask[(((b * deformable_group + deformable_group_index) * kernel_h * kernel_w + offset_c / 2) * height_col + h) * width_col + w] = mval;\n  }\n}\n\nvoid modulated_deformable_im2col_cuda(cudaStream_t stream,\n  const float* data_im, const float* data_offset, const float* data_mask,\n  const int batch_size, const int channels, const int height_im, const int width_im, \n  const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n  const int pad_h, const int pad_w, const int stride_h, const int stride_w, \n  const int dilation_h, const int dilation_w,\n  const int deformable_group, float* data_col) {\n  // num_axes should be smaller than block size\n  const int channel_per_deformable_group = channels / deformable_group;\n  const int num_kernels = channels * batch_size * height_col * width_col;\n  modulated_deformable_im2col_gpu_kernel\n      <<<GET_BLOCKS(num_kernels), CUDA_NUM_THREADS,\n          0, stream>>>(\n      num_kernels, data_im, data_offset, data_mask, height_im, width_im, kernel_h, kernel_w,\n      pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w, channel_per_deformable_group,\n      batch_size, channels, deformable_group, height_col, width_col, data_col);\n  \n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in modulated_deformable_im2col_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n\n}\n\nvoid modulated_deformable_col2im_cuda(cudaStream_t stream,\n  const float* data_col, const float* data_offset, const float* data_mask,\n  const int batch_size, const int channels, const int height_im, const int width_im, \n  const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n  const int pad_h, const int pad_w, const int stride_h, const int stride_w, \n  const int dilation_h, 
const int dilation_w, \n  const int deformable_group, float* grad_im){\n\n  const int channel_per_deformable_group = channels / deformable_group;\n  const int num_kernels = channels * kernel_h * kernel_w * batch_size * height_col * width_col;\n  modulated_deformable_col2im_gpu_kernel\n      <<<GET_BLOCKS(num_kernels), CUDA_NUM_THREADS,\n          0, stream>>>(\n        num_kernels, data_col, data_offset, data_mask, channels, height_im, width_im,\n        kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w,\n        dilation_h, dilation_w, channel_per_deformable_group,\n        batch_size, deformable_group, height_col, width_col, grad_im);\n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in modulated_deformable_col2im_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n\n}\n\nvoid modulated_deformable_col2im_coord_cuda(cudaStream_t stream,\n  const float* data_col, const float* data_im, const float* data_offset, const float* data_mask,\n  const int batch_size, const int channels, const int height_im, const int width_im, \n  const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n  const int pad_h, const int pad_w, const int stride_h, const int stride_w, \n  const int dilation_h, const int dilation_w, \n  const int deformable_group,\n  float* grad_offset, float* grad_mask) {\n  const int num_kernels = batch_size * height_col * width_col * 2 * kernel_h * kernel_w * deformable_group;\n  const int channel_per_deformable_group = channels * kernel_h * kernel_w / deformable_group;\n  modulated_deformable_col2im_coord_gpu_kernel\n      <<<GET_BLOCKS(num_kernels), CUDA_NUM_THREADS,\n        0, stream>>>(\n        num_kernels, data_col, data_im, data_offset, data_mask, channels, height_im, width_im,\n        kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w,\n        dilation_h, dilation_w, channel_per_deformable_group,\n        batch_size, 2 * kernel_h * kernel_w * deformable_group, deformable_group, 
height_col, width_col, \n        grad_offset, grad_mask);\n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in modulated_deformable_col2im_coord_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n}"
  },
  {
    "path": "centernet/dcnv2Plugin/dcn_v2_im2col_cuda.h",
    "content": "/*!\n ******************* BEGIN Caffe Copyright Notice and Disclaimer ****************\n *\n * COPYRIGHT\n *\n * All contributions by the University of California:\n * Copyright (c) 2014-2017 The Regents of the University of California (Regents)\n * All rights reserved.\n *\n * All other contributions:\n * Copyright (c) 2014-2017, the respective contributors\n * All rights reserved.\n *\n * Caffe uses a shared copyright model: each contributor holds copyright over\n * their contributions to Caffe. The project versioning records all such\n * contribution and copyright details. If a contributor wants to further mark\n * their specific copyright on a particular contribution, they should indicate\n * their copyright solely in the commit message of the change when it is\n * committed.\n *\n * LICENSE\n *\n * Redistribution and use in source and binary forms, with or without\n * modification, are permitted provided that the following conditions are met:\n *\n * 1. Redistributions of source code must retain the above copyright notice, this\n * list of conditions and the following disclaimer.\n * 2. Redistributions in binary form must reproduce the above copyright notice,\n * this list of conditions and the following disclaimer in the documentation\n * and/or other materials provided with the distribution.\n *\n * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND\n * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED\n * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\n * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR\n * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES\n * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;\n * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND\n * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT\n * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS\n * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n *\n * CONTRIBUTION AGREEMENT\n *\n * By contributing to the BVLC/caffe repository through pull-request, comment,\n * or otherwise, the contributor releases their content to the\n * license and copyright terms herein.\n *\n ***************** END Caffe Copyright Notice and Disclaimer ********************\n *\n * Copyright (c) 2018 Microsoft\n * Licensed under The MIT License [see LICENSE for details]\n * \\file modulated_deformable_im2col.h\n * \\brief Function definitions of converting an image to\n * column matrix based on kernel, padding, dilation, and offset.\n * These functions are mainly used in deformable convolution operators.\n * \\ref: https://arxiv.org/abs/1811.11168\n * \\author Yuwen Xiong, Haozhi Qi, Jifeng Dai, Xizhou Zhu, Han Hu\n */\n\n/***************** Adapted by Charles Shang *********************/\n\n#ifndef DCN_V2_IM2COL_CUDA\n#define DCN_V2_IM2COL_CUDA\n\n// #ifdef __cplusplus\n// extern \"C\"\n// {\n// #endif\n\n  void modulated_deformable_im2col_cuda(cudaStream_t stream,\n                                        const float *data_im, const float *data_offset, const float *data_mask,\n                                        const int batch_size, const int channels, const int height_im, const int width_im,\n                                        const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n                                        const int pad_h, const int pad_w, const int 
stride_h, const int stride_w,\n                                        const int dilation_h, const int dilation_w,\n                                        const int deformable_group, float *data_col);\n\n  void modulated_deformable_col2im_cuda(cudaStream_t stream,\n                                        const float *data_col, const float *data_offset, const float *data_mask,\n                                        const int batch_size, const int channels, const int height_im, const int width_im,\n                                        const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n                                        const int pad_h, const int pad_w, const int stride_h, const int stride_w,\n                                        const int dilation_h, const int dilation_w,\n                                        const int deformable_group, float *grad_im);\n\n  void modulated_deformable_col2im_coord_cuda(cudaStream_t stream,\n                                         const float *data_col, const float *data_im, const float *data_offset, const float *data_mask,\n                                         const int batch_size, const int channels, const int height_im, const int width_im,\n                                         const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n                                         const int pad_h, const int pad_w, const int stride_h, const int stride_w,\n                                         const int dilation_h, const int dilation_w,\n                                         const int deformable_group,\n                                         float *grad_offset, float *grad_mask);\n\n// #ifdef __cplusplus\n// }\n// #endif\n\n#endif"
  },
  {
    "path": "centernet/dcnv2Plugin/dcnv2Plugin.cpp",
    "content": "#include \"dcnv2Plugin.h\"\n#include <iostream>\n\nusing namespace nvinfer1;\nusing nvinfer1::plugin::DeformableConvolutionalLayer;\nusing nvinfer1::plugin::DCNv2PluginCreator;\n\nnamespace\n{\nconst char* DCNv2_PLUGIN_VERSION{\"1\"};\nconst char* DCNv2_PLUGIN_NAME{\"DCNv2_TRT\"};\n} // namespace\n\n#define CHECK_CUDA(call)                                                                                               \\\n    do                                                                                                                 \\\n    {                                                                                                                  \\\n        cudaError_t status = call;                                                                                     \\\n        if (status != cudaSuccess)                                                                                     \\\n        {                                                                                                              \\\n            return status;                                                                                             \\\n        }                                                                                                              \\\n    } while (0)\n\nPluginFieldCollection DCNv2PluginCreator::mFC{};\nstd::vector<PluginField> DCNv2PluginCreator::mPluginAttributes;\n\n// Parameterized constructor\nDeformableConvolutionalLayer::DeformableConvolutionalLayer(\n                         int out_channels,\n                         int kernel,\n                         int deformable_group,\n                         int dilation,\n                         int padding,\n                         int stride,\n                         const Weights* weight, const Weights* bias):\n                         out_channels(out_channels),kernel_size(kernel),deformable_group(deformable_group),\n                         
dilation(dilation),padding(padding),stride(stride){\n        mWeight = copyToDevice(weight[0].values, weight[0].count);\n        mBias = copyToDevice(bias[0].values, bias[0].count);\n}\n\nDeformableConvolutionalLayer::DeformableConvolutionalLayer(const void* buffer, size_t length)\n{\n    const char* d = static_cast<const char*>(buffer);\n    const char* a = d;\n    in_channels = read<int>(d);\n    height = read<int>(d);\n    width = read<int>(d);\n    height_out = read<int>(d);\n    width_out = read<int>(d);\n\n    out_channels = read<int>(d);\n    kernel_size = read<int>(d);\n    deformable_group = read<int>(d);\n    dilation = read<int>(d);\n    padding = read<int>(d);\n    stride = read<int>(d);\n\n    int count = read<int>(d);\n    mWeight = deserializeToDevice(d, count);\n    count = read<int>(d);\n    mBias = deserializeToDevice(d, count);\n\n    ASSERT(d == a + length);\n}\n\nint DeformableConvolutionalLayer::getNbOutputs() const\n{\n    // Plugin layer has 1 output\n    return 1;\n}\n\nint DeformableConvolutionalLayer::initialize()\n{\n    // oneSize is a byte count; the host vector only needs height_out * width_out elements.\n    size_t oneSize = height_out * width_out * sizeof(float);\n    std::vector<float> one_(height_out * width_out, 1.0f);\n    CHECK_CUDA(cudaMalloc((void**)&mOne, oneSize));\n    CHECK_CUDA(cudaMalloc((void**)&mColumn, in_channels * kernel_size * kernel_size * oneSize));\n    CHECK_CUDA(cudaMemcpy(mOne, one_.data(), oneSize, cudaMemcpyHostToDevice));\n    return STATUS_SUCCESS; \n}\n\nDims DeformableConvolutionalLayer::getOutputDimensions(int index, const Dims* inputs, int nbInputs)\n{\n    ASSERT(index == 0);\n    ASSERT(nbInputs == 3);\n\n    in_channels = inputs[0].d[0];\n    height = inputs[0].d[1];\n    width = inputs[0].d[2];\n    height_out = (inputs[0].d[1] + 2 * padding - (dilation * (kernel_size - 1) + 1)) / stride + 1;\n    width_out = (inputs[0].d[2] + 2 * padding - (dilation * (kernel_size - 1) + 1)) / stride + 1;\n\n    return Dims3(out_channels, height_out, width_out);\n}\n\nsize_t 
DeformableConvolutionalLayer::getWorkspaceSize(int maxBatchSize) const\n{\n    return 0;\n}\n\nint DeformableConvolutionalLayer::enqueue(int batchSize, const void* const* inputs, void** outputs, void* workspace, cudaStream_t stream)\n{\n    const float* input = static_cast<const float *>(inputs[0]);\n    const float* offset = static_cast<const float *>(inputs[1]);\n    const float* offset_mask = static_cast<const float *>(inputs[2]);\n    const float* mask = offset_mask + deformable_group * 2 * kernel_size * kernel_size * height * width;\n    float * output = static_cast<float *>(outputs[0]);\n\n    float alpha{1}, beta{0};\n\n    // Run the cuBLAS calls on the same stream as the im2col kernel.\n    cublasSetStream(mCublas, stream);\n\n    // Do Bias first:\n    // M,N,K are dims of matrix A and B\n    // (see http://docs.nvidia.com/cuda/cublas/#cublas-lt-t-gt-gemm)\n    // (N x 1) (1 x M)\n    int m_ = out_channels;\n    int n_ = height_out * width_out;\n    int k_ = 1;\n    cublasSgemm(mCublas, CUBLAS_OP_T, CUBLAS_OP_N, n_, m_, k_, &alpha,\n                mOne, k_,\n                static_cast<const float *>(mBias.values), k_, &beta,\n                output, n_);\n\n    modulated_deformable_im2col_cuda(stream, input, offset, mask,\n                                    1, in_channels, height, width,\n                                    height_out, width_out, kernel_size, kernel_size,\n                                    padding, padding, stride, stride, dilation, dilation,\n                                    deformable_group, mColumn); \n\n    //(k * m)  x  (m * n)\n    // Y = WC\n    int m = out_channels;\n    int n = height_out * width_out;\n    int k = in_channels * kernel_size * kernel_size;\n    cublasSgemm(mCublas, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha,\n                mColumn, n,\n                static_cast<const float *>(mWeight.values), k, &alpha,\n                output, n);\n    \n    return 0;\n}\n\nsize_t DeformableConvolutionalLayer::getSerializationSize() const\n{\n    return sizeof(int) * 13 + (mWeight.count + mBias.count) * sizeof(float);\n}\n\nvoid 
DeformableConvolutionalLayer::serialize(void* buffer) const\n{\n    char *d = reinterpret_cast<char*>(buffer), *a = d;\n    write(d, in_channels);\n    write(d, height);\n    write(d, width);\n    write(d, height_out);\n    write(d, width_out);\n\n    write(d, out_channels);    \n    write(d, kernel_size);\n    write(d, deformable_group);\n    write(d, dilation);\n    write(d, padding);\n    write(d, stride);\n\n    write(d, (int) mWeight.count);\n    serializeFromDevice(d, mWeight);\n    write(d, (int) mBias.count);\n    serializeFromDevice(d, mBias);\n\n    ASSERT(d == a + getSerializationSize());\n}\n\nbool DeformableConvolutionalLayer::supportsFormat(DataType type, PluginFormat format) const\n{\n    return (type == DataType::kFLOAT && format == PluginFormat::kNCHW);\n}\n\nWeights DeformableConvolutionalLayer::copyToDevice(const void* hostData, size_t count)\n{\n    void* deviceData;\n    CUASSERT(cudaMalloc(&deviceData, count * sizeof(float)));\n    CUASSERT(cudaMemcpy(deviceData, hostData, count * sizeof(float), cudaMemcpyHostToDevice));\n    return Weights{DataType::kFLOAT, deviceData, int64_t(count)};\n}\n\nvoid DeformableConvolutionalLayer::serializeFromDevice(char*& hostBuffer, Weights deviceWeights) const\n{\n    CUASSERT(cudaMemcpy(hostBuffer, deviceWeights.values, deviceWeights.count * sizeof(float), cudaMemcpyDeviceToHost));\n    hostBuffer += deviceWeights.count * sizeof(float);\n}\n\nWeights DeformableConvolutionalLayer::deserializeToDevice(const char*& hostBuffer, size_t count)\n{\n    Weights w = copyToDevice(hostBuffer, count);\n    hostBuffer += count * sizeof(float);\n    return w;\n}\n\nconst char* DeformableConvolutionalLayer::getPluginType() const\n{\n    return DCNv2_PLUGIN_NAME;\n}\n\nconst char* DeformableConvolutionalLayer::getPluginVersion() const\n{\n    return DCNv2_PLUGIN_VERSION;\n}\n\nvoid DeformableConvolutionalLayer::terminate() {\n        if (mOne)\n        {\n            cudaFree(mOne);\n            mOne = nullptr;\n        }\n  
      if (mColumn)\n        {\n            cudaFree(mColumn);\n            mColumn = nullptr;\n        }\n}\n\nvoid DeformableConvolutionalLayer::destroy()\n{\n    delete this;\n}\n\nIPluginV2Ext* DeformableConvolutionalLayer::clone() const\n{\n    IPluginV2Ext* plugin = new DeformableConvolutionalLayer(*this);\n    plugin->setPluginNamespace(mPluginNamespace.c_str());\n    return plugin;\n}\n\n// Set plugin namespace\nvoid DeformableConvolutionalLayer::setPluginNamespace(const char* pluginNamespace)\n{\n    mPluginNamespace = pluginNamespace;\n}\n\nconst char* DeformableConvolutionalLayer::getPluginNamespace() const\n{\n    return mPluginNamespace.c_str();\n}\n\n// Return the DataType of the plugin output at the requested index.\nDataType DeformableConvolutionalLayer::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const\n{\n    // Only DataType::kFLOAT is accepted by the plugin layer\n    return DataType::kFLOAT;\n}\n// Return true if output tensor is broadcast across a batch.\nbool DeformableConvolutionalLayer::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const\n{\n    return false;\n}\n\n// Return true if plugin can use input that is broadcast across batch without replication.\nbool DeformableConvolutionalLayer::canBroadcastInputAcrossBatch(int inputIndex) const\n{\n    return false;\n}\n\n// Configure the layer with input and output data types.\n// inputDims: input dimensions for the plugin layer\n// nbInputs: number of inputs to the plugin layer\n// outputDims: output dimensions from the plugin layer\n// nbOutputs: number of outputs from the plugin layer\n// inputTypes, outputTypes: DataType configuration for the plugin layer\n// floatFormat: format NCHW, NHWC etc\n// maxBatchSize: maximum batch size for the plugin layer\nvoid DeformableConvolutionalLayer::configurePlugin(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs,\n    const DataType* inputTypes, const DataType* 
outputTypes, const bool* inputIsBroadcast,\n    const bool* outputIsBroadcast, PluginFormat floatFormat, int maxBatchSize)\n{\n    ASSERT(*inputTypes == DataType::kFLOAT && floatFormat == PluginFormat::kNCHW);\n}\n\n// Attach the plugin object to an execution context and grant the plugin the access to some context resource.\nvoid DeformableConvolutionalLayer::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator)\n{\n    mCublas = cublasContext;\n}\n\n// Detach the plugin object from its execution context.\nvoid DeformableConvolutionalLayer::detachFromContext() {}\n\nDCNv2PluginCreator::DCNv2PluginCreator()\n{\n    mPluginAttributes.emplace_back(PluginField(\"out_channels\", nullptr, PluginFieldType::kINT32, 1));\n    mPluginAttributes.emplace_back(PluginField(\"kernel\", nullptr, PluginFieldType::kINT32, 1));\n    mPluginAttributes.emplace_back(PluginField(\"deformable_group\", nullptr, PluginFieldType::kINT32, 1));\n    mPluginAttributes.emplace_back(PluginField(\"dilation\", nullptr, PluginFieldType::kINT32, 1));\n    mPluginAttributes.emplace_back(PluginField(\"padding\", nullptr, PluginFieldType::kINT32, 1));\n    mPluginAttributes.emplace_back(PluginField(\"stride\", nullptr, PluginFieldType::kINT32, 1));\n    mPluginAttributes.emplace_back(PluginField(\"weight\", nullptr, PluginFieldType::kFLOAT32, 1));\n    mPluginAttributes.emplace_back(PluginField(\"bias\", nullptr, PluginFieldType::kFLOAT32, 1));\n\n    mFC.nbFields = mPluginAttributes.size();\n    mFC.fields = mPluginAttributes.data();\n}\n\nconst char* DCNv2PluginCreator::getPluginName() const\n{\n    return DCNv2_PLUGIN_NAME;\n}\n\nconst char* DCNv2PluginCreator::getPluginVersion() const\n{\n    return DCNv2_PLUGIN_VERSION;\n}\n\nconst PluginFieldCollection* DCNv2PluginCreator::getFieldNames()\n{\n    return &mFC;\n}\n\nIPluginV2Ext* DCNv2PluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc)\n{\n    std::vector<float> weight;\n   
 std::vector<float> bias;\n    int out_channels, kernel, deformable_group, padding, stride, dilation;\n    const PluginField* fields = fc->fields;\n    for (int i = 0; i < fc->nbFields; ++i)\n    {\n        const char* attrName = fields[i].name;\n        if (!strcmp(attrName, \"out_channels\"))\n        {\n            ASSERT(fields[i].type == PluginFieldType::kINT32);\n            out_channels = *(static_cast<const int*>(fields[i].data));\n        }\n        else if (!strcmp(attrName, \"kernel\"))\n        {\n            ASSERT(fields[i].type == PluginFieldType::kINT32);\n            kernel = *(static_cast<const int*>(fields[i].data));\n        }\n        else if (!strcmp(attrName, \"deformable_group\"))\n        {\n            ASSERT(fields[i].type == PluginFieldType::kINT32);\n            deformable_group = *(static_cast<const int*>(fields[i].data));\n        }\n        else if (!strcmp(attrName, \"dilation\"))\n        {\n            ASSERT(fields[i].type == PluginFieldType::kINT32);\n            dilation = *(static_cast<const int*>(fields[i].data));\n        }\n        else if (!strcmp(attrName, \"stride\"))\n        {\n            ASSERT(fields[i].type == PluginFieldType::kINT32);\n            stride = *(static_cast<const int*>(fields[i].data));\n        }\n        else if (!strcmp(attrName, \"padding\"))\n        {\n            ASSERT(fields[i].type == PluginFieldType::kINT32);\n            padding = *(static_cast<const int*>(fields[i].data));\n        }\n        else if (!strcmp(attrName, \"weight\"))\n        {\n            ASSERT(fields[i].type == PluginFieldType::kFLOAT32);\n            int size = fields[i].length;\n            weight.reserve(size);\n            const auto* w = static_cast<const float*>(fields[i].data);\n            for (int j = 0; j < size; j++)\n            {\n                weight.push_back(*w);\n                w++;\n            }\n        }\n        else if (!strcmp(attrName, \"bias\"))\n        {\n            ASSERT(fields[i].type 
== PluginFieldType::kFLOAT32);\n            int size = fields[i].length;\n            bias.reserve(size);\n            const auto* w = static_cast<const float*>(fields[i].data);\n            for (int j = 0; j < size; j++)\n            {\n                bias.push_back(*w);\n                w++;\n            }\n        }\n    }\n\n    Weights mWeight{DataType::kFLOAT, weight.data(), (int64_t) weight.size()};\n    Weights mBias{DataType::kFLOAT, bias.data(), (int64_t) bias.size()};\n\n    DeformableConvolutionalLayer* obj = new DeformableConvolutionalLayer(out_channels,\n                         kernel,\n                         deformable_group,\n                         dilation,\n                         padding,\n                         stride,\n                         &mWeight, &mBias);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\nIPluginV2Ext* DCNv2PluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength)\n{\n    // This object will be deleted when the network is destroyed, which will\n    // call DeformableConvolutionalLayer::destroy()\n    DeformableConvolutionalLayer* obj = new DeformableConvolutionalLayer(serialData, serialLength);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}"
  },
  {
    "path": "centernet/dcnv2Plugin/dcnv2Plugin.h",
    "content": "/*\n * Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n#ifndef TRT_DCNV2_PLUGIN_H\n#define TRT_DCNV2_PLUGIN_H\n#include \"kernel.h\"\n#include \"plugin.h\"\n#include \"dcn_v2_im2col_cuda.h\"\n\n#include \"serialize.hpp\"\n#include <cudnn.h>\n#include <vector>\n#include <cublas_v2.h>\n#include <cuda.h>\n#include <string>\n#include <vector>\n\nusing namespace nvinfer1::plugin;\nnamespace nvinfer1\n{\nnamespace plugin\n{\n\nclass DeformableConvolutionalLayer : public IPluginV2Ext\n{\npublic:\n    DeformableConvolutionalLayer(int out_channels,\n                         int kernel,\n                         int deformable_group,\n                         int dilation,\n                         int padding,\n                         int stride,\n                         const Weights* weight, const Weights* bias);\n\n    DeformableConvolutionalLayer(const void* buffer, size_t length);\n\n    ~DeformableConvolutionalLayer() override = default;\n\n    int getNbOutputs() const override;\n\n    Dims getOutputDimensions(int index, const Dims* inputs, int nbInputs) override;\n\n    int initialize() override;\n\n    void terminate() override;\n\n    size_t getWorkspaceSize(int maxBatchSize) const override;\n\n    int enqueue(\n        int batchSize, const void* const* inputs, void** outputs, void* workspace, cudaStream_t stream) override;\n\n    size_t getSerializationSize() const 
override;\n\n    void serialize(void* buffer) const override;\n\n    bool supportsFormat(DataType type, PluginFormat format) const override;\n\n    const char* getPluginType() const override;\n\n    const char* getPluginVersion() const override;\n\n    void destroy() override;\n\n    IPluginV2Ext* clone() const override;\n\n    void setPluginNamespace(const char* pluginNamespace) override;\n\n    const char* getPluginNamespace() const override;\n\n    DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const override;\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const override;\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const override;\n\n    void attachToContext(\n        cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) override;\n\n    void configurePlugin(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs,\n        const DataType* inputTypes, const DataType* outputTypes, const bool* inputIsBroadcast,\n        const bool* outputIsBroadcast, PluginFormat floatFormat, int maxBatchSize) override;\n\n    void detachFromContext() override;\n\nprivate:\n    Weights copyToDevice(const void* hostData, size_t count);\n    void serializeFromDevice(char*& hostBuffer, Weights deviceWeights) const;\n    Weights deserializeToDevice(const char*& hostBuffer, size_t count);\n\n    std::string mPluginNamespace;\n\n    int in_channels{};\n    int height_out{};\n    int width_out{};\n    int height{};\n    int width{};\n\n    int out_channels{};\n    int kernel_size{};\n    int deformable_group{};\n    int dilation{};\n    int padding{};\n    int stride{};\n\n    Weights mWeight{};\n    Weights mBias{};\n\n    float* mOne;\n    float* mColumn;\n\n    cublasHandle_t mCublas;\n};\n\nclass DCNv2PluginCreator : public BaseCreator\n{\npublic:\n    DCNv2PluginCreator();\n\n    ~DCNv2PluginCreator() override = 
default;\n\n    const char* getPluginName() const override;\n\n    const char* getPluginVersion() const override;\n\n    const PluginFieldCollection* getFieldNames() override;\n\n    IPluginV2Ext* createPlugin(const char* name, const PluginFieldCollection* fc) override;\n\n    IPluginV2Ext* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;\n\nprivate:\n    static PluginFieldCollection mFC;\n\n    // Parameters for DeformableConvolutionalLayer\n    static std::vector<PluginField> mPluginAttributes;\n};\n} // namespace plugin\n} // namespace nvinfer1\n\n#endif // TRT_DCNV2_PLUGIN_H\n"
  },
  {
    "path": "centernet/sample/common.py",
    "content": "#\n# Copyright 1993-2020 NVIDIA Corporation.  All rights reserved.\n#\n# NOTICE TO LICENSEE:\n#\n# This source code and/or documentation (\"Licensed Deliverables\") are\n# subject to NVIDIA intellectual property rights under U.S. and\n# international Copyright laws.\n#\n# These Licensed Deliverables contained herein is PROPRIETARY and\n# CONFIDENTIAL to NVIDIA and is being provided under the terms and\n# conditions of a form of NVIDIA software license agreement by and\n# between NVIDIA and Licensee (\"License Agreement\") or electronically\n# accepted by Licensee.  Notwithstanding any terms or conditions to\n# the contrary in the License Agreement, reproduction or disclosure\n# of the Licensed Deliverables to any third party without the express\n# written consent of NVIDIA is prohibited.\n#\n# NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE\n# LICENSE AGREEMENT, NVIDIA MAKES NO REPRESENTATION ABOUT THE\n# SUITABILITY OF THESE LICENSED DELIVERABLES FOR ANY PURPOSE.  IT IS\n# PROVIDED \"AS IS\" WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND.\n# NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THESE LICENSED\n# DELIVERABLES, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY,\n# NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.\n# NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE\n# LICENSE AGREEMENT, IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY\n# SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY\n# DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,\n# WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS\n# ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE\n# OF THESE LICENSED DELIVERABLES.\n#\n# U.S. Government End Users.  These Licensed Deliverables are a\n# \"commercial item\" as that term is defined at 48 C.F.R. 
2.101 (OCT\n# 1995), consisting of \"commercial computer software\" and \"commercial\n# computer software documentation\" as such terms are used in 48\n# C.F.R. 12.212 (SEPT 1995) and is provided to the U.S. Government\n# only as a commercial end item.  Consistent with 48 C.F.R.12.212 and\n# 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all\n# U.S. Government End Users acquire the Licensed Deliverables with\n# only those rights set forth herein.\n#\n# Any use of the Licensed Deliverables in individual and commercial\n# software must include, in the user documentation and internal\n# comments to the code, the above Disclaimer and U.S. Government End\n# Users Notice.\n#\n\nfrom itertools import chain\nimport argparse\nimport os\n\nimport pycuda.driver as cuda\nimport pycuda.autoinit\nimport numpy as np\n\nimport tensorrt as trt\n\ntry:\n    # Sometimes python2 does not understand FileNotFoundError\n    FileNotFoundError\nexcept NameError:\n    FileNotFoundError = IOError\n\nEXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)\n\ndef GiB(val):\n    return val * 1 << 30\n\n\ndef add_help(description):\n    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n    args, _ = parser.parse_known_args()\n\n\ndef find_sample_data(description=\"Runs a TensorRT Python sample\", subfolder=\"\", find_files=[], err_msg=\"\"):\n    '''\n    Parses sample arguments.\n\n    Args:\n        description (str): Description of the sample.\n        subfolder (str): The subfolder containing data relevant to this sample\n        find_files (str): A list of filenames to find. 
Each filename will be replaced with an absolute path.\n\n    Returns:\n        Tuple[List[str], List[str]]: The data directories, and the absolute paths of the files in find_files.\n    '''\n\n    # Standard command-line arguments for all samples.\n    kDEFAULT_DATA_ROOT = os.path.join(os.sep, \"usr\", \"src\", \"tensorrt\", \"data\")\n    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n    parser.add_argument(\"-d\", \"--datadir\", help=\"Location of the TensorRT sample data directory, and any additional data directories.\", action=\"append\", default=[kDEFAULT_DATA_ROOT])\n    args, _ = parser.parse_known_args()\n\n    def get_data_path(data_dir):\n        # If the subfolder exists, append it to the path, otherwise use the provided path as-is.\n        data_path = os.path.join(data_dir, subfolder)\n        if not os.path.exists(data_path):\n            if data_dir != kDEFAULT_DATA_ROOT:\n                print(\"WARNING: \" + data_path + \" does not exist. Trying \" + data_dir + \" instead.\")\n            data_path = data_dir\n        # Make sure data directory exists.\n        if not (os.path.exists(data_path)) and data_dir != kDEFAULT_DATA_ROOT:\n            print(\"WARNING: {:} does not exist. 
Please provide the correct data path with the -d option.\".format(data_path))\n        return data_path\n\n    data_paths = [get_data_path(data_dir) for data_dir in args.datadir]\n    return data_paths, locate_files(data_paths, find_files, err_msg)\n\ndef locate_files(data_paths, filenames, err_msg=\"\"):\n    \"\"\"\n    Locates the specified files in the specified data directories.\n    If a file exists in multiple data directories, the first directory is used.\n\n    Args:\n        data_paths (List[str]): The data directories.\n        filenames (List[str]): The names of the files to find.\n        err_msg (str): Extra text appended to the error message if a file is not found.\n\n    Returns:\n        List[str]: The absolute paths of the files.\n\n    Raises:\n        FileNotFoundError if a file could not be located.\n    \"\"\"\n    found_files = [None] * len(filenames)\n    for data_path in data_paths:\n        # Find all requested files.\n        for index, (found, filename) in enumerate(zip(found_files, filenames)):\n            if not found:\n                file_path = os.path.abspath(os.path.join(data_path, filename))\n                if os.path.exists(file_path):\n                    found_files[index] = file_path\n\n    # Check that all files were found\n    for f, filename in zip(found_files, filenames):\n        if not f or not os.path.exists(f):\n            raise FileNotFoundError(\"Could not find {:}. Searched in data paths: {:}\\n{:}\".format(filename, data_paths, err_msg))\n    return found_files\n\n# Simple helper data class that's a little nicer to use than a 2-tuple.\nclass HostDeviceMem(object):\n    def __init__(self, host_mem, device_mem):\n        self.host = host_mem\n        self.device = device_mem\n\n    def __str__(self):\n        return \"Host:\\n\" + str(self.host) + \"\\nDevice:\\n\" + str(self.device)\n\n    def __repr__(self):\n        return self.__str__()\n\n# Allocates all buffers required for an engine, i.e. 
host/device inputs/outputs.\ndef allocate_buffers(engine):\n    inputs = []\n    outputs = []\n    bindings = []\n    stream = cuda.Stream()\n    for binding in engine:\n        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n        dtype = trt.nptype(engine.get_binding_dtype(binding))\n        # Allocate host and device buffers\n        host_mem = cuda.pagelocked_empty(size, dtype)\n        device_mem = cuda.mem_alloc(host_mem.nbytes)\n        # Append the device buffer to device bindings.\n        bindings.append(int(device_mem))\n        # Append to the appropriate list.\n        if engine.binding_is_input(binding):\n            inputs.append(HostDeviceMem(host_mem, device_mem))\n        else:\n            outputs.append(HostDeviceMem(host_mem, device_mem))\n    return inputs, outputs, bindings, stream\n\n# This function is generalized for multiple inputs/outputs.\n# inputs and outputs are expected to be lists of HostDeviceMem objects.\ndef do_inference(context, bindings, inputs, outputs, stream, batch_size=1):\n    # Transfer input data to the GPU.\n    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]\n    # Run inference.\n    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)\n    # Transfer predictions back from the GPU.\n    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]\n    # Synchronize the stream\n    stream.synchronize()\n    # Return only the host outputs.\n    return [out.host for out in outputs]\n\n# This function is generalized for multiple inputs/outputs for full dimension networks.\n# inputs and outputs are expected to be lists of HostDeviceMem objects.\ndef do_inference_v2(context, bindings, inputs, outputs, stream):\n    # Transfer input data to the GPU.\n    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]\n    # Run inference.\n    context.execute_async_v2(bindings=bindings, 
stream_handle=stream.handle)\n    # Transfer predictions back from the GPU.\n    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]\n    # Synchronize the stream\n    stream.synchronize()\n    # Return only the host outputs.\n    return [out.host for out in outputs]\n\n\n# `retry_call` and `retry` are used to wrap the function we want to try multiple times\ndef retry_call(func, args=[], kwargs={}, n_retries=3):\n    \"\"\"Wrap a function to retry it several times.\n\n    Args:\n        func: function to call\n        args (List): args passed to func\n        kwargs (Dict): kwargs passed to func\n        n_retries (int): maximum number of tries\n\n    Returns:\n        The return value of the first successful call to func.\n    \"\"\"\n    for i_try in range(n_retries):\n        try:\n            return func(*args, **kwargs)\n        except Exception:\n            # Re-raise once the last attempt has failed.\n            if i_try == n_retries - 1:\n                raise\n            print(\"retry...\")\n\n# Usage: @retry(n_retries)\ndef retry(n_retries=3):\n    \"\"\"Wrap a function to retry it several times. Decorator version of `retry_call`.\n\n    Args:\n        n_retries (int): maximum number of tries\n\n    Usage:\n        @retry(n_retries)\n        def func(...):\n            pass\n    \"\"\"\n    def wrapper(func):\n        def _wrapper(*args, **kwargs):\n            return retry_call(func, args, kwargs, n_retries)\n        return _wrapper\n    return wrapper\n"
  },
  {
    "path": "centernet/sample/test.py",
    "content": "import cv2 as cv\nimport numpy as np\n\nimport tensorrt as trt\nimport common\n\nimport torch\nimport time\nfrom sys import argv\n\n# You can set the logger severity higher to suppress messages (or lower to display more messages).\nTRT_LOGGER = trt.Logger(trt.Logger.WARNING)\ntrt.init_libnvinfer_plugins(TRT_LOGGER, '')\n\n\ndef _gather_feat(feat, ind, mask=None):\n    dim = feat.size(2)\n    ind = ind.unsqueeze(2).expand(ind.size(0), ind.size(1), dim)\n    feat = feat.gather(1, ind)\n    if mask is not None:\n        mask = mask.unsqueeze(2).expand_as(feat)\n        feat = feat[mask]\n        feat = feat.view(-1, dim)\n    return feat\n\n\ndef _transpose_and_gather_feat(feat, ind):\n    feat = feat.permute(0, 2, 3, 1).contiguous()\n    feat = feat.view(feat.size(0), -1, feat.size(3))\n    feat = _gather_feat(feat, ind)\n    return feat\n\n\ndef pre_process(image):\n    long_size = max(image.shape)\n    img = np.zeros((long_size, long_size, 3))\n    img[:image.shape[0], :img.shape[1], :] = image[:]\n    img = cv.resize(img, (512,512))\n    inp_image = ((img / 255. 
- 0.5) / 0.5).astype(np.float32)\n    images = inp_image.transpose(2, 0, 1)\n    return images, long_size/512\n\n\ndef _nms(heat, kernel=3):\n    pad = (kernel - 1) // 2\n\n    hmax = torch.nn.functional.max_pool2d(\n        heat, (kernel, kernel), stride=1, padding=pad)\n    keep = (hmax == heat).float()\n    return heat * keep\n\n\ndef _topk(scores, K=40):\n    batch, cat, height, width = scores.size()\n\n    topk_scores, topk_inds = torch.topk(scores.view(batch, cat, -1), K)\n\n    topk_inds = topk_inds % (height * width)\n    topk_ys = (topk_inds.true_divide(width)).int().float()\n    topk_xs = (topk_inds % width).int().float()\n\n    topk_score, topk_ind = torch.topk(topk_scores.view(batch, -1), K)\n    topk_clses = (topk_ind.true_divide(K)).int()\n    topk_inds = _gather_feat(\n        topk_inds.view(batch, -1, 1), topk_ind).view(batch, K)\n    topk_ys = _gather_feat(topk_ys.view(batch, -1, 1), topk_ind).view(batch, K)\n    topk_xs = _gather_feat(topk_xs.view(batch, -1, 1), topk_ind).view(batch, K)\n\n    return topk_score, topk_inds, topk_clses, topk_ys, topk_xs\n\n\ndef ctdet_decode(heat, wh, reg=None, cat_spec_wh=False, K=100):\n    batch, cat, height, width = heat.size()\n\n    heat = torch.sigmoid(heat)\n    # perform nms on heatmaps\n    heat = _nms(heat)\n\n    scores, inds, clses, ys, xs = _topk(heat, K=K)\n    if reg is not None:\n        reg = _transpose_and_gather_feat(reg, inds)\n        reg = reg.view(batch, K, 2)\n        xs = xs.view(batch, K, 1) + reg[:, :, 0:1]\n        ys = ys.view(batch, K, 1) + reg[:, :, 1:2]\n    else:\n        xs = xs.view(batch, K, 1) + 0.5\n        ys = ys.view(batch, K, 1) + 0.5\n    wh = _transpose_and_gather_feat(wh, inds)\n    if cat_spec_wh:\n        wh = wh.view(batch, K, cat, 2)\n        clses_ind = clses.view(batch, K, 1, 1).expand(batch, K, 1, 2).long()\n        wh = wh.gather(2, clses_ind).view(batch, K, 2)\n    else:\n        wh = wh.view(batch, K, 2)\n    clses = clses.view(batch, K, 1).float()\n    scores 
= scores.view(batch, K, 1)\n    bboxes = torch.cat([xs - wh[..., 0:1] / 2,\n                        ys - wh[..., 1:2] / 2,\n                        xs + wh[..., 0:1] / 2,\n                        ys + wh[..., 1:2] / 2], dim=2)\n    detections = torch.cat([bboxes, scores, clses], dim=2)\n    return detections\n\n\nif __name__ == '__main__':\n    try:\n        engine_path = argv[1]\n        img_path = argv[2]\n    except IndexError:\n        print('engine path and image path are needed!')\n        exit()\n    with open(engine_path, \"rb\") as f, trt.Runtime(TRT_LOGGER) as runtime, runtime.deserialize_cuda_engine(f.read()) as engine:\n        inputs, outputs, bindings, stream = common.allocate_buffers(engine)\n        with engine.create_execution_context() as context:\n            # Read the image given on the command line.\n            img = cv.imread(img_path)\n            dis = img.copy()\n            img, s = pre_process(img)\n            # Copy to the pagelocked input buffer\n            np.copyto(inputs[0].host, img.ravel())\n            [hm, wh, reg] = common.do_inference(\n                context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream, batch_size=1)\n\n            [dets] = ctdet_decode(torch.from_numpy(hm.reshape(1, 80, 128, 128)), torch.from_numpy(\n                wh.reshape(1, 2, 128, 128)), torch.from_numpy(reg.reshape(1, 2, 128, 128)))\n\n            for i in dets:\n                if i[-2] > 0.5:\n                    i[:4] *= 4*s\n                    cv.rectangle(dis, (int(i[0]), int(\n                        i[1])), (int(i[2]), int(i[3])), 255, 1)\n                    cv.putText(dis, '%d' %\n                               int(i[-1]), (int(i[0]), int(i[1])), 1, 1, 255)\n\n            cv.imwrite('trt_out.jpg', dis)\n"
  },
  {
    "path": "contributing.md",
    "content": "# How to Contribute\n\n1. Fork this repo to your github account\n\n2. Clone your fork\n\n3. Create a feature branch\n\n4. Make changes, including but not limited to create new model, bug fix, documentation, tutorials, etc.\n\n5. Pre-commit check and push, we use clang-format to do coding style checking, and the coding style is following google c++ coding style with 4-space.\n\n```bash\npip install pre-commit clang-format\n\ncd tensorrtx\npre-commit install\ngit add [files-to-commit]\npre-commit run\n\n# fix pre-commit errors, then git add files-to-commit again\ngit add [files-to-commit]\n\ngit commit -m \"describe your commit\"\n\ngit push origin [feature-branch]\n```\n\n6. Submit a pull-request on github web UI to master branch of wang-xinyu/tensorrtx.\n"
  },
  {
    "path": "convnextv2/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\nproject(convnextv2)\n\nfind_package(CUDA REQUIRED)\nfind_package(OpenCV REQUIRED)\n\ninclude_directories(${CUDA_INCLUDE_DIRS} /usr/local/cuda/include /usr/local/TensorRT-8.6.1.6/include)\nlink_directories(${CUDA_TOOLKIT_ROOT_DIR}/lib64 /usr/local/cuda/lib64 /usr/local/TensorRT-8.6.1.6/targets/x86_64-linux-gnu/lib)\n\n# TRT\nfind_library(NVINFER nvinfer PATHS /usr/local/TensorRT-8.6.1.6/targets/x86_64-linux-gnu/lib NO_DEFAULT_PATH)\nfind_library(NVINFER_PLUGIN nvinfer_plugin PATHS /usr/local/TensorRT-8.6.1.6/targets/x86_64-linux-gnu/lib NO_DEFAULT_PATH)\nfind_library(NVPARSERS nvparsers PATHS /usr/local/TensorRT-8.6.1.6/targets/x86_64-linux-gnu/lib NO_DEFAULT_PATH)\n\nset(CMAKE_CXX_STANDARD 14)\n\ncuda_add_executable(convnextv2 src/convnextv2.cpp src/LayerNormPlugin.cu)\ntarget_link_libraries(convnextv2 ${NVINFER} ${NVINFER_PLUGIN} ${CUDA_LIBRARIES} ${OpenCV_LIBS})\n\ncuda_add_library(layernorm_plugin SHARED src/LayerNormPlugin.cu)\ntarget_link_libraries(layernorm_plugin ${NVINFER} ${NVINFER_PLUGIN} ${CUDA_LIBRARIES})\n\n# Inference executable\ncuda_add_executable(inference_cpp src/inference_cpp.cpp src/LayerNormPlugin.cu)\ntarget_link_libraries(inference_cpp ${NVINFER} ${NVINFER_PLUGIN} ${CUDA_LIBRARIES} ${OpenCV_LIBS})\n"
  },
  {
    "path": "convnextv2/README.md",
    "content": "# ConvNeXtV2 TensorRT\n\n## Environment\n\n- ubuntu20.04\n-  cuda11.8\n-  cudnn8.9.7\n-  TensorRT8.6.1.6\n-  OpenCV4.13\n\n## Support\n\n[ConvNext-V2](https://github.com/facebookresearch/ConvNeXt-V2.git)provides official pre-trained models such as ImageNet-1K fine-tuned models, ImageNet-22K fine-tuned models, and custom dataset classification models trained using these pre-trained weights.\n\n## Build and Run\n\n``````\n# Downloda dependencies\npip install torch tensorrt pycuda numpy opencv-python\n\n# Generate .wts\ncd path-to-tensorrtx/convnextv2\npython path-to-gen_wts.py path-to-pt path-to-wts\n\n# Build convnextv2\ncmake -B build\nmake -C build\n\n# Update config.yaml to match your selected model\n\n# Generate .engine\n./build/convnextv2 path-to-wts path-to-engine\n\n# Inference(python)\npython path-to-inference.py path-to-engine path-to-your-image path-to-your-labels.txt\n\n# Inference(cpp)\n./build/inference_cpp path-to-engine path-to-your-image path-to-your-labels.txt\n``````\n\n## More Information\n\nAn interesting fact is that the suffix of the engine file can be arbitrarily specified; it does not need to be “engine”, and you can even use your own name as the suffix.\n"
  },
  {
    "path": "convnextv2/config.yaml",
    "content": "# ConvNeXtV2 Configuration\n\n# Model variants reference:\n# Atto:  depths: [2, 2, 6, 2], dims: [40, 80, 160, 320]\n# Femto: depths: [2, 2, 6, 2], dims: [48, 96, 192, 384]\n# Pico:  depths: [2, 2, 6, 2], dims: [64, 128, 256, 512]\n# Nano:  depths: [2, 2, 8, 2], dims: [80, 160, 320, 640]\n# Tiny:  depths: [3, 3, 9, 3], dims: [96, 192, 384, 768]\n# Base:  depths: [3, 3, 27, 3], dims: [128, 256, 512, 1024]\n# Large: depths: [3, 3, 27, 3], dims: [192, 384, 768, 1536]\n# Huge:  depths: [3, 3, 27, 3], dims: [352, 704, 1408, 2816]\n\ndepths: [2, 2, 8, 2]\ndims: [80, 160, 320, 640]\ninput_h: 224\ninput_w: 224\n"
  },
  {
    "path": "convnextv2/gen_wts.py",
    "content": "import torch\nimport struct\n\n\ndef gen_wts(model_path, wts_path):\n    print(f\"Loading {model_path}...\")\n    try:\n        data = torch.load(model_path, map_location='cpu')\n    except FileNotFoundError:\n        print(f\"Error: {model_path} not found.\")\n        return\n\n    if isinstance(data, dict) and 'model' in data:\n        state_dict = data['model']\n    else:\n        state_dict = data\n\n    print(f\"Exporting to {wts_path}...\")\n\n    # Infer architecture\n    dims = []\n    depths = [0, 0, 0, 0]\n\n    # Check dimensions from downsample layers\n    # downsample_layers.0.0 is stem: conv set output to dim[0]\n    # downsample_layers.1.0 is conv: dim[0] -> dim[1]\n    # ...\n\n    if 'downsample_layers.0.0.weight' in state_dict:\n        dims.append(state_dict['downsample_layers.0.0.weight'].shape[0])\n    if 'downsample_layers.1.0.weight' in state_dict:\n        dims.append(state_dict['downsample_layers.1.0.weight'].shape[0])\n    if 'downsample_layers.2.0.weight' in state_dict:\n        dims.append(state_dict['downsample_layers.2.0.weight'].shape[0])\n    if 'downsample_layers.3.0.weight' in state_dict:\n        dims.append(state_dict['downsample_layers.3.0.weight'].shape[0])\n\n    # Count blocks per stage\n    for k in state_dict.keys():\n        if k.startswith('stages.'):\n            parts = k.split('.')\n            if len(parts) >= 3:\n                stage_idx = int(parts[1])\n                block_idx = int(parts[2])\n                if stage_idx < 4:\n                    depths[stage_idx] = max(depths[stage_idx], block_idx + 1)\n\n    print(\"Inferred Architecture:\")\n    print(f\"  Dims: {dims}\")\n    print(f\"  Depths: {depths}\")\n\n    with open(wts_path, 'w') as f:\n        f.write(f\"{len(state_dict)}\\n\")\n        for k, v in state_dict.items():\n            vr = v.reshape(-1).cpu().numpy()\n            f.write(f\"{k} {len(vr)}\")\n            for val in vr:\n                f.write(\" \")\n                
f.write(struct.pack('>f', float(val)).hex())\n            f.write(\"\\n\")\n\n    print(\"Done.\")\n\n\nif __name__ == \"__main__\":\n    import sys\n    if len(sys.argv) != 3:\n        print(f\"Usage: python {sys.argv[0]} <pt_path> <wts_path>\")\n        print(f\"Example: python {sys.argv[0]} models/test.pt convnextv2.wts\")\n        sys.exit(1)\n\n    pt_path = sys.argv[1]\n    wts_path = sys.argv[2]\n    gen_wts(pt_path, wts_path)\n"
  },
  {
    "path": "convnextv2/inference.py",
    "content": "import tensorrt as trt\nimport pycuda.driver as cuda\nimport pycuda.autoinit  # noqa: F401\nimport numpy as np\nimport cv2\nimport ctypes\nimport os\nimport sys\n\n\ndef load_imagenet_labels(label_file=\"imagenet_classes.txt\"):\n    \"\"\"Load ImageNet class labels\"\"\"\n    if not os.path.exists(label_file):\n        return None\n    with open(label_file, 'r') as f:\n        labels = [line.strip() for line in f.readlines()]\n    return labels\n\n\ndef main(engine_path, img_path, label_file=\"imagenet_classes.txt\"):\n    # Load plugin library\n    so_file = os.path.abspath(\"./build/liblayernorm_plugin.so\")\n    if not os.path.exists(so_file):\n        print(f\"Plugin library not found: {so_file}\")\n        return\n\n    ctypes.CDLL(so_file)\n\n    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)\n    runtime = trt.Runtime(TRT_LOGGER)\n\n    if not os.path.exists(engine_path):\n        print(f\"Engine file not found: {engine_path}\")\n        return\n\n    with open(engine_path, \"rb\") as f:\n        serialized_engine = f.read()\n\n    engine = runtime.deserialize_cuda_engine(serialized_engine)\n    if not engine:\n        print(\"Failed to deserialize engine.\")\n        return\n\n    context = engine.create_execution_context()\n    # Get Input Shape from Engine\n    input_shape = (224, 224)  # Default\n    for i in range(engine.num_bindings):\n        if engine.binding_is_input(i):\n            shape = engine.get_binding_shape(i)\n            # shape is usually (N, C, H, W) or (C, H, W)\n            if len(shape) == 4:\n                input_shape = (shape[2], shape[3])\n            elif len(shape) == 3:\n                input_shape = (shape[1], shape[2])\n            break\n\n    # Prepare input\n    img = cv2.imread(img_path)\n    if img is None:\n        print(f\"Failed to load image: {img_path}\")\n        return\n    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)\n    img = cv2.resize(img, (input_shape[1], input_shape[0]))  # cv2.resize takes 
(W, H)\n    img = img.astype(np.float32) / 255.0\n\n    # ImageNet Mean/Std\n    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)\n    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)\n    img = (img - mean) / std\n\n    img = img.transpose(2, 0, 1)  # HWC -> CHW\n    img = np.expand_dims(img, axis=0)  # CHW -> NCHW\n    img = np.ascontiguousarray(img)\n\n    inputs, outputs, bindings, stream = [], [], [], cuda.Stream()\n\n    for i in range(engine.num_bindings):\n        dtype = trt.nptype(engine.get_binding_dtype(i))\n        shape = engine.get_binding_shape(i)\n\n        # Handle dynamic shape or fixed\n        # Check if input or output\n        is_input = engine.binding_is_input(i)\n\n        # Since we use explicit batch, shape[0] might be -1 or 1\n        # If -1, we set context binding shape\n        if shape[0] == -1:\n            shape = (1,) + shape[1:]\n            context.set_binding_shape(i, shape)\n\n        size = trt.volume(shape) * np.dtype(dtype).itemsize\n\n        # Host memory\n        host_mem = cuda.pagelocked_empty(trt.volume(shape), dtype)\n        # Device memory\n        device_mem = cuda.mem_alloc(size)\n\n        bindings.append(int(device_mem))\n\n        if is_input:\n            inputs.append({'host': host_mem, 'device': device_mem, 'shape': shape})\n            # Copy input data to host buffer\n            np.copyto(host_mem, img.ravel())\n        else:\n            outputs.append({'host': host_mem, 'device': device_mem, 'shape': shape})\n\n    # Inference\n    # Transfer input data to the GPU.\n    for inp in inputs:\n        cuda.memcpy_htod_async(inp['device'], inp['host'], stream)\n\n    # Run inference.\n    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n\n    # Transfer predictions back from the GPU.\n    for out in outputs:\n        cuda.memcpy_dtoh_async(out['host'], out['device'], stream)\n\n    # Synchronize the stream\n    stream.synchronize()\n\n    # Process output\n    labels 
= load_imagenet_labels(label_file)\n    for out in outputs:\n        output_data = out['host']\n        max_idx = np.argmax(output_data)\n        max_val = output_data[max_idx]\n        if labels:\n            print(f\"Predicted Class: {max_idx} - {labels[max_idx]} (Score: {max_val})\")\n        else:\n            print(f\"Predicted Class: {max_idx} (Score: {max_val})\")\n\n\nif __name__ == \"__main__\":\n    if len(sys.argv) < 3 or len(sys.argv) > 4:\n        print(f\"Usage: python {sys.argv[0]} <engine_path> <image_path> [label_file]\")\n        print(f\"Example: python {sys.argv[0]} convnextv2.engine images/test.jpg\")\n        print(f\"         python {sys.argv[0]} convnextv2.engine images/test.jpg custom_labels.txt\")\n        sys.exit(1)\n\n    engine_path = sys.argv[1]\n    img_path = sys.argv[2]\n    label_file = sys.argv[3] if len(sys.argv) == 4 else \"imagenet_classes.txt\"\n    main(engine_path, img_path, label_file)\n"
  },
  {
    "path": "convnextv2/src/LayerNormPlugin.cu",
    "content": "#include <cuda_fp16.h>\n#include <cassert>\n#include <cstring>\n#include <cub/cub.cuh>\n#include <iostream>\n#include \"LayerNormPlugin.h\"\n\nusing namespace nvinfer1;\n\nstatic const char* PLUGIN_NAME = \"LayerNorm\";\nstatic const char* PLUGIN_VERSION = \"1\";\n\nPluginFieldCollection LayerNormPluginCreator::mFC{};\nstd::vector<PluginField> LayerNormPluginCreator::mPluginAttributes;\n\n// Helper to check CUDA errors\n#define CHECK(status)                                                                     \\\n    do {                                                                                  \\\n        auto ret = (status);                                                              \\\n        if (ret != 0) {                                                                   \\\n            std::cerr << \"Cuda failure: \" << ret << \" at line \" << __LINE__ << std::endl; \\\n            abort();                                                                      \\\n        }                                                                                 \\\n    } while (0)\n\ntemplate <typename T>\n__device__ inline T epsilon();\n\ntemplate <>\n__device__ inline float epsilon<float>() {\n    return 1e-6f;\n}\n\ntemplate <>\n__device__ inline half epsilon<half>() {\n    return (half)1e-6f;\n}\n\n// --- Kernel ---\n// Supports hidden_size up to 1024 with TPB=256, VPT=4\ntemplate <typename T, int VPT>\n__global__ void layerNormKernel(const T* __restrict__ input, const T* __restrict__ gamma, const T* __restrict__ beta,\n                                T* __restrict__ output, int hidden_size, float eps) {\n    // blockIdx.x corresponds to one instance (one row of hidden_size elements)\n\n    int row_offset = blockIdx.x * hidden_size;\n\n    // Load data\n    float vals[VPT];\n#pragma unroll\n    for (int i = 0; i < VPT; ++i) {\n        int col = threadIdx.x * VPT + i;\n        if (col < hidden_size) {\n            vals[i] = 
(float)input[row_offset + col];\n        } else {\n            vals[i] = 0.0f;\n        }\n    }\n\n    // Compute mean\n    float thread_sum = 0.0f;\n#pragma unroll\n    for (int i = 0; i < VPT; ++i) {\n        if (threadIdx.x * VPT + i < hidden_size)\n            thread_sum += vals[i];\n    }\n\n    using BlockReduce = cub::BlockReduce<float, 256>;\n    __shared__ typename BlockReduce::TempStorage temp_storage;\n    float sum = BlockReduce(temp_storage).Sum(thread_sum);\n    __shared__ float mean;\n    if (threadIdx.x == 0)\n        mean = sum / hidden_size;\n    __syncthreads();\n\n    // Compute variance\n    float thread_sq_diff = 0.0f;\n#pragma unroll\n    for (int i = 0; i < VPT; ++i) {\n        if (threadIdx.x * VPT + i < hidden_size) {\n            float diff = vals[i] - mean;\n            thread_sq_diff += diff * diff;\n        }\n    }\n    float sq_diff_sum = BlockReduce(temp_storage).Sum(thread_sq_diff);\n    __shared__ float inv_std;\n    if (threadIdx.x == 0) {\n        inv_std = rsqrtf((sq_diff_sum / hidden_size) + eps);\n    }\n    __syncthreads();\n\n// Normalize and scale\n#pragma unroll\n    for (int i = 0; i < VPT; ++i) {\n        int col = threadIdx.x * VPT + i;\n        if (col < hidden_size) {\n            float val = (vals[i] - mean) * inv_std;\n            float g = (float)gamma[col];\n            float b = (float)beta[col];\n            output[row_offset + col] = (T)(val * g + b);\n        }\n    }\n}\n\n// --- Plugin Implementation ---\n\nLayerNormPlugin::LayerNormPlugin(const std::string& name, float epsilon, int hidden_size)\n    : mName(name), mEpsilon(epsilon), mHiddenSize(hidden_size) {}\n\nLayerNormPlugin::LayerNormPlugin(const std::string& name, const void* data, size_t length) : mName(name) {\n    const char* d = static_cast<const char*>(data);\n    const char* a = d;\n    mEpsilon = *reinterpret_cast<const float*>(d);\n    d += sizeof(float);\n    mHiddenSize = *reinterpret_cast<const int*>(d);\n    d += sizeof(int);\n    
assert(d == a + length);\n}\n\nLayerNormPlugin::~LayerNormPlugin() {}\n\nIPluginV2DynamicExt* LayerNormPlugin::clone() const noexcept {\n    auto p = new LayerNormPlugin(mName, mEpsilon, mHiddenSize);\n    p->setPluginNamespace(mNamespace.c_str());\n    return p;\n}\n\nint32_t LayerNormPlugin::getNbOutputs() const noexcept {\n    return 1;\n}\n\nDataType LayerNormPlugin::getOutputDataType(int32_t index, const DataType* inputTypes,\n                                            int32_t nbInputs) const noexcept {\n    return inputTypes[0];\n}\n\nDimsExprs LayerNormPlugin::getOutputDimensions(int32_t outputIndex, const DimsExprs* inputs, int32_t nbInputs,\n                                               IExprBuilder& exprBuilder) noexcept {\n    return inputs[0];\n}\n\nbool LayerNormPlugin::supportsFormatCombination(int32_t pos, const PluginTensorDesc* inOut, int32_t nbInputs,\n                                                int32_t nbOutputs) noexcept {\n    if (pos == 0) {  // Input\n        return (inOut[0].type == DataType::kFLOAT || inOut[0].type == DataType::kHALF) &&\n               inOut[0].format == TensorFormat::kLINEAR;\n    }\n    if (pos == 1 || pos == 2) {  // Gamma, Beta\n        return inOut[pos].type == inOut[0].type && inOut[pos].format == TensorFormat::kLINEAR;\n    }\n    if (pos == 3) {  // Output\n        return inOut[pos].type == inOut[0].type && inOut[pos].format == TensorFormat::kLINEAR;\n    }\n    return false;\n}\n\nvoid LayerNormPlugin::configurePlugin(const DynamicPluginTensorDesc* in, int32_t nbInputs,\n                                      const DynamicPluginTensorDesc* out, int32_t nbOutputs) noexcept {\n    // Validate inputs\n    mHiddenSize = in[0].desc.dims.d[in[0].desc.dims.nbDims - 1];\n}\n\nsize_t LayerNormPlugin::getWorkspaceSize(const PluginTensorDesc* inputs, int32_t nbInputs,\n                                         const PluginTensorDesc* outputs, int32_t nbOutputs) const noexcept {\n    return 0;\n}\n\nint32_t 
LayerNormPlugin::enqueue(const PluginTensorDesc* inputDesc, const PluginTensorDesc* outputDesc,\n                                 const void* const* inputs, void* const* outputs, void* workspace,\n                                 cudaStream_t stream) noexcept {\n    // The kernel covers at most TPB * VPT = 256 * 4 = 1024 elements per row; reject anything else\n    // instead of silently producing wrong results (this also avoids dividing by zero below).\n    if (mHiddenSize <= 0 || mHiddenSize > 256 * 4) {\n        return 1;\n    }\n\n    int total = 1;\n    for (int i = 0; i < inputDesc[0].dims.nbDims; ++i)\n        total *= inputDesc[0].dims.d[i];\n    int rows = total / mHiddenSize;\n\n    if (inputDesc[0].type == DataType::kFLOAT) {\n        layerNormKernel<float, 4><<<rows, 256, 0, stream>>>((const float*)inputs[0], (const float*)inputs[1],\n                                                            (const float*)inputs[2], (float*)outputs[0], mHiddenSize,\n                                                            mEpsilon);\n    } else {\n        layerNormKernel<half, 4><<<rows, 256, 0, stream>>>((const half*)inputs[0], (const half*)inputs[1],\n                                                           (const half*)inputs[2], (half*)outputs[0], mHiddenSize,\n                                                           mEpsilon);\n    }\n    return 0;\n}\n\nconst char* LayerNormPlugin::getPluginType() const noexcept {\n    return PLUGIN_NAME;\n}\nconst char* LayerNormPlugin::getPluginVersion() const noexcept {\n    return PLUGIN_VERSION;\n}\n\nvoid LayerNormPlugin::destroy() noexcept {\n    delete this;\n}\n\nint32_t LayerNormPlugin::initialize() noexcept {\n    return 0;\n}\nvoid LayerNormPlugin::terminate() noexcept {}\n\nsize_t LayerNormPlugin::getSerializationSize() const noexcept {\n    return sizeof(float) + sizeof(int);\n}\n\nvoid LayerNormPlugin::serialize(void* buffer) const noexcept {\n    char* d = static_cast<char*>(buffer);\n    *reinterpret_cast<float*>(d) = mEpsilon;\n    d += sizeof(float);\n    *reinterpret_cast<int*>(d) = mHiddenSize;\n    d += sizeof(int);\n}\n\nvoid LayerNormPlugin::setPluginNamespace(const char* libNamespace) noexcept {\n    mNamespace = libNamespace;\n}\nconst char* 
LayerNormPlugin::getPluginNamespace() const noexcept {\n    return mNamespace.c_str();\n}\n\n// --- Creator Implementation ---\n\nLayerNormPluginCreator::LayerNormPluginCreator() {\n    mPluginAttributes.emplace_back(PluginField(\"epsilon\", nullptr, PluginFieldType::kFLOAT32, 1));\n    mFC.nbFields = mPluginAttributes.size();\n    mFC.fields = mPluginAttributes.data();\n}\n\nLayerNormPluginCreator::~LayerNormPluginCreator() {}\n\nconst char* LayerNormPluginCreator::getPluginName() const noexcept {\n    return PLUGIN_NAME;\n}\nconst char* LayerNormPluginCreator::getPluginVersion() const noexcept {\n    return PLUGIN_VERSION;\n}\n\nconst PluginFieldCollection* LayerNormPluginCreator::getFieldNames() noexcept {\n    return &mFC;\n}\n\nIPluginV2* LayerNormPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) noexcept {\n    float epsilon = 1e-6f;\n    for (int i = 0; i < fc->nbFields; ++i) {\n        if (strcmp(fc->fields[i].name, \"epsilon\") == 0) {\n            epsilon = *static_cast<const float*>(fc->fields[i].data);\n        }\n    }\n    return new LayerNormPlugin(name, epsilon, 0);  // hidden_size will be set in configure\n}\n\nIPluginV2* LayerNormPluginCreator::deserializePlugin(const char* name, const void* serialData,\n                                                     size_t serialLength) noexcept {\n    return new LayerNormPlugin(name, serialData, serialLength);\n}\n\nvoid LayerNormPluginCreator::setPluginNamespace(const char* libNamespace) noexcept {\n    mNamespace = libNamespace;\n}\nconst char* LayerNormPluginCreator::getPluginNamespace() const noexcept {\n    return mNamespace.c_str();\n}\n\nREGISTER_TENSORRT_PLUGIN(LayerNormPluginCreator);\n"
  },
  {
    "path": "convnextv2/src/LayerNormPlugin.h",
    "content": "#ifndef LAYER_NORM_PLUGIN_H\n#define LAYER_NORM_PLUGIN_H\n\n#include <NvInfer.h>\n#include <string>\n#include <vector>\n\nusing namespace nvinfer1;\n\nclass LayerNormPlugin : public IPluginV2DynamicExt {\n   public:\n    LayerNormPlugin(const std::string& name, float epsilon, int hidden_size);\n    LayerNormPlugin(const std::string& name, const void* data, size_t length);\n    LayerNormPlugin() = delete;\n    ~LayerNormPlugin() override;\n\n    // IPluginV2DynamicExt Methods\n    IPluginV2DynamicExt* clone() const noexcept override;\n    int32_t getNbOutputs() const noexcept override;\n    DataType getOutputDataType(int32_t index, const DataType* inputTypes, int32_t nbInputs) const noexcept override;\n    DimsExprs getOutputDimensions(int32_t outputIndex, const DimsExprs* inputs, int32_t nbInputs,\n                                  IExprBuilder& exprBuilder) noexcept override;\n    bool supportsFormatCombination(int32_t pos, const PluginTensorDesc* inOut, int32_t nbInputs,\n                                   int32_t nbOutputs) noexcept override;\n    void configurePlugin(const DynamicPluginTensorDesc* in, int32_t nbInputs, const DynamicPluginTensorDesc* out,\n                         int32_t nbOutputs) noexcept override;\n    size_t getWorkspaceSize(const PluginTensorDesc* inputs, int32_t nbInputs, const PluginTensorDesc* outputs,\n                            int32_t nbOutputs) const noexcept override;\n    int32_t enqueue(const PluginTensorDesc* inputDesc, const PluginTensorDesc* outputDesc, const void* const* inputs,\n                    void* const* outputs, void* workspace, cudaStream_t stream) noexcept override;\n\n    // IPluginV2 Methods\n    const char* getPluginType() const noexcept override;\n    const char* getPluginVersion() const noexcept override;\n    void destroy() noexcept override;\n    int32_t initialize() noexcept override;\n    void terminate() noexcept override;\n    size_t getSerializationSize() const noexcept override;\n    
void serialize(void* buffer) const noexcept override;\n    void setPluginNamespace(const char* pluginNamespace) noexcept override;\n    const char* getPluginNamespace() const noexcept override;\n\n   private:\n    std::string mName;\n    std::string mNamespace;\n    float mEpsilon;\n    int mHiddenSize;  // Number of channels\n};\n\nclass LayerNormPluginCreator : public IPluginCreator {\n   public:\n    LayerNormPluginCreator();\n    ~LayerNormPluginCreator() override;\n\n    const char* getPluginName() const noexcept override;\n    const char* getPluginVersion() const noexcept override;\n    const PluginFieldCollection* getFieldNames() noexcept override;\n    IPluginV2* createPlugin(const char* name, const PluginFieldCollection* fc) noexcept override;\n    IPluginV2* deserializePlugin(const char* name, const void* serialData, size_t serialLength) noexcept override;\n    void setPluginNamespace(const char* pluginNamespace) noexcept override;\n    const char* getPluginNamespace() const noexcept override;\n\n   private:\n    static PluginFieldCollection mFC;\n    static std::vector<PluginField> mPluginAttributes;\n    std::string mNamespace;\n};\n\n#endif  // LAYER_NORM_PLUGIN_H\n"
  },
  {
    "path": "convnextv2/src/convnextv2.cpp",
    "content": "#include <cuda_runtime_api.h>\n#include <algorithm>\n#include <cassert>\n#include <cmath>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <opencv2/opencv.hpp>\n#include <vector>\n#include \"LayerNormPlugin.h\"\n#include \"NvInfer.h\"\n#include \"logging.h\"\n\nstatic const char* INPUT_BLOB_NAME = \"data\";\nstatic const char* OUTPUT_BLOB_NAME = \"output\";\n\nstruct ConvNextConfig {\n    int depths[4];\n    int dims[4];\n    int input_h;\n    int input_w;\n};\n\n// Simple parser for YAML-like config (key: [v1, v2..] or key: value)\nConvNextConfig loadConfig(const std::string& configPath) {\n    ConvNextConfig cfg;\n    // Default to Nano\n    cfg.depths[0] = 2;\n    cfg.depths[1] = 2;\n    cfg.depths[2] = 8;\n    cfg.depths[3] = 2;\n    cfg.dims[0] = 80;\n    cfg.dims[1] = 160;\n    cfg.dims[2] = 320;\n    cfg.dims[3] = 640;\n    cfg.input_h = 224;\n    cfg.input_w = 224;\n\n    std::ifstream file(configPath);\n    if (!file.is_open()) {\n        std::cerr << \"Warning: Could not open config file \" << configPath << \". 
Using default Nano config.\"\n                  << std::endl;\n        return cfg;\n    }\n\n    std::string line;\n    while (std::getline(file, line)) {\n        if (line.empty() || line[0] == '#')\n            continue;\n        std::stringstream ss(line);\n        std::string key;\n        std::getline(ss, key, ':');\n\n        // Trim key\n        key.erase(0, key.find_first_not_of(\" \\t\"));\n        key.erase(key.find_last_not_of(\" \\t\") + 1);\n\n        if (key == \"depths\" || key == \"dims\") {\n            // format: [v1, v2, v3, v4]\n            std::string valStr;\n            std::getline(ss, valStr);\n            // Simple parse: remove [ ] and split by ,\n            size_t start = valStr.find('[');\n            size_t end = valStr.find(']');\n            if (start != std::string::npos && end != std::string::npos) {\n                std::string nums = valStr.substr(start + 1, end - start - 1);\n                std::stringstream ssNums(nums);\n                std::string segment;\n                int idx = 0;\n                while (std::getline(ssNums, segment, ',') && idx < 4) {\n                    if (key == \"depths\")\n                        cfg.depths[idx++] = std::stoi(segment);\n                    else\n                        cfg.dims[idx++] = std::stoi(segment);\n                }\n            }\n        } else if (key == \"input_h\") {\n            int val;\n            ss >> val;\n            cfg.input_h = val;\n        } else if (key == \"input_w\") {\n            int val;\n            ss >> val;\n            cfg.input_w = val;\n        }\n    }\n    std::cout << \"Loaded Config - Depths: [\" << cfg.depths[0] << \",\" << cfg.depths[1] << \",\" << cfg.depths[2] << \",\"\n              << cfg.depths[3] << \"]\"\n              << \" Dims: [\" << cfg.dims[0] << \",\" << cfg.dims[1] << \",\" << cfg.dims[2] << \",\" << cfg.dims[3] << \"]\"\n              << \" Input: \" << cfg.input_h << \"x\" << cfg.input_w << std::endl;\n    return 
cfg;\n}\n\n// Global config\nstatic ConvNextConfig g_config;\n// Macros/Consts replaced by g_config members\n#define DEPTHS g_config.depths\n#define DIMS g_config.dims\n#define INPUT_H g_config.input_h\n#define INPUT_W g_config.input_w\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Global variables for paths\nstd::string g_wts_path = \"convnextv2.wts\";\nstd::string g_engine_path = \"convnextv2.engine\";\n\n// Weights utils\nstd::map<std::string, Weights> loadWeights(const std::string& file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n    while (count--) {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n        uint32_t* val = new uint32_t[size];\n        for (uint32_t x = 0, y = size; x < y; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition* network, ITensor& input, std::map<std::string, Weights>& weightMap,\n                            std::string name, float eps) {\n    float* gamma = (float*)weightMap[name + \".weight\"].values;\n    float* beta = (float*)weightMap[name + \".bias\"].values;\n    float* mean = (float*)weightMap[name + \".running_mean\"].values;\n    float* var = (float*)weightMap[name + \".running_var\"].values;\n    int len = weightMap[name + \".running_var\"].count;\n\n    float* scval = new float[len];\n    float* shval = new float[len];\n    float* pval = new float[len];\n\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n     
   shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n        pval[i] = 1.0;\n    }\n    Weights wsc{DataType::kFLOAT, scval, len};\n    Weights wsh{DataType::kFLOAT, shval, len};\n    Weights wpower{DataType::kFLOAT, pval, len};\n\n    IScaleLayer* scale = network->addScale(input, ScaleMode::kCHANNEL, wsh, wsc, wpower);\n    assert(scale);\n    return scale;\n}\n\nITensor* convNextBlock(INetworkDefinition* network, ITensor* input, int dim, std::string name,\n                       std::map<std::string, Weights>& weightMap) {\n    // Input is NCHW\n\n    // 1. DWConv 7x7\n    Weights empty{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* dwconv = network->addConvolutionNd(*input, dim, DimsHW{7, 7}, weightMap[name + \".dwconv.weight\"],\n                                                          weightMap[name + \".dwconv.bias\"]);\n    assert(dwconv);\n    dwconv->setStrideNd(DimsHW{1, 1});\n    dwconv->setPaddingNd(DimsHW{3, 3});\n    dwconv->setNbGroups(dim);\n    ITensor* x = dwconv->getOutput(0);\n\n    // 2. Permute NCHW -> NHWC for LayerNorm\n    IShuffleLayer* p1 = network->addShuffle(*x);\n    p1->setSecondTranspose({0, 2, 3, 1});\n    x = p1->getOutput(0);\n\n    // 3. 
LayerNorm (Plugin)\n    auto creator = getPluginRegistry()->getPluginCreator(\"LayerNorm\", \"1\");\n    PluginFieldCollection pfc;\n    float eps = 1e-6f;\n    PluginField pf(\"epsilon\", &eps, PluginFieldType::kFLOAT32, 1);\n    pfc.nbFields = 1;\n    pfc.fields = &pf;\n    IPluginV2* plugin = creator->createPlugin(name.c_str(), &pfc);\n\n    // Pass gamma/beta (1D of size C) as plugin inputs along with x (N,H,W,C)\n    auto w_ln_w = weightMap[name + \".norm.weight\"];\n    auto w_ln_b = weightMap[name + \".norm.bias\"];\n    IConstantLayer* c_gamma = network->addConstant(Dims{1, {w_ln_w.count}}, w_ln_w);\n    IConstantLayer* c_beta = network->addConstant(Dims{1, {w_ln_b.count}}, w_ln_b);\n\n    ITensor* inputs[] = {x, c_gamma->getOutput(0), c_beta->getOutput(0)};\n    IPluginV2Layer* ln = network->addPluginV2(inputs, 3, *plugin);\n    x = ln->getOutput(0);\n\n    // 4. Permute NHWC -> NCHW\n    IShuffleLayer* p2 = network->addShuffle(*x);\n    p2->setSecondTranspose({0, 3, 1, 2});\n    x = p2->getOutput(0);\n\n    // 5. PWConv1 (1x1)\n    IConvolutionLayer* pw1 = network->addConvolutionNd(*x, 4 * dim, DimsHW{1, 1}, weightMap[name + \".pwconv1.weight\"],\n                                                       weightMap[name + \".pwconv1.bias\"]);\n    x = pw1->getOutput(0);\n\n    // 6. 
GELU\n    // Manual GELU implementation: 0.5 * x * (1 + erf(x / sqrt(2)))\n    float* sqrt2_inv = new float[1];\n    *sqrt2_inv = 1.0f / std::sqrt(2.0f);\n    Weights w_sqrt2{DataType::kFLOAT, sqrt2_inv, 1};\n    IConstantLayer* c_sqrt2 = network->addConstant(Dims4{1, 1, 1, 1}, w_sqrt2);  // Broadcast\n\n    IElementWiseLayer* div = network->addElementWise(*x, *c_sqrt2->getOutput(0), ElementWiseOperation::kPROD);\n    IUnaryLayer* erf = network->addUnary(*div->getOutput(0), UnaryOperation::kERF);\n\n    float* one = new float[1];\n    *one = 1.0f;\n    Weights w_one{DataType::kFLOAT, one, 1};\n    IConstantLayer* c_one = network->addConstant(Dims4{1, 1, 1, 1}, w_one);\n\n    IElementWiseLayer* add_erf =\n            network->addElementWise(*erf->getOutput(0), *c_one->getOutput(0), ElementWiseOperation::kSUM);\n\n    float* half = new float[1];\n    *half = 0.5f;\n    Weights w_half{DataType::kFLOAT, half, 1};\n    IConstantLayer* c_half = network->addConstant(Dims4{1, 1, 1, 1}, w_half);\n\n    IElementWiseLayer* mul_half = network->addElementWise(*x, *c_half->getOutput(0), ElementWiseOperation::kPROD);\n\n    IElementWiseLayer* gelu =\n            network->addElementWise(*mul_half->getOutput(0), *add_erf->getOutput(0), ElementWiseOperation::kPROD);\n    x = gelu->getOutput(0);\n\n    // 7. GRN (implemented in NCHW). 
X shape: [N, 4*dim, H, W], gx -> [N, C, 1, 1]\n\n    // x*x\n    IElementWiseLayer* sq = network->addElementWise(*x, *x, ElementWiseOperation::kPROD);\n    ITensor* x_sq = sq->getOutput(0);\n\n    // Sum over H,W (axes 2, 3 = 4 | 8 = 12)\n    IReduceLayer* red_sum = network->addReduce(*x_sq, ReduceOperation::kSUM, 12, true);\n    ITensor* sum_x = red_sum->getOutput(0);\n\n    // Sqrt -> per-channel L2 norm over H,W\n    IUnaryLayer* sqrt_layer = network->addUnary(*sum_x, UnaryOperation::kSQRT);\n    ITensor* gx = sqrt_layer->getOutput(0);  // [N, C, 1, 1]\n\n    // Normalize GRN: nx = gx / (mean(gx, dim=1) + eps)\n    // Mean over C (axis 1)\n    IReduceLayer* red_mean = network->addReduce(*gx, ReduceOperation::kAVG, 2, true);  // bit 1 set -> axis 1\n    ITensor* mean_gx = red_mean->getOutput(0);                                         // [N, 1, 1, 1]\n\n    // Add eps via a scalar constant of shape [1,1,1,1]. The value is heap-allocated because TensorRT\n    // reads weight memory when the engine is built, after this function has returned.\n    float* eps_ptr = new float[1];\n    eps_ptr[0] = 1e-6f;\n    Weights eps_w{DataType::kFLOAT, eps_ptr, 1};\n    IConstantLayer* c_eps = network->addConstant(Dims4{1, 1, 1, 1}, eps_w);\n\n    IElementWiseLayer* add_eps = network->addElementWise(*mean_gx, *c_eps->getOutput(0), ElementWiseOperation::kSUM);\n    ITensor* denom = add_eps->getOutput(0);\n\n    // Div\n    IElementWiseLayer* div_grn = network->addElementWise(*gx, *denom, ElementWiseOperation::kDIV);\n    ITensor* nx = div_grn->getOutput(0);  // [N, C, 1, 1]\n\n    // Scale X by nx\n    IElementWiseLayer* scale_x = network->addElementWise(*x, *nx, ElementWiseOperation::kPROD);\n    ITensor* x_norm = scale_x->getOutput(0);\n\n    // Apply Gamma/Beta for GRN (channel-wise scale), then add the GRN input (the GELU output) as a residual\n    Weights w_grn_g = weightMap[name + \".grn.gamma\"];\n    Weights w_grn_b = weightMap[name + \".grn.beta\"];\n    Weights w_power{DataType::kFLOAT, nullptr, 0};\n    IScaleLayer* grn_scale = network->addScale(*x_norm, 
ScaleMode::kCHANNEL, w_grn_b, w_grn_g, w_power);\n    x = grn_scale->getOutput(0);\n\n    // Residual: x = grn_scaled + gelu_output\n    ITensor* x_in = gelu->getOutput(0);\n    IElementWiseLayer* add_grn = network->addElementWise(*x, *x_in, ElementWiseOperation::kSUM);\n    x = add_grn->getOutput(0);\n\n    // 8. PWConv2 (1x1)\n    IConvolutionLayer* pw2 = network->addConvolutionNd(*x, dim, DimsHW{1, 1}, weightMap[name + \".pwconv2.weight\"],\n                                                       weightMap[name + \".pwconv2.bias\"]);\n    x = pw2->getOutput(0);\n\n    // 9. DropPath (Ignored in inference)\n\n    // 10. Residual\n    IElementWiseLayer* res = network->addElementWise(*input, *x, ElementWiseOperation::kSUM);\n    return res->getOutput(0);\n}\n\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);\n    INetworkDefinition* network = builder->createNetworkV2(explicitBatch);\n\n    // Create input\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims4{maxBatchSize, 3, INPUT_H, INPUT_W});\n    assert(data);\n\n    // Load weights from the path provided via command line (g_wts_path)\n    std::map<std::string, Weights> weightMap = loadWeights(g_wts_path);\n\n    // Initialize Stem\n    // downsample_layers.0: Conv 4x4, s=4 -> LN\n    // Conv\n    IConvolutionLayer* conv0 =\n            network->addConvolutionNd(*data, DIMS[0], DimsHW{4, 4}, weightMap[\"downsample_layers.0.0.weight\"],\n                                      weightMap[\"downsample_layers.0.0.bias\"]);\n    assert(conv0);\n    conv0->setStrideNd(DimsHW{4, 4});\n\n    ITensor* x = conv0->getOutput(0);\n\n    // LN\n    // Transpose to NHWC\n    IShuffleLayer* p0 = network->addShuffle(*x);\n    p0->setSecondTranspose({0, 2, 3, 1});\n    x = p0->getOutput(0);\n\n    // Plugin\n    auto creator = 
getPluginRegistry()->getPluginCreator(\"LayerNorm\", \"1\");\n    PluginFieldCollection pfc;\n    float eps = 1e-6f;\n    PluginField pf(\"epsilon\", &eps, PluginFieldType::kFLOAT32, 1);\n    pfc.nbFields = 1;\n    pfc.fields = &pf;\n    IPluginV2* plugin = creator->createPlugin(\"stem_ln\", &pfc);\n\n    auto w_ln0_w = weightMap[\"downsample_layers.0.1.weight\"];\n    auto w_ln0_b = weightMap[\"downsample_layers.0.1.bias\"];\n    IConstantLayer* c_g0 = network->addConstant(Dims{1, {w_ln0_w.count}}, w_ln0_w);\n    IConstantLayer* c_b0 = network->addConstant(Dims{1, {w_ln0_b.count}}, w_ln0_b);\n    ITensor* in0[] = {x, c_g0->getOutput(0), c_b0->getOutput(0)};\n    IPluginV2Layer* ln0 = network->addPluginV2(in0, 3, *plugin);\n    x = ln0->getOutput(0);\n\n    // Transpose back\n    IShuffleLayer* p0_back = network->addShuffle(*x);\n    p0_back->setSecondTranspose({0, 3, 1, 2});\n    x = p0_back->getOutput(0);\n\n    // Stages\n    for (int i = 0; i < 4; i++) {\n        // Downsample layer (except first stage which is stem)\n        if (i > 0) {\n            std::string ds_name = \"downsample_layers.\" + std::to_string(i);\n            // LN -> Conv 2x2 s=2\n            // LN (NHWC)\n            IShuffleLayer* p_ds = network->addShuffle(*x);\n            p_ds->setSecondTranspose({0, 2, 3, 1});\n            x = p_ds->getOutput(0);\n\n            auto creator = getPluginRegistry()->getPluginCreator(\"LayerNorm\", \"1\");\n            PluginFieldCollection pfc_ds;\n            float eps_ds = 1e-6f;\n            PluginField pf_ds(\"epsilon\", &eps_ds, PluginFieldType::kFLOAT32, 1);\n            pfc_ds.nbFields = 1;\n            pfc_ds.fields = &pf_ds;\n            IPluginV2* plugin_ds = creator->createPlugin((ds_name + \"_ln\").c_str(), &pfc_ds);\n\n            auto w_ds_w = weightMap[ds_name + \".0.weight\"];\n            auto w_ds_b = weightMap[ds_name + \".0.bias\"];\n            IConstantLayer* c_ds_g = network->addConstant(Dims{1, {w_ds_w.count}}, w_ds_w);\n          
  IConstantLayer* c_ds_b = network->addConstant(Dims{1, {w_ds_b.count}}, w_ds_b);\n            ITensor* in_ds[] = {x, c_ds_g->getOutput(0), c_ds_b->getOutput(0)};\n            IPluginV2Layer* ln_ds = network->addPluginV2(in_ds, 3, *plugin_ds);\n            x = ln_ds->getOutput(0);\n\n            IShuffleLayer* p_ds_back = network->addShuffle(*x);\n            p_ds_back->setSecondTranspose({0, 3, 1, 2});\n            x = p_ds_back->getOutput(0);\n\n            // Conv 2x2, s=2\n            IConvolutionLayer* conv_ds = network->addConvolutionNd(\n                    *x, DIMS[i], DimsHW{2, 2}, weightMap[ds_name + \".1.weight\"], weightMap[ds_name + \".1.bias\"]);\n            conv_ds->setStrideNd(DimsHW{2, 2});\n            x = conv_ds->getOutput(0);\n        }\n\n        // Blocks\n        for (int j = 0; j < DEPTHS[i]; j++) {\n            std::string block_name = \"stages.\" + std::to_string(i) + \".\" + std::to_string(j);\n            x = convNextBlock(network, x, DIMS[i], block_name, weightMap);\n        }\n    }\n\n    // Final Norm (Global Avg Pooling -> LayerNorm -> Head)\n\n    // Global Avg Pooling\n    IReduceLayer* gap = network->addReduce(*x, ReduceOperation::kAVG, 12, true);  // average over H,W (axes 2,3)\n    x = gap->getOutput(0);                                                        // [N, C, 1, 1]\n\n    // Reshape to [N,1,1,C] so LayerNorm plugin sees channels as last dimension\n    IShuffleLayer* p_fin = network->addShuffle(*x);\n    p_fin->setReshapeDimensions(Dims4{maxBatchSize, 1, 1, DIMS[3]});\n    x = p_fin->getOutput(0);\n\n    auto creator_fin = getPluginRegistry()->getPluginCreator(\"LayerNorm\", \"1\");\n    PluginFieldCollection pfc_fin;\n    float eps_fin = 1e-6f;\n    PluginField pf_fin(\"epsilon\", &eps_fin, PluginFieldType::kFLOAT32, 1);\n    pfc_fin.nbFields = 1;\n    pfc_fin.fields = &pf_fin;\n    IPluginV2* plugin_fin = creator_fin->createPlugin(\"final_norm\", &pfc_fin);\n\n    // norm.weight / norm.bias\n    auto w_fn_w = 
weightMap[\"norm.weight\"];\n    auto w_fn_b = weightMap[\"norm.bias\"];\n    IConstantLayer* c_fn_g = network->addConstant(Dims{1, {w_fn_w.count}}, w_fn_w);\n    IConstantLayer* c_fn_b = network->addConstant(Dims{1, {w_fn_b.count}}, w_fn_b);\n    ITensor* in_fn[] = {x, c_fn_g->getOutput(0), c_fn_b->getOutput(0)};\n    IPluginV2Layer* ln_fn = network->addPluginV2(in_fn, 3, *plugin_fin);\n    x = ln_fn->getOutput(0);\n\n    // Reshape back to [N, C, 1, 1] for 1x1 conv.\n    IShuffleLayer* p_fin_b = network->addShuffle(*x);\n    p_fin_b->setReshapeDimensions(Dims4{maxBatchSize, DIMS[3], 1, 1});\n    x = p_fin_b->getOutput(0);\n\n    Weights head_w = weightMap[\"head.weight\"];\n    Weights head_b = weightMap[\"head.bias\"];\n    // Check num classes\n    int num_classes = head_w.count / DIMS[3];\n\n    IConvolutionLayer* head = network->addConvolutionNd(*x, num_classes, DimsHW{1, 1}, head_w, head_b);\n    x = head->getOutput(0);\n\n    x->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*x);\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n// Workspace size configured below depending on TRT version\n#if (NV_TENSORRT_MAJOR * 10 + NV_TENSORRT_MINOR) >= 86\n    // setMemoryPoolLimit\n    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, 1U << 30);  // 1GB\n#else\n    config->setMaxWorkspaceSize(1 << 30);  // 1GB\n#endif\n\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n\n    delete network;\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n    (*modelStream) = engine->serialize();\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n}\n\nvoid inference(const std::string& engine_file, const std::string& 
image_file) {\n    std::cout << \"Running inference...\" << std::endl;\n    std::ifstream file(engine_file, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"Error: Engine file not found: \" << engine_file << std::endl;\n        return;\n    }\n    file.seekg(0, file.end);\n    size_t size = file.tellg();\n    file.seekg(0, file.beg);\n    char* trtModelStream = new char[size];\n    assert(trtModelStream);\n    file.read(trtModelStream, size);\n    file.close();\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Load image\n    cv::Mat img = cv::imread(image_file);\n    if (img.empty()) {\n        std::cerr << \"Error: Image not found: \" << image_file << std::endl;\n        // Release the TensorRT objects created above before bailing out\n        delete context;\n        delete engine;\n        delete runtime;\n        return;\n    }\n    cv::resize(img, img, cv::Size(INPUT_W, INPUT_H));\n    img.convertTo(img, CV_32F);\n\n    // Normalize (Mean [0.485, 0.456, 0.406], Std [0.229, 0.224, 0.225])\n    // OpenCV is BGR. 
PyTorch expects RGB.\n    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    img /= 255.0;\n\n    float mean[] = {0.485, 0.456, 0.406};\n    float std[] = {0.229, 0.224, 0.225};\n\n    // HWC -> NCHW and Normalize\n    float* hostData = new float[3 * INPUT_H * INPUT_W];\n    for (int h = 0; h < INPUT_H; ++h) {\n        for (int w = 0; w < INPUT_W; ++w) {\n            for (int c = 0; c < 3; ++c) {\n                float val = img.at<cv::Vec3f>(h, w)[c];\n                hostData[c * INPUT_H * INPUT_W + h * INPUT_W + w] = (val - mean[c]) / std[c];\n            }\n        }\n    }\n\n    void* deviceData;\n    cudaMalloc(&deviceData, 3 * INPUT_H * INPUT_W * sizeof(float));\n    cudaMemcpy(deviceData, hostData, 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice);\n\n    // Output buffer\n    // Determine output size.\n    int outputSize = 1000;  // Default ImageNet\n    // Check binding dimensions\n    int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);\n    Dims outDims = engine->getBindingDimensions(outputIndex);\n    // outputSize = outDims.d[1];\n\n    float* hostOutput = new float[outputSize];\n    void* deviceOutput;\n    cudaMalloc(&deviceOutput, outputSize * sizeof(float));\n\n    void* bindings[] = {deviceData, deviceOutput};\n\n    // Execute\n    context->executeV2(bindings);\n\n    // Copy back\n    cudaMemcpy(hostOutput, deviceOutput, outputSize * sizeof(float), cudaMemcpyDeviceToHost);\n\n    // Argmax over the raw logits (no softmax is applied)\n    float maxVal = -1e9;\n    int maxIdx = -1;\n    for (int i = 0; i < outputSize; ++i) {\n        if (hostOutput[i] > maxVal) {\n            maxVal = hostOutput[i];\n            maxIdx = i;\n        }\n    }\n    std::cout << \"Predicted Class: \" << maxIdx << \" (Score: \" << maxVal << \")\" << std::endl;\n\n    cudaFree(deviceData);\n    cudaFree(deviceOutput);\n    delete[] hostData;\n    delete[] hostOutput;\n    delete context;\n    delete engine;\n    delete runtime;\n}\n\nint main(int argc, char** argv) {\n    if (argc 
< 3) {\n        std::cerr << \"Usage: \" << argv[0] << \" <wts_path> <engine_path> [config_path]\" << std::endl;\n        std::cerr << \"Example: \" << argv[0] << \" convnextv2.wts convnextv2.engine config.yaml\" << std::endl;\n        return -1;\n    }\n\n    g_wts_path = argv[1];\n    g_engine_path = argv[2];\n    std::string config_path = (argc >= 4) ? argv[3] : \"config.yaml\";\n    g_config = loadConfig(config_path);\n\n    // Register Plugin manually if needed\n    auto* lnCreator = new LayerNormPluginCreator();\n    getPluginRegistry()->registerCreator(*lnCreator, \"\");\n\n    // Generate engine\n    IHostMemory* modelStream{nullptr};\n    APIToModel(1, &modelStream);\n    assert(modelStream != nullptr);\n    std::ofstream p(g_engine_path, std::ios::binary);\n    if (!p) {\n        std::cerr << \"Could not open plan output file\" << std::endl;\n        return -1;\n    }\n    p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n    modelStream->destroy();\n    std::cout << \"Engine generated successfully: \" << g_engine_path << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "convnextv2/src/inference_cpp.cpp",
    "content": "#include <cuda_runtime_api.h>\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include <vector>\n#include \"LayerNormPlugin.h\"\n#include \"NvInfer.h\"\n#include \"logging.h\"\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\nstd::vector<std::string> load_imagenet_labels(const std::string& label_file = \"imagenet_classes.txt\") {\n    std::vector<std::string> labels;\n    std::ifstream file(label_file);\n    if (!file.is_open()) {\n        return labels;\n    }\n    std::string line;\n    while (std::getline(file, line)) {\n        labels.push_back(line);\n    }\n    return labels;\n}\n\nstatic const char* INPUT_BLOB_NAME = \"data\";\nstatic const char* OUTPUT_BLOB_NAME = \"prob\";\n\nvoid inference(const std::string& engine_file, const std::string& image_file,\n               const std::string& label_file = \"imagenet_classes.txt\") {\n    std::cout << \"Running inference...\" << std::endl;\n\n    // Register LayerNorm plugin\n    static LayerNormPluginCreator pluginCreator;\n    getPluginRegistry()->registerCreator(pluginCreator, \"\");\n\n    std::ifstream file(engine_file, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"Error: Engine file not found: \" << engine_file << std::endl;\n        return;\n    }\n    file.seekg(0, file.end);\n    size_t size = file.tellg();\n    file.seekg(0, file.beg);\n    char* trtModelStream = new char[size];\n    assert(trtModelStream);\n    file.read(trtModelStream, size);\n    file.close();\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Determine dimensions from engine\n    int inputIndex = -1;\n    int outputIndex = -1;\n    for (int i = 0; i < engine->getNbBindings(); 
++i) {\n        if (engine->bindingIsInput(i)) {\n            inputIndex = i;\n        } else {\n            outputIndex = i;\n        }\n    }\n\n    if (inputIndex == -1 || outputIndex == -1) {\n        std::cerr << \"Error: Could not find input or output bindings in engine.\" << std::endl;\n        return;\n    }\n\n    Dims inputDims = engine->getBindingDimensions(inputIndex);\n    Dims outputDims = engine->getBindingDimensions(outputIndex);\n\n    // Assuming NCHW format for input\n    int input_h = inputDims.d[2];\n    int input_w = inputDims.d[3];\n    int input_c = inputDims.d[1];  // Usually 3\n\n    // Assuming N x NumClasses or just NumClasses\n    int outputSize = 1;\n    for (int i = 0; i < outputDims.nbDims; ++i) {\n        // Skip batch dimension if it is dynamic (-1) or 1\n        if (i == 0 && (outputDims.d[i] == -1 || outputDims.d[i] == 1))\n            continue;\n        outputSize *= outputDims.d[i];\n    }\n\n    std::cout << \"Input Dimensions: \" << input_c << \"x\" << input_h << \"x\" << input_w << std::endl;\n    std::cout << \"Output Size: \" << outputSize << std::endl;\n\n    // Load image\n    cv::Mat img = cv::imread(image_file);\n    if (img.empty()) {\n        std::cerr << \"Error: Image not found: \" << image_file << std::endl;\n        return;\n    }\n    cv::resize(img, img, cv::Size(input_w, input_h));\n    img.convertTo(img, CV_32F);\n\n    // Normalize (Mean [0.485, 0.456, 0.406], Std [0.229, 0.224, 0.225])\n    // OpenCV is BGR. 
Pytorch expects RGB.\n    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    img /= 255.0;\n\n    float mean[] = {0.485, 0.456, 0.406};\n    float std[] = {0.229, 0.224, 0.225};\n\n    // HWC -> NCHW and Normalize\n    float* hostData = new float[input_c * input_h * input_w];\n    for (int h = 0; h < input_h; ++h) {\n        for (int w = 0; w < input_w; ++w) {\n            for (int c = 0; c < input_c; ++c) {\n                float val = img.at<cv::Vec3f>(h, w)[c];\n                hostData[c * input_h * input_w + h * input_w + w] = (val - mean[c]) / std[c];\n            }\n        }\n    }\n\n    void* deviceData;\n    cudaMalloc(&deviceData, input_c * input_h * input_w * sizeof(float));\n    cudaMemcpy(deviceData, hostData, input_c * input_h * input_w * sizeof(float), cudaMemcpyHostToDevice);\n\n    // Output buffer\n    float* hostOutput = new float[outputSize];\n    void* deviceOutput;\n    cudaMalloc(&deviceOutput, outputSize * sizeof(float));\n\n    void* bindings[] = {deviceData, deviceOutput};\n    if (engine->getBindingIndex(INPUT_BLOB_NAME) != 0) {\n        bindings[inputIndex] = deviceData;\n        bindings[outputIndex] = deviceOutput;\n    }\n\n    // Execute\n    context->executeV2(bindings);\n\n    // Copy back\n    cudaMemcpy(hostOutput, deviceOutput, outputSize * sizeof(float), cudaMemcpyDeviceToHost);\n\n    // Argmax\n    float maxVal = -1e9;\n    int maxIdx = -1;\n    for (int i = 0; i < outputSize; ++i) {\n        if (hostOutput[i] > maxVal) {\n            maxVal = hostOutput[i];\n            maxIdx = i;\n        }\n    }\n\n    auto labels = load_imagenet_labels(label_file);\n    if (!labels.empty() && maxIdx < static_cast<int>(labels.size())) {\n        std::cout << \"Predicted Class: \" << maxIdx << \" - \" << labels[maxIdx] << \" (Score: \" << maxVal << \")\"\n                  << std::endl;\n    } else {\n        std::cout << \"Predicted Class: \" << maxIdx << \" (Score: \" << maxVal << \")\" << std::endl;\n    }\n\n    cudaFree(deviceData);\n 
   cudaFree(deviceOutput);\n    delete[] hostData;\n    delete[] hostOutput;\n    delete context;\n    delete engine;\n    delete runtime;\n}\n\nint main(int argc, char** argv) {\n    if (argc < 3 || argc > 4) {\n        std::cerr << \"Usage: \" << argv[0] << \" <engine_path> <image_path> [label_file]\" << std::endl;\n        std::cerr << \"Example: \" << argv[0] << \" convnextv2.engine images/test.jpg\" << std::endl;\n        std::cerr << \"         \" << argv[0] << \" convnextv2.engine images/test.jpg custom_labels.txt\" << std::endl;\n        return -1;\n    }\n\n    std::string engine_path = argv[1];\n    std::string image_path = argv[2];\n    std::string label_file = (argc == 4) ? argv[3] : \"imagenet_classes.txt\";\n\n    inference(engine_path, image_path, label_file);\n\n    return 0;\n}\n"
  },
  {
    "path": "convnextv2/src/logging.h",
    "content": "#ifndef LOGGING_H\n#define LOGGING_H\n\n#include <NvInfer.h>\n#include <iostream>\n\nusing namespace nvinfer1;\n\nclass Logger : public ILogger {\n   public:\n    Logger(Severity severity = Severity::kINFO) : reportableSeverity(severity) {}\n\n    void log(Severity severity, const char* msg) noexcept override {\n        if (severity > reportableSeverity)\n            return;\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                std::cerr << \"INTERNAL_ERROR: \";\n                break;\n            case Severity::kERROR:\n                std::cerr << \"ERROR: \";\n                break;\n            case Severity::kWARNING:\n                std::cerr << \"WARNING: \";\n                break;\n            case Severity::kINFO:\n                std::cout << \"INFO: \";\n                break;\n            default:\n                std::cout << \"VERBOSE: \";\n                break;\n        }\n        std::cout << msg << std::endl;\n    }\n\n    Severity reportableSeverity;\n};\n\n#endif\n"
  },
  {
    "path": "crnn/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(crnn)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\nif (CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n    message(\"embed_platform on\")\n    include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n    link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n    message(\"embed_platform off\")\n    include_directories(/usr/local/cuda/include)\n    link_directories(/usr/local/cuda/lib64)\nendif()\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(crnn ${PROJECT_SOURCE_DIR}/crnn.cpp)\ntarget_link_libraries(crnn nvinfer)\ntarget_link_libraries(crnn cudart)\ntarget_link_libraries(crnn ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "crnn/README.md",
    "content": "# crnn\n\nThe Pytorch implementation is [meijieru/crnn.pytorch](https://github.com/meijieru/crnn.pytorch).\n\n## How to Run\n\n```\n1. generate crnn.wts from pytorch\n\ngit clone https://github.com/wang-xinyu/tensorrtx.git\ngit clone https://github.com/meijieru/crnn.pytorch.git\n// download its weights 'crnn.pth'\n// copy tensorrtx/crnn/genwts.py into crnn.pytorch/\n// go to crnn.pytorch/\npython genwts.py\n// a file 'crnn.wts' will be generated.\n\n2. build tensorrtx/crnn and run\n\n// put crnn.wts into tensorrtx/crnn\n// go to tensorrtx/crnn\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./crnn -s  // serialize model to plan file i.e. 'crnn.engine'\n// copy crnn.pytorch/data/demo.png here\nsudo ./crnn -d  // deserialize plan file and run inference\n\n3. check the output as follows:\n\nraw: a-----v--a-i-l-a-bb-l-e---\nsim: available\n\n```\n\n## More Information\n\nSee the readme in [home page.](https://github.com/wang-xinyu/tensorrtx)\n\n## Acknowledgment\n\nThanks for the donation for this crnn tensorrt implementation from @雍.\n\n"
  },
  {
    "path": "crnn/crnn.cpp",
    "content": "#include <iostream>\r\n#include <chrono>\r\n#include <map>\r\n#include <opencv2/opencv.hpp>\r\n#include \"NvInfer.h\"\r\n#include \"cuda_runtime_api.h\"\r\n#include \"logging.h\"\r\n\r\n#define CHECK(status) \\\r\n    do\\\r\n    {\\\r\n        auto ret = (status);\\\r\n        if (ret != 0)\\\r\n        {\\\r\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\r\n            abort();\\\r\n        }\\\r\n    } while (0)\r\n\r\n#define USE_FP16  // comment out this if want to use FP32\r\n#define DEVICE 0  // GPU id\r\n#define BATCH_SIZE 1\r\n\r\n// stuff we know about the network and the input/output blobs\r\nstatic const int INPUT_H = 32;\r\nstatic const int INPUT_W = 100;\r\nstatic const int OUTPUT_SIZE = 26 * 37;\r\nconst char* INPUT_BLOB_NAME = \"data\";\r\nconst char* OUTPUT_BLOB_NAME = \"prob\";\r\nstatic Logger gLogger;\r\n\r\nconst int ks[] = {3, 3, 3, 3, 3, 3, 2};\r\nconst int ps[] = {1, 1, 1, 1, 1, 1, 0};\r\nconst int ss[] = {1, 1, 1, 1, 1, 1, 1};\r\nconst int nm[] = {64, 128, 256, 256, 512, 512, 512};\r\nconst std::string alphabet = \"-0123456789abcdefghijklmnopqrstuvwxyz\";\r\n\r\nusing namespace nvinfer1;\r\n\r\nstd::string strDecode(std::vector<int>& preds, bool raw) {\r\n    std::string str;\r\n    if (raw) {\r\n        for (auto v: preds) {\r\n            str.push_back(alphabet[v]);\r\n        }\r\n    } else {\r\n        for (size_t i = 0; i < preds.size(); i++) {\r\n            if (preds[i] == 0 || (i > 0 && preds[i - 1] == preds[i])) continue;\r\n            str.push_back(alphabet[preds[i]]);\r\n        }\r\n    }\r\n    return str;\r\n}\r\n\r\n// TensorRT weight files have a simple space delimited format:\r\n// [type] [size] <data x size in hex>\r\nstd::map<std::string, Weights> loadWeights(const std::string file) {\r\n    std::cout << \"Loading weights: \" << file << std::endl;\r\n    std::map<std::string, Weights> weightMap;\r\n\r\n    // Open weights file\r\n    std::ifstream input(file);\r\n    
assert(input.is_open() && \"Unable to load weight file. Please check that the .wts file path is correct.\");\r\n\r\n    // Read number of weight blobs\r\n    int32_t count;\r\n    input >> count;\r\n    assert(count > 0 && \"Invalid weight map file.\");\r\n\r\n    while (count--)\r\n    {\r\n        Weights wt{DataType::kFLOAT, nullptr, 0};\r\n        uint32_t size;\r\n\r\n        // Read name and size of blob\r\n        std::string name;\r\n        input >> name >> std::dec >> size;\r\n        wt.type = DataType::kFLOAT;\r\n\r\n        // Load blob (each value is a 32-bit word, so allocate sizeof(uint32_t) per element, not sizeof(pointer))\r\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\r\n        for (uint32_t x = 0, y = size; x < y; ++x)\r\n        {\r\n            input >> std::hex >> val[x];\r\n        }\r\n        wt.values = val;\r\n\r\n        wt.count = size;\r\n        weightMap[name] = wt;\r\n    }\r\n\r\n    return weightMap;\r\n}\r\n\r\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\r\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\r\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\r\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\r\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\r\n    int len = weightMap[lname + \".running_var\"].count;\r\n\r\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights scale{DataType::kFLOAT, scval, len};\r\n\r\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights shift{DataType::kFLOAT, shval, len};\r\n\r\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; 
i++) {\r\n        pval[i] = 1.0;\r\n    }\r\n    Weights power{DataType::kFLOAT, pval, len};\r\n\r\n    weightMap[lname + \".scale\"] = scale;\r\n    weightMap[lname + \".shift\"] = shift;\r\n    weightMap[lname + \".power\"] = power;\r\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\r\n    assert(scale_1);\r\n    return scale_1;\r\n}\r\n\r\nILayer* convRelu(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int i, bool use_bn = false) {\r\n    int nOut = nm[i];\r\n    IConvolutionLayer* conv = network->addConvolutionNd(input, nOut, DimsHW{ks[i], ks[i]}, weightMap[\"cnn.conv\" + std::to_string(i) + \".weight\"], weightMap[\"cnn.conv\" + std::to_string(i) + \".bias\"]);\r\n    assert(conv);\r\n    conv->setStrideNd(DimsHW{ss[i], ss[i]});\r\n    conv->setPaddingNd(DimsHW{ps[i], ps[i]});\r\n    ILayer *tmp = conv;\r\n    if (use_bn) {\r\n        tmp = addBatchNorm2d(network, weightMap, *conv->getOutput(0), \"cnn.batchnorm\" + std::to_string(i), 1e-5);\r\n    }\r\n    auto relu = network->addActivation(*tmp->getOutput(0), ActivationType::kRELU);\r\n    assert(relu);\r\n    return relu;\r\n}\r\n\r\nvoid splitLstmWeights(std::map<std::string, Weights>& weightMap, std::string lname) {\r\n    int weight_size = weightMap[lname].count;\r\n    for (int i = 0; i < 4; i++) {\r\n        Weights wt{DataType::kFLOAT, nullptr, 0};\r\n        wt.count = weight_size / 4;\r\n        float *val = reinterpret_cast<float*>(malloc(sizeof(float) * wt.count));\r\n        memcpy(val, (float*)weightMap[lname].values + wt.count * i, sizeof(float) * wt.count);\r\n        wt.values = val;\r\n        weightMap[lname + std::to_string(i)] = wt;\r\n    }\r\n}\r\n\r\nILayer* addLSTM(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int nHidden, std::string lname) {\r\n    splitLstmWeights(weightMap, lname + \".weight_ih_l0\");\r\n    splitLstmWeights(weightMap, lname + 
\".weight_hh_l0\");\r\n    splitLstmWeights(weightMap, lname + \".bias_ih_l0\");\r\n    splitLstmWeights(weightMap, lname + \".bias_hh_l0\");\r\n    splitLstmWeights(weightMap, lname + \".weight_ih_l0_reverse\");\r\n    splitLstmWeights(weightMap, lname + \".weight_hh_l0_reverse\");\r\n    splitLstmWeights(weightMap, lname + \".bias_ih_l0_reverse\");\r\n    splitLstmWeights(weightMap, lname + \".bias_hh_l0_reverse\");\r\n    Dims dims = input.getDimensions();\r\n    std::cout << \"lstm input shape: \" << dims.nbDims << \" [\" << dims.d[0] << \" \" << dims.d[1] << \" \" << dims.d[2] << \"]\"<< std::endl;\r\n    auto lstm = network->addRNNv2(input, 1, nHidden, dims.d[1], RNNOperation::kLSTM);\r\n    lstm->setDirection(RNNDirection::kBIDIRECTION);\r\n    lstm->setWeightsForGate(0, RNNGateType::kINPUT, true, weightMap[lname + \".weight_ih_l00\"]);\r\n    lstm->setWeightsForGate(0, RNNGateType::kFORGET, true, weightMap[lname + \".weight_ih_l01\"]);\r\n    lstm->setWeightsForGate(0, RNNGateType::kCELL, true, weightMap[lname + \".weight_ih_l02\"]);\r\n    lstm->setWeightsForGate(0, RNNGateType::kOUTPUT, true, weightMap[lname + \".weight_ih_l03\"]);\r\n\r\n    lstm->setWeightsForGate(0, RNNGateType::kINPUT, false, weightMap[lname + \".weight_hh_l00\"]);\r\n    lstm->setWeightsForGate(0, RNNGateType::kFORGET, false, weightMap[lname + \".weight_hh_l01\"]);\r\n    lstm->setWeightsForGate(0, RNNGateType::kCELL, false, weightMap[lname + \".weight_hh_l02\"]);\r\n    lstm->setWeightsForGate(0, RNNGateType::kOUTPUT, false, weightMap[lname + \".weight_hh_l03\"]);\r\n\r\n    lstm->setBiasForGate(0, RNNGateType::kINPUT, true, weightMap[lname + \".bias_ih_l00\"]);\r\n    lstm->setBiasForGate(0, RNNGateType::kFORGET, true, weightMap[lname + \".bias_ih_l01\"]);\r\n    lstm->setBiasForGate(0, RNNGateType::kCELL, true, weightMap[lname + \".bias_ih_l02\"]);\r\n    lstm->setBiasForGate(0, RNNGateType::kOUTPUT, true, weightMap[lname + \".bias_ih_l03\"]);\r\n\r\n    lstm->setBiasForGate(0, 
RNNGateType::kINPUT, false, weightMap[lname + \".bias_hh_l00\"]);\r\n    lstm->setBiasForGate(0, RNNGateType::kFORGET, false, weightMap[lname + \".bias_hh_l01\"]);\r\n    lstm->setBiasForGate(0, RNNGateType::kCELL, false, weightMap[lname + \".bias_hh_l02\"]);\r\n    lstm->setBiasForGate(0, RNNGateType::kOUTPUT, false, weightMap[lname + \".bias_hh_l03\"]);\r\n\r\n    lstm->setWeightsForGate(1, RNNGateType::kINPUT, true, weightMap[lname + \".weight_ih_l0_reverse0\"]);\r\n    lstm->setWeightsForGate(1, RNNGateType::kFORGET, true, weightMap[lname + \".weight_ih_l0_reverse1\"]);\r\n    lstm->setWeightsForGate(1, RNNGateType::kCELL, true, weightMap[lname + \".weight_ih_l0_reverse2\"]);\r\n    lstm->setWeightsForGate(1, RNNGateType::kOUTPUT, true, weightMap[lname + \".weight_ih_l0_reverse3\"]);\r\n\r\n    lstm->setWeightsForGate(1, RNNGateType::kINPUT, false, weightMap[lname + \".weight_hh_l0_reverse0\"]);\r\n    lstm->setWeightsForGate(1, RNNGateType::kFORGET, false, weightMap[lname + \".weight_hh_l0_reverse1\"]);\r\n    lstm->setWeightsForGate(1, RNNGateType::kCELL, false, weightMap[lname + \".weight_hh_l0_reverse2\"]);\r\n    lstm->setWeightsForGate(1, RNNGateType::kOUTPUT, false, weightMap[lname + \".weight_hh_l0_reverse3\"]);\r\n\r\n    lstm->setBiasForGate(1, RNNGateType::kINPUT, true, weightMap[lname + \".bias_ih_l0_reverse0\"]);\r\n    lstm->setBiasForGate(1, RNNGateType::kFORGET, true, weightMap[lname + \".bias_ih_l0_reverse1\"]);\r\n    lstm->setBiasForGate(1, RNNGateType::kCELL, true, weightMap[lname + \".bias_ih_l0_reverse2\"]);\r\n    lstm->setBiasForGate(1, RNNGateType::kOUTPUT, true, weightMap[lname + \".bias_ih_l0_reverse3\"]);\r\n\r\n    lstm->setBiasForGate(1, RNNGateType::kINPUT, false, weightMap[lname + \".bias_hh_l0_reverse0\"]);\r\n    lstm->setBiasForGate(1, RNNGateType::kFORGET, false, weightMap[lname + \".bias_hh_l0_reverse1\"]);\r\n    lstm->setBiasForGate(1, RNNGateType::kCELL, false, weightMap[lname + \".bias_hh_l0_reverse2\"]);\r\n    
lstm->setBiasForGate(1, RNNGateType::kOUTPUT, false, weightMap[lname + \".bias_hh_l0_reverse3\"]);\r\n    return lstm;\r\n}\r\n\r\n// Create the engine using only the API and not any parser.\r\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\r\n    INetworkDefinition* network = builder->createNetworkV2(0U);\r\n\r\n    // Create input tensor of shape {C, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\r\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});\r\n    assert(data);\r\n\r\n    std::map<std::string, Weights> weightMap = loadWeights(\"../crnn.wts\");\r\n\r\n    // cnn\r\n    auto x = convRelu(network, weightMap, *data, 0);\r\n    auto p = network->addPoolingNd(*x->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\r\n    p->setStrideNd(DimsHW{2, 2});\r\n    x = convRelu(network, weightMap, *p->getOutput(0), 1);\r\n    p = network->addPoolingNd(*x->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\r\n    p->setStrideNd(DimsHW{2, 2});\r\n    x = convRelu(network, weightMap, *p->getOutput(0), 2, true);\r\n    x = convRelu(network, weightMap, *x->getOutput(0), 3);\r\n    p = network->addPoolingNd(*x->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\r\n    p->setStrideNd(DimsHW{2, 1});\r\n    p->setPaddingNd(DimsHW{0, 1});\r\n    x = convRelu(network, weightMap, *p->getOutput(0), 4, true);\r\n    x = convRelu(network, weightMap, *x->getOutput(0), 5);\r\n    p = network->addPoolingNd(*x->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\r\n    p->setStrideNd(DimsHW{2, 1});\r\n    p->setPaddingNd(DimsHW{0, 1});\r\n    x = convRelu(network, weightMap, *p->getOutput(0), 6, true);\r\n\r\n    auto sfl = network->addShuffle(*x->getOutput(0));\r\n    sfl->setFirstTranspose(Permutation{1, 2, 0});\r\n\r\n    // rnn\r\n    auto lstm0 = addLSTM(network, weightMap, *sfl->getOutput(0), 256, \"rnn.0.rnn\");\r\n    auto sfl0 = network->addShuffle(*lstm0->getOutput(0));\r\n    
sfl0->setReshapeDimensions(Dims4{26, 1, 1, 512});\r\n    auto fc0 = network->addFullyConnected(*sfl0->getOutput(0), 256, weightMap[\"rnn.0.embedding.weight\"], weightMap[\"rnn.0.embedding.bias\"]);\r\n\r\n    sfl = network->addShuffle(*fc0->getOutput(0));\r\n    sfl->setFirstTranspose(Permutation{2, 3, 0, 1});\r\n    sfl->setReshapeDimensions(Dims3{1, 26, 256});\r\n\r\n    auto lstm1 = addLSTM(network, weightMap, *sfl->getOutput(0), 256, \"rnn.1.rnn\");\r\n    auto sfl1 = network->addShuffle(*lstm1->getOutput(0));\r\n    sfl1->setReshapeDimensions(Dims4{26, 1, 1, 512});\r\n    auto fc1 = network->addFullyConnected(*sfl1->getOutput(0), 37, weightMap[\"rnn.1.embedding.weight\"], weightMap[\"rnn.1.embedding.bias\"]);\r\n    Dims dims = fc1->getOutput(0)->getDimensions();\r\n    std::cout << \"fc1 shape \" << dims.d[0] << \" \" << dims.d[1] << \" \" << dims.d[2] << std::endl;\r\n\r\n    fc1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\r\n    network->markOutput(*fc1->getOutput(0));\r\n\r\n    // Build engine\r\n    builder->setMaxBatchSize(maxBatchSize);\r\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\r\n#ifdef USE_FP16\r\n    config->setFlag(BuilderFlag::kFP16);\r\n#endif\r\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\r\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\r\n    std::cout << \"Build engine successfully!\" << std::endl;\r\n\r\n    // Don't need the network any more\r\n    network->destroy();\r\n\r\n    // Release host memory\r\n    for (auto& mem : weightMap)\r\n    {\r\n        free((void*) (mem.second.values));\r\n    }\r\n\r\n    return engine;\r\n}\r\n\r\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\r\n    // Create builder\r\n    IBuilder* builder = createInferBuilder(gLogger);\r\n    IBuilderConfig* config = builder->createBuilderConfig();\r\n\r\n    // Create model to populate the network, then set the outputs and create an engine\r\n    ICudaEngine* 
engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\r\n    assert(engine != nullptr);\r\n\r\n    // Serialize the engine\r\n    (*modelStream) = engine->serialize();\r\n\r\n    // Close everything down\r\n    engine->destroy();\r\n    config->destroy();\r\n    builder->destroy();\r\n}\r\n\r\nvoid doInference(IExecutionContext& context, cudaStream_t& stream, void **buffers, float* input, float* output, int batchSize) {\r\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\r\n    CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 1 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\r\n    context.enqueue(batchSize, buffers, stream, nullptr);\r\n    CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\r\n    cudaStreamSynchronize(stream);\r\n}\r\n\r\nint main(int argc, char** argv) {\r\n    cudaSetDevice(DEVICE);\r\n    // create a model using the API directly and serialize it to a stream\r\n    char *trtModelStream{nullptr};\r\n    size_t size{0};\r\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\r\n        IHostMemory* modelStream{nullptr};\r\n        APIToModel(BATCH_SIZE, &modelStream);\r\n        assert(modelStream != nullptr);\r\n        std::ofstream p(\"crnn.engine\", std::ios::binary);\r\n        if (!p) {\r\n            std::cerr << \"could not open plan output file\" << std::endl;\r\n            return -1;\r\n        }\r\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\r\n        modelStream->destroy();\r\n        return 0;\r\n    } else if (argc == 2 && std::string(argv[1]) == \"-d\") {\r\n        std::ifstream file(\"crnn.engine\", std::ios::binary);\r\n        if (file.good()) {\r\n            file.seekg(0, file.end);\r\n            size = file.tellg();\r\n            file.seekg(0, file.beg);\r\n            trtModelStream = new char[size];\r\n            
assert(trtModelStream);\r\n            file.read(trtModelStream, size);\r\n            file.close();\r\n        } else {\r\n            std::cerr << \"crnn.engine not found, run ./crnn -s first\" << std::endl;\r\n            return -1;\r\n        }\r\n    } else {\r\n        std::cerr << \"arguments not right!\" << std::endl;\r\n        std::cerr << \"./crnn -s  // serialize model to plan file\" << std::endl;\r\n        std::cerr << \"./crnn -d  // deserialize plan file and run inference\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // prepare input data ---------------------------\r\n    static float data[BATCH_SIZE * 1 * INPUT_H * INPUT_W];\r\n    //for (int i = 0; i < 1 * INPUT_H * INPUT_W; i++)\r\n    //    data[i] = 1.0;\r\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\r\n    IRuntime* runtime = createInferRuntime(gLogger);\r\n    assert(runtime != nullptr);\r\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\r\n    assert(engine != nullptr);\r\n    IExecutionContext* context = engine->createExecutionContext();\r\n    assert(context != nullptr);\r\n    delete[] trtModelStream;\r\n    assert(engine->getNbBindings() == 2);\r\n    void* buffers[2];\r\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\r\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\r\n    const int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);\r\n    const int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);\r\n    assert(inputIndex == 0);\r\n    assert(outputIndex == 1);\r\n    // Create GPU buffers on device\r\n    CHECK(cudaMalloc(&buffers[inputIndex], BATCH_SIZE * 1 * INPUT_H * INPUT_W * sizeof(float)));\r\n    CHECK(cudaMalloc(&buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float)));\r\n    // Create stream\r\n    cudaStream_t stream;\r\n    CHECK(cudaStreamCreate(&stream));\r\n\r\n    cv::Mat img = cv::imread(\"demo.png\");\r\n    if (img.empty()) {\r\n        std::cerr << \"demo.png not found !!!\" << std::endl;\r\n        return -1;\r\n    }\r\n    
cv::cvtColor(img, img, CV_BGR2GRAY);\r\n    cv::resize(img, img, cv::Size(INPUT_W, INPUT_H));\r\n    for (int i = 0; i < INPUT_H * INPUT_W; i++) {\r\n        data[i] = ((float)img.at<uchar>(i) / 255.0 - 0.5) * 2.0;\r\n    }\r\n\r\n    // Run inference\r\n    auto start = std::chrono::system_clock::now();\r\n    doInference(*context, stream, buffers, data, prob, BATCH_SIZE);\r\n    auto end = std::chrono::system_clock::now();\r\n    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\r\n\r\n    std::vector<int> preds;\r\n    for (int i = 0; i < 26; i++) {\r\n        int maxj = 0;\r\n        for (int j = 1; j < 37; j++) {\r\n            if (prob[37 * i + j] > prob[37 * i + maxj]) maxj = j;\r\n        }\r\n        preds.push_back(maxj);\r\n    }\r\n    std::cout << \"raw: \" << strDecode(preds, true) << std::endl;\r\n    std::cout << \"sim: \" << strDecode(preds, false) << std::endl;\r\n\r\n    // Release stream and buffers\r\n    cudaStreamDestroy(stream);\r\n    CHECK(cudaFree(buffers[inputIndex]));\r\n    CHECK(cudaFree(buffers[outputIndex]));\r\n    // Destroy the engine\r\n    context->destroy();\r\n    engine->destroy();\r\n    runtime->destroy();\r\n\r\n    // Print histogram of the output distribution\r\n    //std::cout << \"\\nOutput:\\n\\n\";\r\n    //for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\r\n    //{\r\n    //    std::cout << prob[i] << \", \";\r\n    //    if (i % 10 == 0) std::cout << std::endl;\r\n    //}\r\n    //std::cout << std::endl;\r\n\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "crnn/genwts.py",
    "content": "import torch\nfrom torch.autograd import Variable\nimport utils\nimport models.crnn as crnn\nimport struct\n\nmodel_path = './data/crnn.pth'\n\nmodel = crnn.CRNN(32, 1, 37, 256)\nif torch.cuda.is_available():\n    model = model.cuda()\nprint('loading pretrained model from %s' % model_path)\nmodel.load_state_dict(torch.load(model_path))\n\nimage = torch.ones(1, 1, 32, 100)\nif torch.cuda.is_available():\n    image = image.cuda()\n\nmodel.eval()\nprint(model)\nprint('image shape ', image.shape)\npreds = model(image)\n\nf = open(\"crnn.wts\", 'w')\nf.write(\"{}\\n\".format(len(model.state_dict().keys())))\nfor k,v in model.state_dict().items():\n    print('key: ', k)\n    print('value: ', v.shape)\n    vr = v.reshape(-1).cpu().numpy()\n    f.write(\"{} {}\".format(k, len(vr)))\n    for vv in vr:\n        f.write(\" \")\n        f.write(struct.pack(\">f\", float(vv)).hex())\n    f.write(\"\\n\")\n\n"
  },
  {
    "path": "crnn/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "csrnet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(csrnet)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\n# cuda\ninclude_directories(/usr/local/cuda/targets/x86_64-linux/include )\nlink_directories(/usr/local/cuda/targets/x86_64-linux/lib)\n\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\n# opencv\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\ninclude_directories(${PROJECT_SOURCE_DIR}/)\n\nadd_executable(csrnet csrnet.cpp)\ntarget_link_libraries(csrnet nvinfer cudart ${OpenCV_LIBS})"
  },
  {
    "path": "csrnet/README.md",
    "content": "# csrnet\n\nThe Pytorch implementation is [leeyeehoo/CSRNet-pytorch](https://github.com/leeyeehoo/CSRNet-pytorch).\n\nThis repo is a TensorRT implementation of CSRNet.\n\npaper : [CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes](https://arxiv.org/abs/1802.10062)\n\nDev environment:\n- Ubuntu 22.04\n- TensorRT 8.6\n- OpenCV 4.5.4\n- CMake 3.24\n- GPU Driver 535.113.01\n- CUDA 12.2\n- RTX3080\n\n\n# how to run\n\n```bash\n1. generate csrnet engine\ngit clone https://github.com/leeyeehoo/CSRNet-pytorch.git\ngit clone https://github.com/wang-xinyu/tensorrtx.git\n// copy gen_wts.py to CSRNet-pytorch\n// generate wts file\npython gen_wts.py\n// csrnet wts will be generated in CSRNet-pytorch\n\n2. build csrnet.engine\n// mv CSRNet-pytorch/csrnet.engine to tensorrtx/csrnet\nmv CSRNet-pytorch/csrnet.wts tensorrtx/csrnet\n// build\nmkdir build\ncmake ..\nmake\nsudo ./csrnet -s  ./csrnet.wts\n\nLoading weights: ./csrnet.wts\nbuild engine successfully : ./csrnet.engine\n\n// download images https://github.com/wang-xinyu/tensorrtx/assets/46584679/46bc4def-e573-44ae-996d-5d68927c78ff and copy to images\nsudo ./csrnet -d  ./images\n\n// output e.g\n// enqueueV2 time: 0.0323869s\n// detect time:44ms\n// people num :22.9101 write_path: ../images/data.jpg\n```\n\n\n# result \n\ninference people num: 22.9101\n\n<p align=\"center\">\n<img src= https://raw.githubusercontent.com/wang-xinyu/tensorrtx/dbf857d25f77bf64113fc99a745ccf4973bdd44e/Density_Plot.jpg>\n</p>\n"
  },
  {
    "path": "csrnet/config.h",
    "content": "#pragma once\n\nconst static char *kInputTensorName = \"data\";\nconst static char *kOutputTensorName = \"prob\";\nconst static char *kEngineFile = \"./csrnet.engine\";\n\nconst static int kBatchSize = 1;\n\nconst static int MAX_INPUT_SIZE = 1440; // 32x\nconst static int MIN_INPUT_SIZE = 608;\nconst static int OPT_INPUT_W = 1152;\nconst static int OPT_INPUT_H = 640;\n\nconstexpr static int kMaxInputImageSize = MAX_INPUT_SIZE * MAX_INPUT_SIZE * 3;\nconstexpr static int kMaxOutputProbSize =\n    (MAX_INPUT_SIZE * MAX_INPUT_SIZE) >> 6;"
  },
  {
    "path": "csrnet/csrnet.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include <chrono>\n#include <config.h>\n#include <cstring>\n#include <dirent.h>\n#include <fstream>\n#include <iostream>\n#include <logging.h>\n#include <map>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n#include <vector>\nusing namespace nvinfer1;\n\n#define CHECK(status)                                                          \\\n  do {                                                                         \\\n    auto ret = (status);                                                       \\\n    if (ret != 0) {                                                            \\\n      std::cerr << \"Cuda failure: \" << ret << std::endl;                       \\\n      abort();                                                                 \\\n    }                                                                          \\\n  } while (0)\n\nstatic Logger gLogger;\nstatic char *kWTSFile = \"\";\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n  std::cout << \"Loading weights: \" << file << std::endl;\n  std::map<std::string, Weights> weightMap;\n\n  // Open weights file\n  std::ifstream input(file);\n  assert(input.is_open() && \"Unable to load weight file.\");\n\n  // Read number of weight blobs\n  int32_t count;\n  input >> count;\n  assert(count > 0 && \"Invalid weight map file.\");\n\n  while (count--) {\n    Weights wt{DataType::kFLOAT, nullptr, 0};\n    uint32_t size;\n\n    // Read name and type of blob\n    std::string name;\n    input >> name >> std::dec >> size;\n    wt.type = DataType::kFLOAT;\n\n    // Load blob\n    uint32_t *val = reinterpret_cast<uint32_t *>(malloc(sizeof(val) * size));\n    for (uint32_t x = 0, y = size; x < y; ++x) {\n      input >> std::hex >> val[x];\n    }\n    wt.values = val;\n\n    wt.count = size;\n    weightMap[name] = wt;\n  }\n\n  return weightMap;\n}\n// clang-format off\n/*\nCSRNet(\n (frontend): Sequential(\n    (0): 
Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (1): ReLU(inplace=True)\n    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (3): ReLU(inplace=True)\n    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (6): ReLU(inplace=True)\n    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (8): ReLU(inplace=True)\n    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (11): ReLU(inplace=True)\n    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (13): ReLU(inplace=True)\n    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (15): ReLU(inplace=True)\n    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)\n    (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (18): ReLU(inplace=True)\n    (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (20): ReLU(inplace=True)\n    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))\n    (22): ReLU(inplace=True)\n  )\n  (backend): Sequential(\n    (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2),\n    dilation=(2, 2)) (1): ReLU(inplace=True) (2): Conv2d(512, 512,\n    kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2)) (3):\n    ReLU(inplace=True) (4): Conv2d(512, 512, kernel_size=(3, 3), stride=(1,\n    1), padding=(2, 2), dilation=(2, 2)) (5): ReLU(inplace=True) (6):\n    Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2),\n    dilation=(2, 2)) (7): ReLU(inplace=True) (8): Conv2d(256, 128,\n    kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2)) (9):\n    ReLU(inplace=True) (10): Conv2d(128, 64, 
kernel_size=(3, 3), stride=(1,\n    1), padding=(2, 2), dilation=(2, 2)) (11): ReLU(inplace=True)\n  )\n  (output_layer): Conv2d(64, 1, kernel_size=(1, 1), stride=(1, 1))\n)\n*/\n// clang-format on\nvoid doInference(IExecutionContext &context, float *input, float *output,\n                 int input_h, int input_w) {\n  const ICudaEngine &engine = context.getEngine();\n\n  uint64_t input_size = 3 * input_h * input_w * sizeof(float);\n  uint64_t output_size = ((input_h * input_w) >> 6) * sizeof(float);\n\n  // Pointers to input and output device buffers to pass to engine.\n  // Engine requires exactly IEngine::getNbBindings() number of buffers.\n  assert(engine.getNbBindings() == 2);\n  void *buffers[2];\n\n  // In order to bind the buffers, we need to know the names of the input and\n  // output tensors. Note that indices are guaranteed to be less than\n  // IEngine::getNbBindings()\n  const int inputIndex = engine.getBindingIndex(kInputTensorName);\n  const int outputIndex = engine.getBindingIndex(kOutputTensorName);\n  context.setBindingDimensions(inputIndex, Dims4(1, 3, input_h, input_w));\n\n  // Create GPU buffers on device\n  CHECK(cudaMalloc(&buffers[inputIndex], input_size));\n  CHECK(cudaMalloc(&buffers[outputIndex], output_size));\n\n  // Create stream\n  cudaStream_t stream;\n  CHECK(cudaStreamCreate(&stream));\n\n  // DMA input batch data to device, infer on the batch asynchronously, and DMA\n  // output back to host\n  CHECK(cudaMemcpyAsync(buffers[inputIndex], input, input_size,\n                        cudaMemcpyHostToDevice, stream));\n  auto t1 = std::chrono::high_resolution_clock::now();\n  context.enqueueV2(buffers, stream, nullptr);\n  std::cout << \"enqueueV2 time: \"\n            << std::chrono::duration<float>(\n                   std::chrono::high_resolution_clock::now() - t1)\n                   .count()\n            << \"s\" << std::endl;\n  CHECK(cudaMemcpyAsync(output, buffers[outputIndex], output_size,\n                        
cudaMemcpyDeviceToHost, stream));\n  cudaStreamSynchronize(stream);\n\n  // Release stream and buffers\n  cudaStreamDestroy(stream);\n  CHECK(cudaFree(buffers[inputIndex]));\n  CHECK(cudaFree(buffers[outputIndex]));\n}\nICudaEngine *createEngine(unsigned int maxBatchSize, IBuilder *builder,\n                          IBuilderConfig *config, DataType dt) {\n\n  //   INetworkDefinition *network = builder->createNetworkV2(0U);\n  const auto explicitBatch =\n      1U << static_cast<uint32_t>(\n          NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);\n  INetworkDefinition *network = builder->createNetworkV2(explicitBatch);\n  ITensor *data = network->addInput(kInputTensorName, dt, Dims4{1, 3, -1, -1});\n  assert(data);\n  std::map<std::string, Weights> weightMap = loadWeights(kWTSFile);\n\n  IConvolutionLayer *conv1 = network->addConvolutionNd(\n      *data, 64, DimsHW{3, 3}, weightMap[\"frontend.0.weight\"],\n      weightMap[\"frontend.0.bias\"]);\n  assert(conv1);\n  conv1->setStrideNd(DimsHW{1, 1});\n  conv1->setPaddingNd(DimsHW{1, 1});\n\n  IActivationLayer *relu1 =\n      network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n\n  assert(relu1);\n\n  auto conv2 = network->addConvolutionNd(*relu1->getOutput(0), 64, DimsHW{3, 3},\n                                         weightMap[\"frontend.2.weight\"],\n                                         weightMap[\"frontend.2.bias\"]);\n  assert(conv2);\n  conv2->setStrideNd(DimsHW{1, 1});\n  conv2->setPaddingNd(DimsHW{1, 1});\n  auto relu2 =\n      network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);\n  assert(relu2);\n  auto pool1 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kMAX,\n                                     DimsHW{2, 2});\n  assert(pool1);\n  pool1->setStrideNd(DimsHW{2, 2});\n  auto conv3 = network->addConvolutionNd(\n      *pool1->getOutput(0), 128, DimsHW{3, 3}, weightMap[\"frontend.5.weight\"],\n      weightMap[\"frontend.5.bias\"]);\n  assert(conv3);\n  
conv3->setStrideNd(DimsHW{1, 1});\n\n  conv3->setPaddingNd(DimsHW{1, 1});\n  auto relu3 =\n      network->addActivation(*conv3->getOutput(0), ActivationType::kRELU);\n  assert(relu3);\n\n  auto conv4 = network->addConvolutionNd(\n      *relu3->getOutput(0), 128, DimsHW{3, 3}, weightMap[\"frontend.7.weight\"],\n      weightMap[\"frontend.7.bias\"]);\n  assert(conv4);\n  conv4->setStrideNd(DimsHW{1, 1});\n  conv4->setPaddingNd(DimsHW{1, 1});\n  auto relu4 =\n      network->addActivation(*conv4->getOutput(0), ActivationType::kRELU);\n  assert(relu4);\n\n  auto pool2 = network->addPoolingNd(*relu4->getOutput(0), PoolingType::kMAX,\n                                     DimsHW{2, 2});\n  assert(pool2);\n  pool2->setStrideNd(DimsHW{2, 2});\n\n  auto conv5 = network->addConvolutionNd(\n      *pool2->getOutput(0), 256, DimsHW{3, 3}, weightMap[\"frontend.10.weight\"],\n      weightMap[\"frontend.10.bias\"]);\n  assert(conv5);\n  conv5->setStrideNd(DimsHW{1, 1});\n  conv5->setPaddingNd(DimsHW{1, 1});\n  auto relu5 =\n      network->addActivation(*conv5->getOutput(0), ActivationType::kRELU);\n  assert(relu5);\n\n  auto conv6 = network->addConvolutionNd(\n      *relu5->getOutput(0), 256, DimsHW{3, 3}, weightMap[\"frontend.12.weight\"],\n      weightMap[\"frontend.12.bias\"]);\n  assert(conv6);\n  conv6->setStrideNd(DimsHW{1, 1});\n  conv6->setPaddingNd(DimsHW{1, 1});\n  auto relu6 =\n      network->addActivation(*conv6->getOutput(0), ActivationType::kRELU);\n  assert(relu6);\n  auto conv7 = network->addConvolutionNd(\n      *relu6->getOutput(0), 256, DimsHW{3, 3}, weightMap[\"frontend.14.weight\"],\n      weightMap[\"frontend.14.bias\"]);\n  assert(conv7);\n  conv7->setStrideNd(DimsHW{1, 1});\n  conv7->setPaddingNd(DimsHW{1, 1});\n  auto relu7 =\n      network->addActivation(*conv7->getOutput(0), ActivationType::kRELU);\n  assert(relu7);\n  auto pool3 = network->addPoolingNd(*relu7->getOutput(0), PoolingType::kMAX,\n                                     DimsHW{2, 2});\n  
assert(pool3);\n  pool3->setStrideNd(DimsHW{2, 2});\n  auto conv8 = network->addConvolutionNd(\n      *pool3->getOutput(0), 512, DimsHW{3, 3}, weightMap[\"frontend.17.weight\"],\n      weightMap[\"frontend.17.bias\"]);\n  assert(conv8);\n  conv8->setStrideNd(DimsHW{1, 1});\n  conv8->setPaddingNd(DimsHW{1, 1});\n  auto relu8 =\n      network->addActivation(*conv8->getOutput(0), ActivationType::kRELU);\n  assert(relu8);\n  auto conv9 = network->addConvolutionNd(\n      *relu8->getOutput(0), 512, DimsHW{3, 3}, weightMap[\"frontend.19.weight\"],\n      weightMap[\"frontend.19.bias\"]);\n  assert(conv9);\n  conv9->setStrideNd(DimsHW{1, 1});\n  conv9->setPaddingNd(DimsHW{1, 1});\n  auto relu9 =\n      network->addActivation(*conv9->getOutput(0), ActivationType::kRELU);\n  assert(relu9);\n  auto conv10 = network->addConvolutionNd(\n      *relu9->getOutput(0), 512, DimsHW{3, 3}, weightMap[\"frontend.21.weight\"],\n      weightMap[\"frontend.21.bias\"]);\n  assert(conv10);\n  conv10->setStrideNd(DimsHW{1, 1});\n  conv10->setPaddingNd(DimsHW{1, 1});\n  auto relu10 =\n      network->addActivation(*conv10->getOutput(0), ActivationType::kRELU);\n  assert(relu10);\n  // backend\n  auto conv11 = network->addConvolutionNd(\n      *relu10->getOutput(0), 512, DimsHW{3, 3}, weightMap[\"backend.0.weight\"],\n      weightMap[\"backend.0.bias\"]);\n  assert(conv11);\n  conv11->setPaddingNd(DimsHW{2, 2});\n  conv11->setStrideNd(DimsHW{1, 1});\n  conv11->setDilationNd(DimsHW{2, 2});\n  auto relu11 =\n      network->addActivation(*conv11->getOutput(0), ActivationType::kRELU);\n\n  assert(relu11);\n  auto conv12 = network->addConvolutionNd(\n      *relu11->getOutput(0), 512, DimsHW{3, 3}, weightMap[\"backend.2.weight\"],\n      weightMap[\"backend.2.bias\"]);\n  assert(conv12);\n  conv12->setPaddingNd(DimsHW{2, 2});\n  conv12->setStrideNd(DimsHW{1, 1});\n  conv12->setDilationNd(DimsHW{2, 2});\n  auto relu12 =\n      network->addActivation(*conv12->getOutput(0), ActivationType::kRELU);\n  
assert(relu12);\n\n  auto conv13 = network->addConvolutionNd(\n      *relu12->getOutput(0), 512, DimsHW{3, 3}, weightMap[\"backend.4.weight\"],\n      weightMap[\"backend.4.bias\"]);\n  assert(conv13);\n  conv13->setPaddingNd(DimsHW{2, 2});\n  conv13->setStrideNd(DimsHW{1, 1});\n  conv13->setDilationNd(DimsHW{2, 2});\n  auto relu13 =\n      network->addActivation(*conv13->getOutput(0), ActivationType::kRELU);\n  assert(relu13);\n\n  auto conv14 = network->addConvolutionNd(\n      *relu13->getOutput(0), 256, DimsHW{3, 3}, weightMap[\"backend.6.weight\"],\n      weightMap[\"backend.6.bias\"]);\n  assert(conv14);\n  conv14->setPaddingNd(DimsHW{2, 2});\n  conv14->setStrideNd(DimsHW{1, 1});\n  conv14->setDilationNd(DimsHW{2, 2});\n  auto relu14 =\n      network->addActivation(*conv14->getOutput(0), ActivationType::kRELU);\n  assert(relu14);\n  auto conv15 = network->addConvolutionNd(\n      *relu14->getOutput(0), 128, DimsHW{3, 3}, weightMap[\"backend.8.weight\"],\n      weightMap[\"backend.8.bias\"]);\n  assert(conv15);\n  conv15->setPaddingNd(DimsHW{2, 2});\n  conv15->setStrideNd(DimsHW{1, 1});\n  conv15->setDilationNd(DimsHW{2, 2});\n  auto relu15 =\n      network->addActivation(*conv15->getOutput(0), ActivationType::kRELU);\n  assert(relu15);\n  auto conv16 = network->addConvolutionNd(\n      *relu15->getOutput(0), 64, DimsHW{3, 3}, weightMap[\"backend.10.weight\"],\n      weightMap[\"backend.10.bias\"]);\n  assert(conv16);\n  conv16->setPaddingNd(DimsHW{2, 2});\n  conv16->setStrideNd(DimsHW{1, 1});\n  conv16->setDilationNd(DimsHW{2, 2});\n  auto relu16 =\n      network->addActivation(*conv16->getOutput(0), ActivationType::kRELU);\n\n  assert(relu16);\n\n  auto conv17 = network->addConvolutionNd(\n      *relu16->getOutput(0), 1, DimsHW{1, 1}, weightMap[\"output_layer.weight\"],\n      weightMap[\"output_layer.bias\"]);\n  assert(conv17);\n\n  conv17->setStrideNd(DimsHW{1, 1});\n  conv17->getOutput(0)->setName(kOutputTensorName);\n  
network->markOutput(*conv17->getOutput(0));\n\n  IOptimizationProfile *profile = builder->createOptimizationProfile();\n  profile->setDimensions(kInputTensorName, OptProfileSelector::kMIN,\n                         Dims4(1, 3, MIN_INPUT_SIZE, MIN_INPUT_SIZE));\n  profile->setDimensions(kInputTensorName, OptProfileSelector::kOPT,\n                         Dims4(1, 3, OPT_INPUT_H, OPT_INPUT_W));\n  profile->setDimensions(kInputTensorName, OptProfileSelector::kMAX,\n                         Dims4(1, 3, MAX_INPUT_SIZE, MAX_INPUT_SIZE));\n  config->addOptimizationProfile(profile);\n\n  builder->setMaxBatchSize(kBatchSize);\n  config->setMaxWorkspaceSize(16 << 20);\n#ifdef USE_FP16\n  config->setFlag(BuilderFlag::kFP16);\n#endif\n  ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);\n\n  printf(\"build engine successfully : %s\\n\", kEngineFile);\n  // Don't need the network any more\n  network->destroy();\n\n  // Release host memory\n  for (auto &mem : weightMap) {\n    free((void *)(mem.second.values));\n  }\n\n  return engine;\n}\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory **modelStream) {\n  // Create builder\n  IBuilder *builder = createInferBuilder(gLogger);\n  IBuilderConfig *config = builder->createBuilderConfig();\n\n  // Create model to populate the network, then set the outputs and create an\n  // engine\n  ICudaEngine *engine =\n      createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n  assert(engine != nullptr);\n\n  // Serialize the engine\n  (*modelStream) = engine->serialize();\n\n  // Close everything down\n  engine->destroy();\n  config->destroy();\n  builder->destroy();\n}\n\nint read_files_in_dir(const char *p_dir_name,\n                      std::vector<std::string> &file_names) {\n  DIR *p_dir = opendir(p_dir_name);\n  if (p_dir == nullptr) {\n    return -1;\n  }\n\n  struct dirent *p_file = nullptr;\n  while ((p_file = readdir(p_dir)) != nullptr) {\n    if (strcmp(p_file->d_name, \".\") != 0 && 
strcmp(p_file->d_name, \"..\") != 0) {\n      std::string cur_file_name(p_file->d_name);\n      file_names.push_back(cur_file_name);\n    }\n  }\n  closedir(p_dir);\n  return 0;\n}\n\nint main(int argc, char **argv) {\n\n  if (argc != 3) {\n    std::cerr << \"arguments not right!\" << std::endl;\n    std::cerr << \"./csrnet -s  ./csrnet.wts // serialize model to plan file\"\n              << std::endl;\n    std::cerr\n        << \"./csrnet -d  ../images  // deserialize plan file and run inference\"\n        << std::endl;\n    return -1;\n  }\n  char *trtModelStream{nullptr};\n  size_t size{0};\n\n  if (std::string(argv[1]) == \"-s\") {\n    IHostMemory *modelStream{nullptr};\n    kWTSFile = argv[2];\n    APIToModel(kBatchSize, &modelStream);\n    assert(modelStream != nullptr);\n\n    std::ofstream p(kEngineFile, std::ios::binary);\n    if (!p) {\n      std::cerr << \"could not open plan output file\" << std::endl;\n      return -1;\n    }\n    p.write(reinterpret_cast<const char *>(modelStream->data()),\n            modelStream->size());\n    modelStream->destroy();\n    return 1;\n  } else if (std::string(argv[1]) == \"-d\") {\n    std::ifstream file(kEngineFile, std::ios::binary);\n    if (file.good()) {\n      file.seekg(0, file.end);\n      size = file.tellg();\n      file.seekg(0, file.beg);\n      trtModelStream = new char[size];\n      assert(trtModelStream);\n      file.read(trtModelStream, size);\n      file.close();\n    }\n  } else {\n    return -1;\n  }\n  IRuntime *runtime = createInferRuntime(gLogger);\n  assert(runtime != nullptr);\n  ICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size);\n  assert(engine != nullptr);\n  IExecutionContext *context = engine->createExecutionContext();\n  assert(context != nullptr);\n  delete[] trtModelStream;\n\n  std::vector<std::string> file_names;\n  if (read_files_in_dir(argv[2], file_names) < 0) {\n    std::cout << \"read_files_in_dir failed.\" << std::endl;\n    return -1;\n  }\n\n  
std::vector<float> mean_value{0.406, 0.456, 0.485}; // BGR\n  std::vector<float> std_value{0.225, 0.224, 0.229};\n\n  int fcount = 0;\n\n  float *data = new float[kMaxInputImageSize];\n  float *prob = new float[kMaxOutputProbSize];\n\n  for (auto f : file_names) {\n    fcount++;\n    cv::Mat src_img = cv::imread(std::string(argv[2]) + \"/\" + f);\n    if (src_img.empty())\n      continue;\n\n    int i = 0;\n    for (int row = 0; row < src_img.rows; ++row) {\n      uchar *uc_pixel = src_img.data + row * src_img.step;\n      for (int col = 0; col < src_img.cols; ++col) {\n        data[i] = (uc_pixel[2] / 255.0 - mean_value[2]) / std_value[2];\n        data[i + src_img.rows * src_img.cols] =\n            (uc_pixel[1] / 255.0 - mean_value[1]) / std_value[1];\n        data[i + 2 * src_img.rows * src_img.cols] =\n            (uc_pixel[0] / 255.0 - mean_value[0]) / std_value[0];\n        uc_pixel += 3;\n        ++i;\n      }\n    }\n    // Run inference\n    auto start = std::chrono::system_clock::now();\n    doInference(*context, data, prob, src_img.rows, src_img.cols);\n    auto end = std::chrono::system_clock::now();\n    std::cout << \"detect time:\"\n              << std::chrono::duration_cast<std::chrono::milliseconds>(end -\n                                                                       start)\n                     .count()\n              << \"ms\" << std::endl;\n    float num = std::accumulate(\n        prob, prob + ((src_img.rows * src_img.cols) >> 6), 0.0f);\n\n    cv::Mat densityMap(src_img.rows >> 3, src_img.cols >> 3, CV_32FC1,\n                       (void *)prob);\n\n    cv::Mat densityMapScaled;\n    cv::normalize(densityMap, densityMapScaled, 0, 255, cv::NORM_MINMAX,\n                  CV_8UC1);\n    cv::Mat densityColorMap;\n    cv::applyColorMap(densityMapScaled, densityColorMap, cv::COLORMAP_VIRIDIS);\n\n    cv::resize(densityColorMap, densityColorMap, src_img.size());\n    cv::addWeighted(densityColorMap, 0.5, src_img, 0.5, 0, src_img);\n\n    
// write to jpg\n    cv::putText(src_img, std::string(\"people num: \") + std::to_string(num),\n                cv::Point(10, 50), cv::FONT_HERSHEY_SIMPLEX, 0.5,\n                cv::Scalar(255, 255, 255), 1);\n    std::string write_path = std::string(argv[2]) + \"result_\" + f;\n    std::cout << \"people num :\" << num << \" write_path: \" << write_path\n              << std::endl;\n    cv::imwrite(write_path, src_img);\n  }\n  delete[] data;\n  delete[] prob;\n\n  return 0;\n}"
  },
  {
    "path": "csrnet/gen_wts.py",
    "content": "from model import CSRNet\nimport torch\nimport os\nimport struct\n\n\n# write the weights to output/<script name>/csrnet.wts next to this script\nsave_path = os.path.join(os.path.dirname(\n    __file__), \"output\", os.path.basename(__file__).split('.')[0])\nos.makedirs(save_path, exist_ok=True)\nwts_file = os.path.join(save_path, \"csrnet.wts\")\n\n\n# load model\nmodel_path = \"partBmodel_best.pth.tar\"\nmodel = CSRNet()\ncheckpoint = torch.load(model_path)\nmodel.load_state_dict(checkpoint['state_dict'])\n\n\n# save to wts: first line is the tensor count, then one line per tensor\n# with its name, element count, and big-endian float32 values in hex\nprint(f'Writing into {wts_file}')\nwith open(wts_file, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')"
  },
  {
    "path": "csrnet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include \"macros.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\npublic:\n  LogStreamConsumerBuffer(std::ostream &stream, const std::string &prefix,\n                          bool shouldLog)\n      : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n  LogStreamConsumerBuffer(LogStreamConsumerBuffer &&other)\n      : mOutput(other.mOutput) {}\n\n  ~LogStreamConsumerBuffer() {\n    // std::streambuf::pbase() gives a pointer to the beginning of the buffered\n    // part of the output sequence std::streambuf::pptr() gives a pointer to the\n    // current position of the output sequence if the pointer to the beginning\n    // is not equal to the pointer to the current position, call putOutput() to\n    // log the output to the stream\n    if (pbase() != pptr()) {\n      putOutput();\n    }\n  }\n\n  // synchronizes the stream buffer and returns 0 on success\n  // synchronizing the stream buffer consists of inserting the buffer contents\n  // into the stream, resetting the buffer and flushing the stream\n  
virtual int sync() {\n    putOutput();\n    return 0;\n  }\n\n  void putOutput() {\n    if (mShouldLog) {\n      // prepend timestamp\n      std::time_t timestamp = std::time(nullptr);\n      tm *tm_local = std::localtime(&timestamp);\n      std::cout << \"[\";\n      std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon\n                << \"/\";\n      std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday\n                << \"/\";\n      std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year\n                << \"-\";\n      std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour\n                << \":\";\n      std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n      std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec\n                << \"] \";\n      // std::stringbuf::str() gets the string contents of the buffer\n      // insert the buffer contents pre-appended by the appropriate prefix into\n      // the stream\n      mOutput << mPrefix << str();\n      // set the buffer to empty\n      str(\"\");\n      // flush the stream\n      mOutput.flush();\n    }\n  }\n\n  void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\nprivate:\n  std::ostream &mOutput;\n  std::string mPrefix;\n  bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before\n//! std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\npublic:\n  LogStreamConsumerBase(std::ostream &stream, const std::string &prefix,\n                        bool shouldLog)\n      : mBuffer(stream, prefix, shouldLog) {}\n\nprotected:\n  LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when\n//! logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  
This is because the LogStreamConsumerBase class is used to initialize the\n//!  LogStreamConsumerBuffer member field in LogStreamConsumer and then the\n//!  address of the buffer is passed to std::ostream. This is necessary to\n//!  prevent the address of an uninitialized buffer from being passed to\n//!  std::ostream. Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\npublic:\n  //! \\brief Creates a LogStreamConsumer which logs messages with level\n  //! severity.\n  //!  Reportable severity determines if the messages are severe enough to be\n  //!  logged.\n  LogStreamConsumer(Severity reportableSeverity, Severity severity)\n      : LogStreamConsumerBase(severityOstream(severity),\n                              severityPrefix(severity),\n                              severity <= reportableSeverity),\n        std::ostream(&mBuffer) // links the stream buffer with the stream\n        ,\n        mShouldLog(severity <= reportableSeverity), mSeverity(severity) {}\n\n  LogStreamConsumer(LogStreamConsumer &&other)\n      : LogStreamConsumerBase(severityOstream(other.mSeverity),\n                              severityPrefix(other.mSeverity),\n                              other.mShouldLog),\n        std::ostream(&mBuffer) // links the stream buffer with the stream\n        ,\n        mShouldLog(other.mShouldLog), mSeverity(other.mSeverity) {}\n\n  void setReportableSeverity(Severity reportableSeverity) {\n    mShouldLog = mSeverity <= reportableSeverity;\n    mBuffer.setShouldLog(mShouldLog);\n  }\n\nprivate:\n  static std::ostream &severityOstream(Severity severity) {\n    return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n  }\n\n  static std::string severityPrefix(Severity severity) {\n    switch (severity) {\n    case Severity::kINTERNAL_ERROR:\n      return \"[F] \";\n    case Severity::kERROR:\n      return \"[E] \";\n    case Severity::kWARNING:\n      return \"[W] \";\n    case Severity::kINFO:\n      return \"[I] \";\n    case Severity::kVERBOSE:\n      return \"[V] \";\n    default:\n      assert(0);\n      return \"\";\n    }\n  }\n\n  bool mShouldLog;\n  Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and\n//! samples to log information to the console, and supports logging two types of\n//! messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or\n//! internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to\n//! emitting directly to stdout/stderr is that the logic for controlling the\n//! verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results\n//! to a file in some standard format (for example, JUnit XML), and providing\n//! additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits\n//! directly from the nvinfer1::ILogger interface, which is problematic since\n//! there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to\n//! access the ILogger) we can refactor the class to eliminate the inheritance\n//! and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger {\npublic:\n  Logger(Severity severity = Severity::kWARNING)\n      : mReportableSeverity(severity) {}\n\n  //!\n  //! \\enum TestResult\n  //! \\brief Represents the state of a given test\n  //!\n  enum class TestResult {\n    kRUNNING, //!< The test is running\n    kPASSED,  //!< The test passed\n    kFAILED,  //!< The test failed\n    kWAIVED   //!< The test was waived\n  };\n\n  //!\n  //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger\n  //! associated with this Logger \\return The nvinfer1::ILogger associated with\n  //! this Logger\n  //!\n  //! TODO Once all samples are updated to use this method to register the\n  //! logger with TensorRT, we can eliminate the inheritance of Logger from\n  //! ILogger\n  //!\n  nvinfer1::ILogger &getTRTLogger() { return *this; }\n\n  //!\n  //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n  //!\n  //! Note samples should not be calling this function directly; it will\n  //! eventually go away once we eliminate the inheritance from\n  //! nvinfer1::ILogger\n  //!\n  void log(Severity severity, const char *msg) TRT_NOEXCEPT override {\n    LogStreamConsumer(mReportableSeverity, severity)\n        << \"[TRT] \" << std::string(msg) << std::endl;\n  }\n\n  //!\n  //! \\brief Method for controlling the verbosity of logging output\n  //!\n  //! \\param severity The logger will only emit messages that have severity of\n  //! this level or higher.\n  //!\n  void setReportableSeverity(Severity severity) {\n    mReportableSeverity = severity;\n  }\n\n  //!\n  //! \\brief Opaque handle that holds logging information for a particular test\n  //!\n  //! This object is an opaque handle to information used by the Logger to print\n  //! test results. The sample must call Logger::defineTest() in order to obtain\n  //! 
a TestAtom that can be used with Logger::reportTest{Start,End}().\n  //!\n  class TestAtom {\n  public:\n    TestAtom(TestAtom &&) = default;\n\n  private:\n    friend class Logger;\n\n    TestAtom(bool started, const std::string &name, const std::string &cmdline)\n        : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n    bool mStarted;\n    std::string mName;\n    std::string mCmdline;\n  };\n\n  //!\n  //! \\brief Define a test for logging\n  //!\n  //! \\param[in] name The name of the test.  This should be a string starting\n  //! with\n  //!                  \"TensorRT\" and containing dot-separated strings\n  //!                  containing the characters [A-Za-z0-9_]. For example,\n  //!                  \"TensorRT.sample_googlenet\"\n  //! \\param[in] cmdline The command line used to reproduce the test\n  //\n  //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n  //!\n  static TestAtom defineTest(const std::string &name,\n                             const std::string &cmdline) {\n    return TestAtom(false, name, cmdline);\n  }\n\n  //!\n  //! \\brief A convenience overloaded version of defineTest() that accepts an\n  //! array of command-line arguments\n  //!        as input\n  //!\n  //! \\param[in] name The name of the test\n  //! \\param[in] argc The number of command-line arguments\n  //! \\param[in] argv The array of command-line arguments (given as C strings)\n  //!\n  //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n  static TestAtom defineTest(const std::string &name, int argc,\n                             char const *const *argv) {\n    auto cmdline = genCmdlineString(argc, argv);\n    return defineTest(name, cmdline);\n  }\n\n  //!\n  //! \\brief Report that a test has started.\n  //!\n  //! \\pre reportTestStart() has not been called yet for the given testAtom\n  //!\n  //! 
\\param[in] testAtom The handle to the test that has started\n  //!\n  static void reportTestStart(TestAtom &testAtom) {\n    reportTestResult(testAtom, TestResult::kRUNNING);\n    assert(!testAtom.mStarted);\n    testAtom.mStarted = true;\n  }\n\n  //!\n  //! \\brief Report that a test has ended.\n  //!\n  //! \\pre reportTestStart() has been called for the given testAtom\n  //!\n  //! \\param[in] testAtom The handle to the test that has ended\n  //! \\param[in] result The result of the test. Should be one of\n  //! TestResult::kPASSED,\n  //!                   TestResult::kFAILED, TestResult::kWAIVED\n  //!\n  static void reportTestEnd(const TestAtom &testAtom, TestResult result) {\n    assert(result != TestResult::kRUNNING);\n    assert(testAtom.mStarted);\n    reportTestResult(testAtom, result);\n  }\n\n  static int reportPass(const TestAtom &testAtom) {\n    reportTestEnd(testAtom, TestResult::kPASSED);\n    return EXIT_SUCCESS;\n  }\n\n  static int reportFail(const TestAtom &testAtom) {\n    reportTestEnd(testAtom, TestResult::kFAILED);\n    return EXIT_FAILURE;\n  }\n\n  static int reportWaive(const TestAtom &testAtom) {\n    reportTestEnd(testAtom, TestResult::kWAIVED);\n    return EXIT_SUCCESS;\n  }\n\n  static int reportTest(const TestAtom &testAtom, bool pass) {\n    return pass ? reportPass(testAtom) : reportFail(testAtom);\n  }\n\n  Severity getReportableSeverity() const { return mReportableSeverity; }\n\nprivate:\n  //!\n  //! \\brief returns an appropriate string for prefixing a log message with the\n  //! 
given severity\n  //!\n  static const char *severityPrefix(Severity severity) {\n    switch (severity) {\n    case Severity::kINTERNAL_ERROR:\n      return \"[F] \";\n    case Severity::kERROR:\n      return \"[E] \";\n    case Severity::kWARNING:\n      return \"[W] \";\n    case Severity::kINFO:\n      return \"[I] \";\n    case Severity::kVERBOSE:\n      return \"[V] \";\n    default:\n      assert(0);\n      return \"\";\n    }\n  }\n\n  //!\n  //! \\brief returns an appropriate string for prefixing a test result message\n  //! with the given result\n  //!\n  static const char *testResultString(TestResult result) {\n    switch (result) {\n    case TestResult::kRUNNING:\n      return \"RUNNING\";\n    case TestResult::kPASSED:\n      return \"PASSED\";\n    case TestResult::kFAILED:\n      return \"FAILED\";\n    case TestResult::kWAIVED:\n      return \"WAIVED\";\n    default:\n      assert(0);\n      return \"\";\n    }\n  }\n\n  //!\n  //! \\brief returns an appropriate output stream (cout or cerr) to use with the\n  //! given severity\n  //!\n  static std::ostream &severityOstream(Severity severity) {\n    return severity >= Severity::kINFO ? std::cout : std::cerr;\n  }\n\n  //!\n  //! \\brief method that implements logging test results\n  //!\n  static void reportTestResult(const TestAtom &testAtom, TestResult result) {\n    severityOstream(Severity::kINFO)\n        << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n        << testAtom.mCmdline << std::endl;\n  }\n\n  //!\n  //! \\brief generate a command line string from the given (argc, argv) values\n  //!\n  static std::string genCmdlineString(int argc, char const *const *argv) {\n    std::stringstream ss;\n    for (int i = 0; i < argc; i++) {\n      if (i > 0)\n        ss << \" \";\n      ss << argv[i];\n    }\n    return ss.str();\n  }\n\n  Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! 
\\brief produces a LogStreamConsumer object that can be used to log messages\n//! of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger &logger) {\n  return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages\n//! of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger &logger) {\n  return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages\n//! of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger &logger) {\n  return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages\n//! of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger &logger) {\n  return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages\n//! of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger &logger) {\n  return LogStreamConsumer(logger.getReportableSeverity(),\n                           Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "csrnet/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "dbnet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(dbnet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use the static CUDA runtime\" OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\naux_source_directory(. DIRSRCS)\n\n# clipper\ninclude_directories(./ ./clipper)\nadd_subdirectory(clipper)\n\nadd_executable(dbnet ${DIRSRCS})\ntarget_link_libraries(dbnet clipper)\ntarget_link_libraries(dbnet nvinfer)\ntarget_link_libraries(dbnet cudart)\ntarget_link_libraries(dbnet ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "dbnet/README.md",
    "content": "# DBNet\n\nThe PyTorch implementation is [DBNet](https://github.com/BaofengZan/DBNet.pytorch).\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/25873202/113959270-1eb8c600-9855-11eb-9c4d-1e6dc8e38a17.jpg\">\n</p>\n\n\n\n## How to Run\n\n* 1. Generate `.wts`\n\n  Download the code and model from [DBNet](https://github.com/BaofengZan/DBNet.pytorch) and configure your environment.\n\n  In `tools/predict.py`, set `--save_wts` to `True`, then run it; `DBNet.wts` will be generated.\n\n  ONNX can also be exported by setting `--onnx` to `True`.\n\n* 2. Build with cmake and make, then run\n\n  ```\n  mkdir build\n  cd build\n  cmake ..\n  make\n  cp /your_wts_path/DBNet.wts .\n  sudo ./dbnet -s             // serialize model to plan file, i.e. 'DBNet.engine'\n  sudo ./dbnet -d  ./test_imgs // deserialize plan file and run inference; all images in the test_imgs folder will be processed.\n  ```\n\n\n\n## For Windows\n\nhttps://github.com/BaofengZan/DBNet-TensorRT\n\n\n\n## Todo\n\n- [x] 1. In `common.hpp`, the following two functions can be merged.\n\n     ```c++\n     ILayer* convBnLeaky(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int g, std::string lname, bool bias = true)\n     ```\n\n     ```c++\n     ILayer* convBnLeaky2(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int g, std::string lname, bool bias = true)\n     ```\n\n- [x] 2. The postprocessing here differs slightly from the PyTorch side and should be optimized.\n\n- [x] 3. The input image here is resized directly to `640 x 640`, while the PyTorch side uses the `letterbox` method.\n\n"
  },
  {
    "path": "dbnet/clipper/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\naux_source_directory(. DIR_CLIPPER_SRCS)\nadd_library(clipper ${DIR_CLIPPER_SRCS})"
  },
  {
    "path": "dbnet/clipper/clipper.cpp",
    "content": "/*******************************************************************************\n*                                                                              *\n* Author    :  Angus Johnson                                                   *\n* Version   :  6.4.2                                                           *\n* Date      :  27 February 2017                                                *\n* Website   :  http://www.angusj.com                                           *\n* Copyright :  Angus Johnson 2010-2017                                         *\n*                                                                              *\n* License:                                                                     *\n* Use, modification & distribution is subject to Boost Software License Ver 1. *\n* http://www.boost.org/LICENSE_1_0.txt                                         *\n*                                                                              *\n* Attributions:                                                                *\n* The code in this library is an extension of Bala Vatti's clipping algorithm: *\n* \"A generic solution to polygon clipping\"                                     *\n* Communications of the ACM, Vol 35, Issue 7 (July 1992) pp 56-63.             *\n* http://portal.acm.org/citation.cfm?id=129906                                 *\n*                                                                              *\n* Computer graphics and geometric modeling: implementation and algorithms      *\n* By Max K. 
Agoston                                                            *\n* Springer; 1 edition (January 4, 2005)                                        *\n* http://books.google.com/books?q=vatti+clipping+agoston                       *\n*                                                                              *\n* See also:                                                                    *\n* \"Polygon Offsetting by Computing Winding Numbers\"                            *\n* Paper no. DETC2005-85513 pp. 565-575                                         *\n* ASME 2005 International Design Engineering Technical Conferences             *\n* and Computers and Information in Engineering Conference (IDETC/CIE2005)      *\n* September 24-28, 2005 , Long Beach, California, USA                          *\n* http://www.me.berkeley.edu/~mcmains/pubs/DAC05OffsetPolygon.pdf              *\n*                                                                              *\n*******************************************************************************/\n\n/*******************************************************************************\n*                                                                              *\n* This is a translation of the Delphi Clipper library and the naming style     *\n* used has retained a Delphi flavour.                                          
*\n*                                                                              *\n*******************************************************************************/\n\n#include \"clipper.hpp\"\n#include <cmath>\n#include <vector>\n#include <algorithm>\n#include <stdexcept>\n#include <cstring>\n#include <cstdlib>\n#include <ostream>\n#include <functional>\n\nnamespace ClipperLib {\n\nstatic double const pi = 3.141592653589793238;\nstatic double const two_pi = pi *2;\nstatic double const def_arc_tolerance = 0.25;\n\nenum Direction { dRightToLeft, dLeftToRight };\n\nstatic int const Unassigned = -1;  //edge not currently 'owning' a solution\nstatic int const Skip = -2;        //edge that would otherwise close a path\n\n#define HORIZONTAL (-1.0E+40)\n#define TOLERANCE (1.0e-20)\n#define NEAR_ZERO(val) (((val) > -TOLERANCE) && ((val) < TOLERANCE))\n\nstruct TEdge {\n  IntPoint Bot;\n  IntPoint Curr; //current (updated for every new scanbeam)\n  IntPoint Top;\n  double Dx;\n  PolyType PolyTyp;\n  EdgeSide Side; //side only refers to current side of solution poly\n  int WindDelta; //1 or -1 depending on winding direction\n  int WindCnt;\n  int WindCnt2; //winding count of the opposite polytype\n  int OutIdx;\n  TEdge *Next;\n  TEdge *Prev;\n  TEdge *NextInLML;\n  TEdge *NextInAEL;\n  TEdge *PrevInAEL;\n  TEdge *NextInSEL;\n  TEdge *PrevInSEL;\n};\n\nstruct IntersectNode {\n  TEdge          *Edge1;\n  TEdge          *Edge2;\n  IntPoint        Pt;\n};\n\nstruct LocalMinimum {\n  cInt          Y;\n  TEdge        *LeftBound;\n  TEdge        *RightBound;\n};\n\nstruct OutPt;\n\n//OutRec: contains a path in the clipping solution. 
Edges in the AEL will\n//carry a pointer to an OutRec when they are part of the clipping solution.\nstruct OutRec {\n  int       Idx;\n  bool      IsHole;\n  bool      IsOpen;\n  OutRec   *FirstLeft;  //see comments in clipper.pas\n  PolyNode *PolyNd;\n  OutPt    *Pts;\n  OutPt    *BottomPt;\n};\n\nstruct OutPt {\n  int       Idx;\n  IntPoint  Pt;\n  OutPt    *Next;\n  OutPt    *Prev;\n};\n\nstruct Join {\n  OutPt    *OutPt1;\n  OutPt    *OutPt2;\n  IntPoint  OffPt;\n};\n\nstruct LocMinSorter\n{\n  inline bool operator()(const LocalMinimum& locMin1, const LocalMinimum& locMin2)\n  {\n    return locMin2.Y < locMin1.Y;\n  }\n};\n\n//------------------------------------------------------------------------------\n//------------------------------------------------------------------------------\n\ninline cInt Round(double val)\n{\n  if ((val < 0)) return static_cast<cInt>(val - 0.5); \n  else return static_cast<cInt>(val + 0.5);\n}\n//------------------------------------------------------------------------------\n\ninline cInt Abs(cInt val)\n{\n  return val < 0 ? 
-val : val;\n}\n\n//------------------------------------------------------------------------------\n// PolyTree methods ...\n//------------------------------------------------------------------------------\n\nvoid PolyTree::Clear()\n{\n    for (PolyNodes::size_type i = 0; i < AllNodes.size(); ++i)\n      delete AllNodes[i];\n    AllNodes.resize(0); \n    Childs.resize(0);\n}\n//------------------------------------------------------------------------------\n\nPolyNode* PolyTree::GetFirst() const\n{\n  if (!Childs.empty())\n      return Childs[0];\n  else\n      return 0;\n}\n//------------------------------------------------------------------------------\n\nint PolyTree::Total() const\n{\n  int result = (int)AllNodes.size();\n  //with negative offsets, ignore the hidden outer polygon ...\n  if (result > 0 && Childs[0] != AllNodes[0]) result--;\n  return result;\n}\n\n//------------------------------------------------------------------------------\n// PolyNode methods ...\n//------------------------------------------------------------------------------\n\nPolyNode::PolyNode(): Parent(0), Index(0), m_IsOpen(false)\n{\n}\n//------------------------------------------------------------------------------\n\nint PolyNode::ChildCount() const\n{\n  return (int)Childs.size();\n}\n//------------------------------------------------------------------------------\n\nvoid PolyNode::AddChild(PolyNode& child)\n{\n  unsigned cnt = (unsigned)Childs.size();\n  Childs.push_back(&child);\n  child.Parent = this;\n  child.Index = cnt;\n}\n//------------------------------------------------------------------------------\n\nPolyNode* PolyNode::GetNext() const\n{ \n  if (!Childs.empty()) \n      return Childs[0]; \n  else\n      return GetNextSiblingUp();    \n}  \n//------------------------------------------------------------------------------\n\nPolyNode* PolyNode::GetNextSiblingUp() const\n{ \n  if (!Parent) //protects against PolyTree.GetNextSiblingUp()\n      return 0;\n  else if (Index 
== Parent->Childs.size() - 1)\n      return Parent->GetNextSiblingUp();\n  else\n      return Parent->Childs[Index + 1];\n}  \n//------------------------------------------------------------------------------\n\nbool PolyNode::IsHole() const\n{ \n  bool result = true;\n  PolyNode* node = Parent;\n  while (node)\n  {\n      result = !result;\n      node = node->Parent;\n  }\n  return result;\n}  \n//------------------------------------------------------------------------------\n\nbool PolyNode::IsOpen() const\n{ \n  return m_IsOpen;\n}  \n//------------------------------------------------------------------------------\n\n#ifndef use_int32\n\n//------------------------------------------------------------------------------\n// Int128 class (enables safe math on signed 64bit integers)\n// eg Int128 val1((long64)9223372036854775807); //ie 2^63 -1\n//    Int128 val2((long64)9223372036854775807);\n//    Int128 val3 = val1 * val2;\n//    val3.AsString => \"85070591730234615847396907784232501249\" (8.5e+37)\n//------------------------------------------------------------------------------\n\nclass Int128\n{\n  public:\n    ulong64 lo;\n    long64 hi;\n\n    Int128(long64 _lo = 0)\n    {\n      lo = (ulong64)_lo;   \n      if (_lo < 0)  hi = -1; else hi = 0; \n    }\n\n\n    Int128(const Int128 &val): lo(val.lo), hi(val.hi){}\n\n    Int128(const long64& _hi, const ulong64& _lo): lo(_lo), hi(_hi){}\n    \n    Int128& operator = (const long64 &val)\n    {\n      lo = (ulong64)val;\n      if (val < 0) hi = -1; else hi = 0;\n      return *this;\n    }\n\n    bool operator == (const Int128 &val) const\n      {return (hi == val.hi && lo == val.lo);}\n\n    bool operator != (const Int128 &val) const\n      { return !(*this == val);}\n\n    bool operator > (const Int128 &val) const\n    {\n      if (hi != val.hi)\n        return hi > val.hi;\n      else\n        return lo > val.lo;\n    }\n\n    bool operator < (const Int128 &val) const\n    {\n      if (hi != val.hi)\n        return 
hi < val.hi;\n      else\n        return lo < val.lo;\n    }\n\n    bool operator >= (const Int128 &val) const\n      { return !(*this < val);}\n\n    bool operator <= (const Int128 &val) const\n      { return !(*this > val);}\n\n    Int128& operator += (const Int128 &rhs)\n    {\n      hi += rhs.hi;\n      lo += rhs.lo;\n      if (lo < rhs.lo) hi++;\n      return *this;\n    }\n\n    Int128 operator + (const Int128 &rhs) const\n    {\n      Int128 result(*this);\n      result+= rhs;\n      return result;\n    }\n\n    Int128& operator -= (const Int128 &rhs)\n    {\n      *this += -rhs;\n      return *this;\n    }\n\n    Int128 operator - (const Int128 &rhs) const\n    {\n      Int128 result(*this);\n      result -= rhs;\n      return result;\n    }\n\n    Int128 operator-() const //unary negation\n    {\n      if (lo == 0)\n        return Int128(-hi, 0);\n      else\n        return Int128(~hi, ~lo + 1);\n    }\n\n    operator double() const\n    {\n      const double shift64 = 18446744073709551616.0; //2^64\n      if (hi < 0)\n      {\n        if (lo == 0) return (double)hi * shift64;\n        else return -(double)(~lo + ~hi * shift64);\n      }\n      else\n        return (double)(lo + hi * shift64);\n    }\n\n};\n//------------------------------------------------------------------------------\n\nInt128 Int128Mul (long64 lhs, long64 rhs)\n{\n  bool negate = (lhs < 0) != (rhs < 0);\n\n  if (lhs < 0) lhs = -lhs;\n  ulong64 int1Hi = ulong64(lhs) >> 32;\n  ulong64 int1Lo = ulong64(lhs & 0xFFFFFFFF);\n\n  if (rhs < 0) rhs = -rhs;\n  ulong64 int2Hi = ulong64(rhs) >> 32;\n  ulong64 int2Lo = ulong64(rhs & 0xFFFFFFFF);\n\n  //nb: see comments in clipper.pas\n  ulong64 a = int1Hi * int2Hi;\n  ulong64 b = int1Lo * int2Lo;\n  ulong64 c = int1Hi * int2Lo + int1Lo * int2Hi;\n\n  Int128 tmp;\n  tmp.hi = long64(a + (c >> 32));\n  tmp.lo = long64(c << 32);\n  tmp.lo += long64(b);\n  if (tmp.lo < b) tmp.hi++;\n  if (negate) tmp = -tmp;\n  return 
tmp;\n}\n#endif\n\n//------------------------------------------------------------------------------\n// Miscellaneous global functions\n//------------------------------------------------------------------------------\n\nbool Orientation(const Path &poly)\n{\n    return Area(poly) >= 0;\n}\n//------------------------------------------------------------------------------\n\ndouble Area(const Path &poly)\n{\n  int size = (int)poly.size();\n  if (size < 3) return 0;\n\n  double a = 0;\n  for (int i = 0, j = size -1; i < size; ++i)\n  {\n    a += ((double)poly[j].X + poly[i].X) * ((double)poly[j].Y - poly[i].Y);\n    j = i;\n  }\n  return -a * 0.5;\n}\n//------------------------------------------------------------------------------\n\ndouble Area(const OutPt *op)\n{\n  const OutPt *startOp = op;\n  if (!op) return 0;\n  double a = 0;\n  do {\n    a +=  (double)(op->Prev->Pt.X + op->Pt.X) * (double)(op->Prev->Pt.Y - op->Pt.Y);\n    op = op->Next;\n  } while (op != startOp);\n  return a * 0.5;\n}\n//------------------------------------------------------------------------------\n\ndouble Area(const OutRec &outRec)\n{\n  return Area(outRec.Pts);\n}\n//------------------------------------------------------------------------------\n\nbool PointIsVertex(const IntPoint &Pt, OutPt *pp)\n{\n  OutPt *pp2 = pp;\n  do\n  {\n    if (pp2->Pt == Pt) return true;\n    pp2 = pp2->Next;\n  }\n  while (pp2 != pp);\n  return false;\n}\n//------------------------------------------------------------------------------\n\n//See \"The Point in Polygon Problem for Arbitrary Polygons\" by Hormann & Agathos\n//http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.5498&rep=rep1&type=pdf\nint PointInPolygon(const IntPoint &pt, const Path &path)\n{\n  //returns 0 if false, +1 if true, -1 if pt ON polygon boundary\n  int result = 0;\n  size_t cnt = path.size();\n  if (cnt < 3) return 0;\n  IntPoint ip = path[0];\n  for(size_t i = 1; i <= cnt; ++i)\n  {\n    IntPoint ipNext = (i == cnt ? 
path[0] : path[i]);\n    if (ipNext.Y == pt.Y)\n    {\n        if ((ipNext.X == pt.X) || (ip.Y == pt.Y && \n          ((ipNext.X > pt.X) == (ip.X < pt.X)))) return -1;\n    }\n    if ((ip.Y < pt.Y) != (ipNext.Y < pt.Y))\n    {\n      if (ip.X >= pt.X)\n      {\n        if (ipNext.X > pt.X) result = 1 - result;\n        else\n        {\n          double d = (double)(ip.X - pt.X) * (ipNext.Y - pt.Y) - \n            (double)(ipNext.X - pt.X) * (ip.Y - pt.Y);\n          if (!d) return -1;\n          if ((d > 0) == (ipNext.Y > ip.Y)) result = 1 - result;\n        }\n      } else\n      {\n        if (ipNext.X > pt.X)\n        {\n          double d = (double)(ip.X - pt.X) * (ipNext.Y - pt.Y) - \n            (double)(ipNext.X - pt.X) * (ip.Y - pt.Y);\n          if (!d) return -1;\n          if ((d > 0) == (ipNext.Y > ip.Y)) result = 1 - result;\n        }\n      }\n    }\n    ip = ipNext;\n  } \n  return result;\n}\n//------------------------------------------------------------------------------\n\nint PointInPolygon (const IntPoint &pt, OutPt *op)\n{\n  //returns 0 if false, +1 if true, -1 if pt ON polygon boundary\n  int result = 0;\n  OutPt* startOp = op;\n  for(;;)\n  {\n    if (op->Next->Pt.Y == pt.Y)\n    {\n        if ((op->Next->Pt.X == pt.X) || (op->Pt.Y == pt.Y && \n          ((op->Next->Pt.X > pt.X) == (op->Pt.X < pt.X)))) return -1;\n    }\n    if ((op->Pt.Y < pt.Y) != (op->Next->Pt.Y < pt.Y))\n    {\n      if (op->Pt.X >= pt.X)\n      {\n        if (op->Next->Pt.X > pt.X) result = 1 - result;\n        else\n        {\n          double d = (double)(op->Pt.X - pt.X) * (op->Next->Pt.Y - pt.Y) - \n            (double)(op->Next->Pt.X - pt.X) * (op->Pt.Y - pt.Y);\n          if (!d) return -1;\n          if ((d > 0) == (op->Next->Pt.Y > op->Pt.Y)) result = 1 - result;\n        }\n      } else\n      {\n        if (op->Next->Pt.X > pt.X)\n        {\n          double d = (double)(op->Pt.X - pt.X) * (op->Next->Pt.Y - pt.Y) - \n            (double)(op->Next->Pt.X - 
pt.X) * (op->Pt.Y - pt.Y);\n          if (!d) return -1;\n          if ((d > 0) == (op->Next->Pt.Y > op->Pt.Y)) result = 1 - result;\n        }\n      }\n    } \n    op = op->Next;\n    if (startOp == op) break;\n  } \n  return result;\n}\n//------------------------------------------------------------------------------\n\nbool Poly2ContainsPoly1(OutPt *OutPt1, OutPt *OutPt2)\n{\n  OutPt* op = OutPt1;\n  do\n  {\n    //nb: PointInPolygon returns 0 if false, +1 if true, -1 if pt on polygon\n    int res = PointInPolygon(op->Pt, OutPt2);\n    if (res >= 0) return res > 0;\n    op = op->Next; \n  }\n  while (op != OutPt1);\n  return true; \n}\n//----------------------------------------------------------------------\n\nbool SlopesEqual(const TEdge &e1, const TEdge &e2, bool UseFullInt64Range)\n{\n#ifndef use_int32\n  if (UseFullInt64Range)\n    return Int128Mul(e1.Top.Y - e1.Bot.Y, e2.Top.X - e2.Bot.X) == \n    Int128Mul(e1.Top.X - e1.Bot.X, e2.Top.Y - e2.Bot.Y);\n  else \n#endif\n    return (e1.Top.Y - e1.Bot.Y) * (e2.Top.X - e2.Bot.X) == \n    (e1.Top.X - e1.Bot.X) * (e2.Top.Y - e2.Bot.Y);\n}\n//------------------------------------------------------------------------------\n\nbool SlopesEqual(const IntPoint pt1, const IntPoint pt2,\n  const IntPoint pt3, bool UseFullInt64Range)\n{\n#ifndef use_int32\n  if (UseFullInt64Range)\n    return Int128Mul(pt1.Y-pt2.Y, pt2.X-pt3.X) == Int128Mul(pt1.X-pt2.X, pt2.Y-pt3.Y);\n  else \n#endif\n    return (pt1.Y-pt2.Y)*(pt2.X-pt3.X) == (pt1.X-pt2.X)*(pt2.Y-pt3.Y);\n}\n//------------------------------------------------------------------------------\n\nbool SlopesEqual(const IntPoint pt1, const IntPoint pt2,\n  const IntPoint pt3, const IntPoint pt4, bool UseFullInt64Range)\n{\n#ifndef use_int32\n  if (UseFullInt64Range)\n    return Int128Mul(pt1.Y-pt2.Y, pt3.X-pt4.X) == Int128Mul(pt1.X-pt2.X, pt3.Y-pt4.Y);\n  else \n#endif\n    return (pt1.Y-pt2.Y)*(pt3.X-pt4.X) == 
(pt1.X-pt2.X)*(pt3.Y-pt4.Y);\n}\n//------------------------------------------------------------------------------\n\ninline bool IsHorizontal(TEdge &e)\n{\n  return e.Dx == HORIZONTAL;\n}\n//------------------------------------------------------------------------------\n\ninline double GetDx(const IntPoint pt1, const IntPoint pt2)\n{\n  return (pt1.Y == pt2.Y) ?\n    HORIZONTAL : (double)(pt2.X - pt1.X) / (pt2.Y - pt1.Y);\n}\n//---------------------------------------------------------------------------\n\ninline void SetDx(TEdge &e)\n{\n  cInt dy  = (e.Top.Y - e.Bot.Y);\n  if (dy == 0) e.Dx = HORIZONTAL;\n  else e.Dx = (double)(e.Top.X - e.Bot.X) / dy;\n}\n//---------------------------------------------------------------------------\n\ninline void SwapSides(TEdge &Edge1, TEdge &Edge2)\n{\n  EdgeSide Side =  Edge1.Side;\n  Edge1.Side = Edge2.Side;\n  Edge2.Side = Side;\n}\n//------------------------------------------------------------------------------\n\ninline void SwapPolyIndexes(TEdge &Edge1, TEdge &Edge2)\n{\n  int OutIdx =  Edge1.OutIdx;\n  Edge1.OutIdx = Edge2.OutIdx;\n  Edge2.OutIdx = OutIdx;\n}\n//------------------------------------------------------------------------------\n\ninline cInt TopX(TEdge &edge, const cInt currentY)\n{\n  return ( currentY == edge.Top.Y ) ?\n    edge.Top.X : edge.Bot.X + Round(edge.Dx *(currentY - edge.Bot.Y));\n}\n//------------------------------------------------------------------------------\n\nvoid IntersectPoint(TEdge &Edge1, TEdge &Edge2, IntPoint &ip)\n{\n#ifdef use_xyz  \n  ip.Z = 0;\n#endif\n\n  double b1, b2;\n  if (Edge1.Dx == Edge2.Dx)\n  {\n    ip.Y = Edge1.Curr.Y;\n    ip.X = TopX(Edge1, ip.Y);\n    return;\n  }\n  else if (Edge1.Dx == 0)\n  {\n    ip.X = Edge1.Bot.X;\n    if (IsHorizontal(Edge2))\n      ip.Y = Edge2.Bot.Y;\n    else\n    {\n      b2 = Edge2.Bot.Y - (Edge2.Bot.X / Edge2.Dx);\n      ip.Y = Round(ip.X / Edge2.Dx + b2);\n    }\n  }\n  else if (Edge2.Dx == 0)\n  {\n    ip.X = Edge2.Bot.X;\n    if 
(IsHorizontal(Edge1))\n      ip.Y = Edge1.Bot.Y;\n    else\n    {\n      b1 = Edge1.Bot.Y - (Edge1.Bot.X / Edge1.Dx);\n      ip.Y = Round(ip.X / Edge1.Dx + b1);\n    }\n  } \n  else \n  {\n    b1 = Edge1.Bot.X - Edge1.Bot.Y * Edge1.Dx;\n    b2 = Edge2.Bot.X - Edge2.Bot.Y * Edge2.Dx;\n    double q = (b2-b1) / (Edge1.Dx - Edge2.Dx);\n    ip.Y = Round(q);\n    if (std::fabs(Edge1.Dx) < std::fabs(Edge2.Dx))\n      ip.X = Round(Edge1.Dx * q + b1);\n    else \n      ip.X = Round(Edge2.Dx * q + b2);\n  }\n\n  if (ip.Y < Edge1.Top.Y || ip.Y < Edge2.Top.Y) \n  {\n    if (Edge1.Top.Y > Edge2.Top.Y)\n      ip.Y = Edge1.Top.Y;\n    else\n      ip.Y = Edge2.Top.Y;\n    if (std::fabs(Edge1.Dx) < std::fabs(Edge2.Dx))\n      ip.X = TopX(Edge1, ip.Y);\n    else\n      ip.X = TopX(Edge2, ip.Y);\n  } \n  //finally, don't allow 'ip' to be BELOW curr.Y (ie bottom of scanbeam) ...\n  if (ip.Y > Edge1.Curr.Y)\n  {\n    ip.Y = Edge1.Curr.Y;\n    //use the more vertical edge to derive X ...\n    if (std::fabs(Edge1.Dx) > std::fabs(Edge2.Dx))\n      ip.X = TopX(Edge2, ip.Y); else\n      ip.X = TopX(Edge1, ip.Y);\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid ReversePolyPtLinks(OutPt *pp)\n{\n  if (!pp) return;\n  OutPt *pp1, *pp2;\n  pp1 = pp;\n  do {\n  pp2 = pp1->Next;\n  pp1->Next = pp1->Prev;\n  pp1->Prev = pp2;\n  pp1 = pp2;\n  } while( pp1 != pp );\n}\n//------------------------------------------------------------------------------\n\nvoid DisposeOutPts(OutPt*& pp)\n{\n  if (pp == 0) return;\n    pp->Prev->Next = 0;\n  while( pp )\n  {\n    OutPt *tmpPp = pp;\n    pp = pp->Next;\n    delete tmpPp;\n  }\n}\n//------------------------------------------------------------------------------\n\ninline void InitEdge(TEdge* e, TEdge* eNext, TEdge* ePrev, const IntPoint& Pt)\n{\n  std::memset(e, 0, sizeof(TEdge));\n  e->Next = eNext;\n  e->Prev = ePrev;\n  e->Curr = Pt;\n  e->OutIdx = 
Unassigned;\n}\n//------------------------------------------------------------------------------\n\nvoid InitEdge2(TEdge& e, PolyType Pt)\n{\n  if (e.Curr.Y >= e.Next->Curr.Y)\n  {\n    e.Bot = e.Curr;\n    e.Top = e.Next->Curr;\n  } else\n  {\n    e.Top = e.Curr;\n    e.Bot = e.Next->Curr;\n  }\n  SetDx(e);\n  e.PolyTyp = Pt;\n}\n//------------------------------------------------------------------------------\n\nTEdge* RemoveEdge(TEdge* e)\n{\n  //removes e from double_linked_list (but without removing from memory)\n  e->Prev->Next = e->Next;\n  e->Next->Prev = e->Prev;\n  TEdge* result = e->Next;\n  e->Prev = 0; //flag as removed (see ClipperBase.Clear)\n  return result;\n}\n//------------------------------------------------------------------------------\n\ninline void ReverseHorizontal(TEdge &e)\n{\n  //swap horizontal edges' Top and Bottom x's so they follow the natural\n  //progression of the bounds - ie so their xbots will align with the\n  //adjoining lower edge. [Helpful in the ProcessHorizontal() method.]\n  std::swap(e.Top.X, e.Bot.X);\n#ifdef use_xyz  \n  std::swap(e.Top.Z, e.Bot.Z);\n#endif\n}\n//------------------------------------------------------------------------------\n\nvoid SwapPoints(IntPoint &pt1, IntPoint &pt2)\n{\n  IntPoint tmp = pt1;\n  pt1 = pt2;\n  pt2 = tmp;\n}\n//------------------------------------------------------------------------------\n\nbool GetOverlapSegment(IntPoint pt1a, IntPoint pt1b, IntPoint pt2a,\n  IntPoint pt2b, IntPoint &pt1, IntPoint &pt2)\n{\n  //precondition: segments are Collinear.\n  if (Abs(pt1a.X - pt1b.X) > Abs(pt1a.Y - pt1b.Y))\n  {\n    if (pt1a.X > pt1b.X) SwapPoints(pt1a, pt1b);\n    if (pt2a.X > pt2b.X) SwapPoints(pt2a, pt2b);\n    if (pt1a.X > pt2a.X) pt1 = pt1a; else pt1 = pt2a;\n    if (pt1b.X < pt2b.X) pt2 = pt1b; else pt2 = pt2b;\n    return pt1.X < pt2.X;\n  } else\n  {\n    if (pt1a.Y < pt1b.Y) SwapPoints(pt1a, pt1b);\n    if (pt2a.Y < pt2b.Y) SwapPoints(pt2a, pt2b);\n    if (pt1a.Y < pt2a.Y) pt1 = 
pt1a; else pt1 = pt2a;\n    if (pt1b.Y > pt2b.Y) pt2 = pt1b; else pt2 = pt2b;\n    return pt1.Y > pt2.Y;\n  }\n}\n//------------------------------------------------------------------------------\n\nbool FirstIsBottomPt(const OutPt* btmPt1, const OutPt* btmPt2)\n{\n  OutPt *p = btmPt1->Prev;\n  while ((p->Pt == btmPt1->Pt) && (p != btmPt1)) p = p->Prev;\n  double dx1p = std::fabs(GetDx(btmPt1->Pt, p->Pt));\n  p = btmPt1->Next;\n  while ((p->Pt == btmPt1->Pt) && (p != btmPt1)) p = p->Next;\n  double dx1n = std::fabs(GetDx(btmPt1->Pt, p->Pt));\n\n  p = btmPt2->Prev;\n  while ((p->Pt == btmPt2->Pt) && (p != btmPt2)) p = p->Prev;\n  double dx2p = std::fabs(GetDx(btmPt2->Pt, p->Pt));\n  p = btmPt2->Next;\n  while ((p->Pt == btmPt2->Pt) && (p != btmPt2)) p = p->Next;\n  double dx2n = std::fabs(GetDx(btmPt2->Pt, p->Pt));\n\n  if (std::max(dx1p, dx1n) == std::max(dx2p, dx2n) &&\n    std::min(dx1p, dx1n) == std::min(dx2p, dx2n))\n      return Area(btmPt1) > 0; //if otherwise identical use orientation\n  else\n    return (dx1p >= dx2p && dx1p >= dx2n) || (dx1n >= dx2p && dx1n >= dx2n);\n}\n//------------------------------------------------------------------------------\n\nOutPt* GetBottomPt(OutPt *pp)\n{\n  OutPt* dups = 0;\n  OutPt* p = pp->Next;\n  while (p != pp)\n  {\n    if (p->Pt.Y > pp->Pt.Y)\n    {\n      pp = p;\n      dups = 0;\n    }\n    else if (p->Pt.Y == pp->Pt.Y && p->Pt.X <= pp->Pt.X)\n    {\n      if (p->Pt.X < pp->Pt.X)\n      {\n        dups = 0;\n        pp = p;\n      } else\n      {\n        if (p->Next != pp && p->Prev != pp) dups = p;\n      }\n    }\n    p = p->Next;\n  }\n  if (dups)\n  {\n    //there appears to be at least 2 vertices at BottomPt so ...\n    while (dups != p)\n    {\n      if (!FirstIsBottomPt(p, dups)) pp = dups;\n      dups = dups->Next;\n      while (dups->Pt != pp->Pt) dups = dups->Next;\n    }\n  }\n  return pp;\n}\n//------------------------------------------------------------------------------\n\nbool 
Pt2IsBetweenPt1AndPt3(const IntPoint pt1,\n  const IntPoint pt2, const IntPoint pt3)\n{\n  if ((pt1 == pt3) || (pt1 == pt2) || (pt3 == pt2))\n    return false;\n  else if (pt1.X != pt3.X)\n    return (pt2.X > pt1.X) == (pt2.X < pt3.X);\n  else\n    return (pt2.Y > pt1.Y) == (pt2.Y < pt3.Y);\n}\n//------------------------------------------------------------------------------\n\nbool HorzSegmentsOverlap(cInt seg1a, cInt seg1b, cInt seg2a, cInt seg2b)\n{\n  if (seg1a > seg1b) std::swap(seg1a, seg1b);\n  if (seg2a > seg2b) std::swap(seg2a, seg2b);\n  return (seg1a < seg2b) && (seg2a < seg1b);\n}\n\n//------------------------------------------------------------------------------\n// ClipperBase class methods ...\n//------------------------------------------------------------------------------\n\nClipperBase::ClipperBase() //constructor\n{\n  m_CurrentLM = m_MinimaList.begin(); //begin() == end() here\n  m_UseFullRange = false;\n}\n//------------------------------------------------------------------------------\n\nClipperBase::~ClipperBase() //destructor\n{\n  Clear();\n}\n//------------------------------------------------------------------------------\n\nvoid RangeTest(const IntPoint& Pt, bool& useFullRange)\n{\n  if (useFullRange)\n  {\n    if (Pt.X > hiRange || Pt.Y > hiRange || -Pt.X > hiRange || -Pt.Y > hiRange) \n      throw clipperException(\"Coordinate outside allowed range\");\n  }\n  else if (Pt.X > loRange|| Pt.Y > loRange || -Pt.X > loRange || -Pt.Y > loRange) \n  {\n    useFullRange = true;\n    RangeTest(Pt, useFullRange);\n  }\n}\n//------------------------------------------------------------------------------\n\nTEdge* FindNextLocMin(TEdge* E)\n{\n  for (;;)\n  {\n    while (E->Bot != E->Prev->Bot || E->Curr == E->Top) E = E->Next;\n    if (!IsHorizontal(*E) && !IsHorizontal(*E->Prev)) break;\n    while (IsHorizontal(*E->Prev)) E = E->Prev;\n    TEdge* E2 = E;\n    while (IsHorizontal(*E)) E = E->Next;\n    if (E->Top.Y == E->Prev->Bot.Y) continue; //ie 
just an intermediate horz.\n    if (E2->Prev->Bot.X < E->Bot.X) E = E2;\n    break;\n  }\n  return E;\n}\n//------------------------------------------------------------------------------\n\nTEdge* ClipperBase::ProcessBound(TEdge* E, bool NextIsForward)\n{\n  TEdge *Result = E;\n  TEdge *Horz = 0;\n\n  if (E->OutIdx == Skip)\n  {\n    //if edges still remain in the current bound beyond the skip edge then\n    //create another LocMin and call ProcessBound once more\n    if (NextIsForward)\n    {\n      while (E->Top.Y == E->Next->Bot.Y) E = E->Next;\n      //don't include top horizontals when parsing a bound a second time,\n      //they will be contained in the opposite bound ...\n      while (E != Result && IsHorizontal(*E)) E = E->Prev;\n    }\n    else\n    {\n      while (E->Top.Y == E->Prev->Bot.Y) E = E->Prev;\n      while (E != Result && IsHorizontal(*E)) E = E->Next;\n    }\n\n    if (E == Result)\n    {\n      if (NextIsForward) Result = E->Next;\n      else Result = E->Prev;\n    }\n    else\n    {\n      //there are more edges in the bound beyond result starting with E\n      if (NextIsForward)\n        E = Result->Next;\n      else\n        E = Result->Prev;\n      MinimaList::value_type locMin;\n      locMin.Y = E->Bot.Y;\n      locMin.LeftBound = 0;\n      locMin.RightBound = E;\n      E->WindDelta = 0;\n      Result = ProcessBound(E, NextIsForward);\n      m_MinimaList.push_back(locMin);\n    }\n    return Result;\n  }\n\n  TEdge *EStart;\n\n  if (IsHorizontal(*E))\n  {\n    //We need to be careful with open paths because this may not be a\n    //true local minima (ie E may be following a skip edge).\n    //Also, consecutive horz. 
edges may start heading left before going right.\n    if (NextIsForward) \n      EStart = E->Prev;\n    else \n      EStart = E->Next;\n    if (IsHorizontal(*EStart)) //ie an adjoining horizontal skip edge\n      {\n        if (EStart->Bot.X != E->Bot.X && EStart->Top.X != E->Bot.X)\n          ReverseHorizontal(*E);\n      }\n      else if (EStart->Bot.X != E->Bot.X)\n        ReverseHorizontal(*E);\n  }\n  \n  EStart = E;\n  if (NextIsForward)\n  {\n    while (Result->Top.Y == Result->Next->Bot.Y && Result->Next->OutIdx != Skip)\n      Result = Result->Next;\n    if (IsHorizontal(*Result) && Result->Next->OutIdx != Skip)\n    {\n      //nb: at the top of a bound, horizontals are added to the bound\n      //only when the preceding edge attaches to the horizontal's left vertex\n      //unless a Skip edge is encountered when that becomes the top divide\n      Horz = Result;\n      while (IsHorizontal(*Horz->Prev)) Horz = Horz->Prev;\n      if (Horz->Prev->Top.X > Result->Next->Top.X) Result = Horz->Prev;\n    }\n    while (E != Result) \n    {\n      E->NextInLML = E->Next;\n      if (IsHorizontal(*E) && E != EStart &&\n        E->Bot.X != E->Prev->Top.X) ReverseHorizontal(*E);\n      E = E->Next;\n    }\n    if (IsHorizontal(*E) && E != EStart && E->Bot.X != E->Prev->Top.X) \n      ReverseHorizontal(*E);\n    Result = Result->Next; //move to the edge just beyond current bound\n  } else\n  {\n    while (Result->Top.Y == Result->Prev->Bot.Y && Result->Prev->OutIdx != Skip) \n      Result = Result->Prev;\n    if (IsHorizontal(*Result) && Result->Prev->OutIdx != Skip)\n    {\n      Horz = Result;\n      while (IsHorizontal(*Horz->Next)) Horz = Horz->Next;\n      if (Horz->Next->Top.X == Result->Prev->Top.X ||\n          Horz->Next->Top.X > Result->Prev->Top.X) Result = Horz->Next;\n    }\n\n    while (E != Result)\n    {\n      E->NextInLML = E->Prev;\n      if (IsHorizontal(*E) && E != EStart && E->Bot.X != E->Next->Top.X) \n        ReverseHorizontal(*E);\n      E = 
E->Prev;\n    }\n    if (IsHorizontal(*E) && E != EStart && E->Bot.X != E->Next->Top.X) \n      ReverseHorizontal(*E);\n    Result = Result->Prev; //move to the edge just beyond current bound\n  }\n\n  return Result;\n}\n//------------------------------------------------------------------------------\n\nbool ClipperBase::AddPath(const Path &pg, PolyType PolyTyp, bool Closed)\n{\n#ifdef use_lines\n  if (!Closed && PolyTyp == ptClip)\n    throw clipperException(\"AddPath: Open paths must be subject.\");\n#else\n  if (!Closed)\n    throw clipperException(\"AddPath: Open paths have been disabled.\");\n#endif\n\n  int highI = (int)pg.size() -1;\n  if (Closed) while (highI > 0 && (pg[highI] == pg[0])) --highI;\n  while (highI > 0 && (pg[highI] == pg[highI -1])) --highI;\n  if ((Closed && highI < 2) || (!Closed && highI < 1)) return false;\n\n  //create a new edge array ...\n  TEdge *edges = new TEdge [highI +1];\n\n  bool IsFlat = true;\n  //1. Basic (first) edge initialization ...\n  try\n  {\n    edges[1].Curr = pg[1];\n    RangeTest(pg[0], m_UseFullRange);\n    RangeTest(pg[highI], m_UseFullRange);\n    InitEdge(&edges[0], &edges[1], &edges[highI], pg[0]);\n    InitEdge(&edges[highI], &edges[0], &edges[highI-1], pg[highI]);\n    for (int i = highI - 1; i >= 1; --i)\n    {\n      RangeTest(pg[i], m_UseFullRange);\n      InitEdge(&edges[i], &edges[i+1], &edges[i-1], pg[i]);\n    }\n  }\n  catch(...)\n  {\n    delete [] edges;\n    throw; //range test fails\n  }\n  TEdge *eStart = &edges[0];\n\n  //2. 
Remove duplicate vertices, and (when closed) collinear edges ...\n  TEdge *E = eStart, *eLoopStop = eStart;\n  for (;;)\n  {\n    //nb: allows matching start and end points when not Closed ...\n    if (E->Curr == E->Next->Curr && (Closed || E->Next != eStart))\n    {\n      if (E == E->Next) break;\n      if (E == eStart) eStart = E->Next;\n      E = RemoveEdge(E);\n      eLoopStop = E;\n      continue;\n    }\n    if (E->Prev == E->Next) \n      break; //only two vertices\n    else if (Closed &&\n      SlopesEqual(E->Prev->Curr, E->Curr, E->Next->Curr, m_UseFullRange) && \n      (!m_PreserveCollinear ||\n      !Pt2IsBetweenPt1AndPt3(E->Prev->Curr, E->Curr, E->Next->Curr)))\n    {\n      //Collinear edges are allowed for open paths but in closed paths\n      //the default is to merge adjacent collinear edges into a single edge.\n      //However, if the PreserveCollinear property is enabled, only overlapping\n      //collinear edges (ie spikes) will be removed from closed paths.\n      if (E == eStart) eStart = E->Next;\n      E = RemoveEdge(E);\n      E = E->Prev;\n      eLoopStop = E;\n      continue;\n    }\n    E = E->Next;\n    if ((E == eLoopStop) || (!Closed && E->Next == eStart)) break;\n  }\n\n  if ((!Closed && (E == E->Next)) || (Closed && (E->Prev == E->Next)))\n  {\n    delete [] edges;\n    return false;\n  }\n\n  if (!Closed)\n  { \n    m_HasOpenPaths = true;\n    eStart->Prev->OutIdx = Skip;\n  }\n\n  //3. Do second stage of edge initialization ...\n  E = eStart;\n  do\n  {\n    InitEdge2(*E, PolyTyp);\n    E = E->Next;\n    if (IsFlat && E->Curr.Y != eStart->Curr.Y) IsFlat = false;\n  }\n  while (E != eStart);\n\n  //4. 
Finally, add edge bounds to LocalMinima list ...\n\n  //Totally flat paths must be handled differently when adding them\n  //to LocalMinima list to avoid endless loops etc ...\n  if (IsFlat) \n  {\n    if (Closed) \n    {\n      delete [] edges;\n      return false;\n    }\n    E->Prev->OutIdx = Skip;\n    MinimaList::value_type locMin;\n    locMin.Y = E->Bot.Y;\n    locMin.LeftBound = 0;\n    locMin.RightBound = E;\n    locMin.RightBound->Side = esRight;\n    locMin.RightBound->WindDelta = 0;\n    for (;;)\n    {\n      if (E->Bot.X != E->Prev->Top.X) ReverseHorizontal(*E);\n      if (E->Next->OutIdx == Skip) break;\n      E->NextInLML = E->Next;\n      E = E->Next;\n    }\n    m_MinimaList.push_back(locMin);\n    m_edges.push_back(edges);\n    return true;\n  }\n\n  m_edges.push_back(edges);\n  bool leftBoundIsForward;\n  TEdge* EMin = 0;\n\n  //workaround to avoid an endless loop in the for loop below when\n  //open paths have matching start and end points ...\n  if (E->Prev->Bot == E->Prev->Top) E = E->Next;\n\n  for (;;)\n  {\n    E = FindNextLocMin(E);\n    if (E == EMin) break;\n    else if (!EMin) EMin = E;\n\n    //E and E.Prev now share a local minimum (left aligned if horizontal).\n    //Compare their slopes to find which starts which bound ...\n    MinimaList::value_type locMin;\n    locMin.Y = E->Bot.Y;\n    if (E->Dx < E->Prev->Dx) \n    {\n      locMin.LeftBound = E->Prev;\n      locMin.RightBound = E;\n      leftBoundIsForward = false; //Q.nextInLML = Q.prev\n    } else\n    {\n      locMin.LeftBound = E;\n      locMin.RightBound = E->Prev;\n      leftBoundIsForward = true; //Q.nextInLML = Q.next\n    }\n\n    if (!Closed) locMin.LeftBound->WindDelta = 0;\n    else if (locMin.LeftBound->Next == locMin.RightBound)\n      locMin.LeftBound->WindDelta = -1;\n    else locMin.LeftBound->WindDelta = 1;\n    locMin.RightBound->WindDelta = -locMin.LeftBound->WindDelta;\n\n    E = ProcessBound(locMin.LeftBound, leftBoundIsForward);\n    if (E->OutIdx == 
Skip) E = ProcessBound(E, leftBoundIsForward);\n\n    TEdge* E2 = ProcessBound(locMin.RightBound, !leftBoundIsForward);\n    if (E2->OutIdx == Skip) E2 = ProcessBound(E2, !leftBoundIsForward);\n\n    if (locMin.LeftBound->OutIdx == Skip)\n      locMin.LeftBound = 0;\n    else if (locMin.RightBound->OutIdx == Skip)\n      locMin.RightBound = 0;\n    m_MinimaList.push_back(locMin);\n    if (!leftBoundIsForward) E = E2;\n  }\n  return true;\n}\n//------------------------------------------------------------------------------\n\nbool ClipperBase::AddPaths(const Paths &ppg, PolyType PolyTyp, bool Closed)\n{\n  bool result = false;\n  for (Paths::size_type i = 0; i < ppg.size(); ++i)\n    if (AddPath(ppg[i], PolyTyp, Closed)) result = true;\n  return result;\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperBase::Clear()\n{\n  DisposeLocalMinimaList();\n  for (EdgeList::size_type i = 0; i < m_edges.size(); ++i)\n  {\n    TEdge* edges = m_edges[i];\n    delete [] edges;\n  }\n  m_edges.clear();\n  m_UseFullRange = false;\n  m_HasOpenPaths = false;\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperBase::Reset()\n{\n  m_CurrentLM = m_MinimaList.begin();\n  if (m_CurrentLM == m_MinimaList.end()) return; //ie nothing to process\n  std::sort(m_MinimaList.begin(), m_MinimaList.end(), LocMinSorter());\n\n  m_Scanbeam = ScanbeamList(); //clears/resets priority_queue\n  //reset all edges ...\n  for (MinimaList::iterator lm = m_MinimaList.begin(); lm != m_MinimaList.end(); ++lm)\n  {\n    InsertScanbeam(lm->Y);\n    TEdge* e = lm->LeftBound;\n    if (e)\n    {\n      e->Curr = e->Bot;\n      e->Side = esLeft;\n      e->OutIdx = Unassigned;\n    }\n\n    e = lm->RightBound;\n    if (e)\n    {\n      e->Curr = e->Bot;\n      e->Side = esRight;\n      e->OutIdx = Unassigned;\n    }\n  }\n  m_ActiveEdges = 0;\n  m_CurrentLM = 
m_MinimaList.begin();\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperBase::DisposeLocalMinimaList()\n{\n  m_MinimaList.clear();\n  m_CurrentLM = m_MinimaList.begin();\n}\n//------------------------------------------------------------------------------\n\nbool ClipperBase::PopLocalMinima(cInt Y, const LocalMinimum *&locMin)\n{\n  if (m_CurrentLM == m_MinimaList.end() || (*m_CurrentLM).Y != Y) return false;\n  locMin = &(*m_CurrentLM);\n  ++m_CurrentLM;\n  return true;\n}\n//------------------------------------------------------------------------------\n\nIntRect ClipperBase::GetBounds()\n{\n  IntRect result;\n  MinimaList::iterator lm = m_MinimaList.begin();\n  if (lm == m_MinimaList.end())\n  {\n    result.left = result.top = result.right = result.bottom = 0;\n    return result;\n  }\n  result.left = lm->LeftBound->Bot.X;\n  result.top = lm->LeftBound->Bot.Y;\n  result.right = lm->LeftBound->Bot.X;\n  result.bottom = lm->LeftBound->Bot.Y;\n  while (lm != m_MinimaList.end())\n  {\n    //todo - needs fixing for open paths\n    result.bottom = std::max(result.bottom, lm->LeftBound->Bot.Y);\n    TEdge* e = lm->LeftBound;\n    for (;;) {\n      TEdge* bottomE = e;\n      while (e->NextInLML)\n      {\n        if (e->Bot.X < result.left) result.left = e->Bot.X;\n        if (e->Bot.X > result.right) result.right = e->Bot.X;\n        e = e->NextInLML;\n      }\n      result.left = std::min(result.left, e->Bot.X);\n      result.right = std::max(result.right, e->Bot.X);\n      result.left = std::min(result.left, e->Top.X);\n      result.right = std::max(result.right, e->Top.X);\n      result.top = std::min(result.top, e->Top.Y);\n      if (bottomE == lm->LeftBound) e = lm->RightBound;\n      else break;\n    }\n    ++lm;\n  }\n  return result;\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperBase::InsertScanbeam(const cInt Y)\n{\n  
m_Scanbeam.push(Y);\n}\n//------------------------------------------------------------------------------\n\nbool ClipperBase::PopScanbeam(cInt &Y)\n{\n  if (m_Scanbeam.empty()) return false;\n  Y = m_Scanbeam.top();\n  m_Scanbeam.pop();\n  while (!m_Scanbeam.empty() && Y == m_Scanbeam.top()) { m_Scanbeam.pop(); } // Pop duplicates.\n  return true;\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperBase::DisposeAllOutRecs(){\n  for (PolyOutList::size_type i = 0; i < m_PolyOuts.size(); ++i)\n    DisposeOutRec(i);\n  m_PolyOuts.clear();\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperBase::DisposeOutRec(PolyOutList::size_type index)\n{\n  OutRec *outRec = m_PolyOuts[index];\n  if (outRec->Pts) DisposeOutPts(outRec->Pts);\n  delete outRec;\n  m_PolyOuts[index] = 0;\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperBase::DeleteFromAEL(TEdge *e)\n{\n  TEdge* AelPrev = e->PrevInAEL;\n  TEdge* AelNext = e->NextInAEL;\n  if (!AelPrev &&  !AelNext && (e != m_ActiveEdges)) return; //already deleted\n  if (AelPrev) AelPrev->NextInAEL = AelNext;\n  else m_ActiveEdges = AelNext;\n  if (AelNext) AelNext->PrevInAEL = AelPrev;\n  e->NextInAEL = 0;\n  e->PrevInAEL = 0;\n}\n//------------------------------------------------------------------------------\n\nOutRec* ClipperBase::CreateOutRec()\n{\n  OutRec* result = new OutRec;\n  result->IsHole = false;\n  result->IsOpen = false;\n  result->FirstLeft = 0;\n  result->Pts = 0;\n  result->BottomPt = 0;\n  result->PolyNd = 0;\n  m_PolyOuts.push_back(result);\n  result->Idx = (int)m_PolyOuts.size() - 1;\n  return result;\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperBase::SwapPositionsInAEL(TEdge *Edge1, TEdge *Edge2)\n{\n  //check that one or other edge hasn't already been removed from AEL ...\n  if (Edge1->NextInAEL == Edge1->PrevInAEL 
||\n    Edge2->NextInAEL == Edge2->PrevInAEL) return;\n\n  if (Edge1->NextInAEL == Edge2)\n  {\n    TEdge* Next = Edge2->NextInAEL;\n    if (Next) Next->PrevInAEL = Edge1;\n    TEdge* Prev = Edge1->PrevInAEL;\n    if (Prev) Prev->NextInAEL = Edge2;\n    Edge2->PrevInAEL = Prev;\n    Edge2->NextInAEL = Edge1;\n    Edge1->PrevInAEL = Edge2;\n    Edge1->NextInAEL = Next;\n  }\n  else if (Edge2->NextInAEL == Edge1)\n  {\n    TEdge* Next = Edge1->NextInAEL;\n    if (Next) Next->PrevInAEL = Edge2;\n    TEdge* Prev = Edge2->PrevInAEL;\n    if (Prev) Prev->NextInAEL = Edge1;\n    Edge1->PrevInAEL = Prev;\n    Edge1->NextInAEL = Edge2;\n    Edge2->PrevInAEL = Edge1;\n    Edge2->NextInAEL = Next;\n  }\n  else\n  {\n    TEdge* Next = Edge1->NextInAEL;\n    TEdge* Prev = Edge1->PrevInAEL;\n    Edge1->NextInAEL = Edge2->NextInAEL;\n    if (Edge1->NextInAEL) Edge1->NextInAEL->PrevInAEL = Edge1;\n    Edge1->PrevInAEL = Edge2->PrevInAEL;\n    if (Edge1->PrevInAEL) Edge1->PrevInAEL->NextInAEL = Edge1;\n    Edge2->NextInAEL = Next;\n    if (Edge2->NextInAEL) Edge2->NextInAEL->PrevInAEL = Edge2;\n    Edge2->PrevInAEL = Prev;\n    if (Edge2->PrevInAEL) Edge2->PrevInAEL->NextInAEL = Edge2;\n  }\n\n  if (!Edge1->PrevInAEL) m_ActiveEdges = Edge1;\n  else if (!Edge2->PrevInAEL) m_ActiveEdges = Edge2;\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperBase::UpdateEdgeIntoAEL(TEdge *&e)\n{\n  if (!e->NextInLML) \n    throw clipperException(\"UpdateEdgeIntoAEL: invalid call\");\n\n  e->NextInLML->OutIdx = e->OutIdx;\n  TEdge* AelPrev = e->PrevInAEL;\n  TEdge* AelNext = e->NextInAEL;\n  if (AelPrev) AelPrev->NextInAEL = e->NextInLML;\n  else m_ActiveEdges = e->NextInLML;\n  if (AelNext) AelNext->PrevInAEL = e->NextInLML;\n  e->NextInLML->Side = e->Side;\n  e->NextInLML->WindDelta = e->WindDelta;\n  e->NextInLML->WindCnt = e->WindCnt;\n  e->NextInLML->WindCnt2 = e->WindCnt2;\n  e = e->NextInLML;\n  e->Curr = e->Bot;\n  e->PrevInAEL = AelPrev;\n  
e->NextInAEL = AelNext;\n  if (!IsHorizontal(*e)) InsertScanbeam(e->Top.Y);\n}\n//------------------------------------------------------------------------------\n\nbool ClipperBase::LocalMinimaPending()\n{\n  return (m_CurrentLM != m_MinimaList.end());\n}\n\n//------------------------------------------------------------------------------\n// TClipper methods ...\n//------------------------------------------------------------------------------\n\nClipper::Clipper(int initOptions) : ClipperBase() //constructor\n{\n  m_ExecuteLocked = false;\n  m_UseFullRange = false;\n  m_ReverseOutput = ((initOptions & ioReverseSolution) != 0);\n  m_StrictSimple = ((initOptions & ioStrictlySimple) != 0);\n  m_PreserveCollinear = ((initOptions & ioPreserveCollinear) != 0);\n  m_HasOpenPaths = false;\n#ifdef use_xyz  \n  m_ZFill = 0;\n#endif\n}\n//------------------------------------------------------------------------------\n\n#ifdef use_xyz  \nvoid Clipper::ZFillFunction(ZFillCallback zFillFunc)\n{  \n  m_ZFill = zFillFunc;\n}\n//------------------------------------------------------------------------------\n#endif\n\nbool Clipper::Execute(ClipType clipType, Paths &solution, PolyFillType fillType)\n{\n    return Execute(clipType, solution, fillType, fillType);\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::Execute(ClipType clipType, PolyTree &polytree, PolyFillType fillType)\n{\n    return Execute(clipType, polytree, fillType, fillType);\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::Execute(ClipType clipType, Paths &solution,\n    PolyFillType subjFillType, PolyFillType clipFillType)\n{\n  if( m_ExecuteLocked ) return false;\n  if (m_HasOpenPaths)\n    throw clipperException(\"Error: PolyTree struct is needed for open path clipping.\");\n  m_ExecuteLocked = true;\n  solution.resize(0);\n  m_SubjFillType = subjFillType;\n  m_ClipFillType = clipFillType;\n  m_ClipType = 
clipType;\n  m_UsingPolyTree = false;\n  bool succeeded = ExecuteInternal();\n  if (succeeded) BuildResult(solution);\n  DisposeAllOutRecs();\n  m_ExecuteLocked = false;\n  return succeeded;\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::Execute(ClipType clipType, PolyTree& polytree,\n    PolyFillType subjFillType, PolyFillType clipFillType)\n{\n  if( m_ExecuteLocked ) return false;\n  m_ExecuteLocked = true;\n  m_SubjFillType = subjFillType;\n  m_ClipFillType = clipFillType;\n  m_ClipType = clipType;\n  m_UsingPolyTree = true;\n  bool succeeded = ExecuteInternal();\n  if (succeeded) BuildResult2(polytree);\n  DisposeAllOutRecs();\n  m_ExecuteLocked = false;\n  return succeeded;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::FixHoleLinkage(OutRec &outrec)\n{\n  //skip OutRecs that (a) contain outermost polygons or\n  //(b) already have the correct owner/child linkage ...\n  if (!outrec.FirstLeft ||                \n      (outrec.IsHole != outrec.FirstLeft->IsHole &&\n      outrec.FirstLeft->Pts)) return;\n\n  OutRec* orfl = outrec.FirstLeft;\n  while (orfl && ((orfl->IsHole == outrec.IsHole) || !orfl->Pts))\n      orfl = orfl->FirstLeft;\n  outrec.FirstLeft = orfl;\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::ExecuteInternal()\n{\n  bool succeeded = true;\n  try {\n    Reset();\n    m_Maxima = MaximaList();\n    m_SortedEdges = 0;\n\n    succeeded = true;\n    cInt botY, topY;\n    if (!PopScanbeam(botY)) return false;\n    InsertLocalMinimaIntoAEL(botY);\n    while (PopScanbeam(topY) || LocalMinimaPending())\n    {\n      ProcessHorizontals();\n\t    ClearGhostJoins();\n      if (!ProcessIntersections(topY))\n      {\n        succeeded = false;\n        break;\n      }\n      ProcessEdgesAtTopOfScanbeam(topY);\n      botY = topY;\n      InsertLocalMinimaIntoAEL(botY);\n    }\n  }\n  catch(...) 
\n  {\n    succeeded = false;\n  }\n\n  if (succeeded)\n  {\n    //fix orientations ...\n    for (PolyOutList::size_type i = 0; i < m_PolyOuts.size(); ++i)\n    {\n      OutRec *outRec = m_PolyOuts[i];\n      if (!outRec->Pts || outRec->IsOpen) continue;\n      if ((outRec->IsHole ^ m_ReverseOutput) == (Area(*outRec) > 0))\n        ReversePolyPtLinks(outRec->Pts);\n    }\n\n    if (!m_Joins.empty()) JoinCommonEdges();\n\n    //unfortunately FixupOutPolygon() must be done after JoinCommonEdges()\n    for (PolyOutList::size_type i = 0; i < m_PolyOuts.size(); ++i)\n    {\n      OutRec *outRec = m_PolyOuts[i];\n      if (!outRec->Pts) continue;\n      if (outRec->IsOpen)\n        FixupOutPolyline(*outRec);\n      else\n        FixupOutPolygon(*outRec);\n    }\n\n    if (m_StrictSimple) DoSimplePolygons();\n  }\n\n  ClearJoins();\n  ClearGhostJoins();\n  return succeeded;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::SetWindingCount(TEdge &edge)\n{\n  TEdge *e = edge.PrevInAEL;\n  //find the edge of the same polytype that immediately precedes 'edge' in AEL\n  while (e  && ((e->PolyTyp != edge.PolyTyp) || (e->WindDelta == 0))) e = e->PrevInAEL;\n  if (!e)\n  {\n    if (edge.WindDelta == 0)\n    {\n      PolyFillType pft = (edge.PolyTyp == ptSubject ? m_SubjFillType : m_ClipFillType);\n      edge.WindCnt = (pft == pftNegative ? 
-1 : 1);\n    }\n    else\n      edge.WindCnt = edge.WindDelta;\n    edge.WindCnt2 = 0;\n    e = m_ActiveEdges; //ie get ready to calc WindCnt2\n  }   \n  else if (edge.WindDelta == 0 && m_ClipType != ctUnion)\n  {\n    edge.WindCnt = 1;\n    edge.WindCnt2 = e->WindCnt2;\n    e = e->NextInAEL; //ie get ready to calc WindCnt2\n  }\n  else if (IsEvenOddFillType(edge))\n  {\n    //EvenOdd filling ...\n    if (edge.WindDelta == 0)\n    {\n      //are we inside a subj polygon ...\n      bool Inside = true;\n      TEdge *e2 = e->PrevInAEL;\n      while (e2)\n      {\n        if (e2->PolyTyp == e->PolyTyp && e2->WindDelta != 0) \n          Inside = !Inside;\n        e2 = e2->PrevInAEL;\n      }\n      edge.WindCnt = (Inside ? 0 : 1);\n    }\n    else\n    {\n      edge.WindCnt = edge.WindDelta;\n    }\n    edge.WindCnt2 = e->WindCnt2;\n    e = e->NextInAEL; //ie get ready to calc WindCnt2\n  } \n  else\n  {\n    //nonZero, Positive or Negative filling ...\n    if (e->WindCnt * e->WindDelta < 0)\n    {\n      //prev edge is 'decreasing' WindCount (WC) toward zero\n      //so we're outside the previous polygon ...\n      if (Abs(e->WindCnt) > 1)\n      {\n        //outside prev poly but still inside another.\n        //when reversing direction of prev poly use the same WC \n        if (e->WindDelta * edge.WindDelta < 0) edge.WindCnt = e->WindCnt;\n        //otherwise continue to 'decrease' WC ...\n        else edge.WindCnt = e->WindCnt + edge.WindDelta;\n      } \n      else\n        //now outside all polys of same polytype so set own WC ...\n        edge.WindCnt = (edge.WindDelta == 0 ? 1 : edge.WindDelta);\n    } else\n    {\n      //prev edge is 'increasing' WindCount (WC) away from zero\n      //so we're inside the previous polygon ...\n      if (edge.WindDelta == 0) \n        edge.WindCnt = (e->WindCnt < 0 ? 
e->WindCnt - 1 : e->WindCnt + 1);\n      //if wind direction is reversing prev then use same WC\n      else if (e->WindDelta * edge.WindDelta < 0) edge.WindCnt = e->WindCnt;\n      //otherwise add to WC ...\n      else edge.WindCnt = e->WindCnt + edge.WindDelta;\n    }\n    edge.WindCnt2 = e->WindCnt2;\n    e = e->NextInAEL; //ie get ready to calc WindCnt2\n  }\n\n  //update WindCnt2 ...\n  if (IsEvenOddAltFillType(edge))\n  {\n    //EvenOdd filling ...\n    while (e != &edge)\n    {\n      if (e->WindDelta != 0)\n        edge.WindCnt2 = (edge.WindCnt2 == 0 ? 1 : 0);\n      e = e->NextInAEL;\n    }\n  } else\n  {\n    //nonZero, Positive or Negative filling ...\n    while ( e != &edge )\n    {\n      edge.WindCnt2 += e->WindDelta;\n      e = e->NextInAEL;\n    }\n  }\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::IsEvenOddFillType(const TEdge& edge) const\n{\n  if (edge.PolyTyp == ptSubject)\n    return m_SubjFillType == pftEvenOdd; else\n    return m_ClipFillType == pftEvenOdd;\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::IsEvenOddAltFillType(const TEdge& edge) const\n{\n  if (edge.PolyTyp == ptSubject)\n    return m_ClipFillType == pftEvenOdd; else\n    return m_SubjFillType == pftEvenOdd;\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::IsContributing(const TEdge& edge) const\n{\n  PolyFillType pft, pft2;\n  if (edge.PolyTyp == ptSubject)\n  {\n    pft = m_SubjFillType;\n    pft2 = m_ClipFillType;\n  } else\n  {\n    pft = m_ClipFillType;\n    pft2 = m_SubjFillType;\n  }\n\n  switch(pft)\n  {\n    case pftEvenOdd: \n      //return false if a subj line has been flagged as inside a subj polygon\n      if (edge.WindDelta == 0 && edge.WindCnt != 1) return false;\n      break;\n    case pftNonZero:\n      if (Abs(edge.WindCnt) != 1) return false;\n      break;\n    case pftPositive: \n      if 
(edge.WindCnt != 1) return false;\n      break;\n    default: //pftNegative\n      if (edge.WindCnt != -1) return false;\n  }\n\n  switch(m_ClipType)\n  {\n    case ctIntersection:\n      switch(pft2)\n      {\n        case pftEvenOdd: \n        case pftNonZero: \n          return (edge.WindCnt2 != 0);\n        case pftPositive: \n          return (edge.WindCnt2 > 0);\n        default: \n          return (edge.WindCnt2 < 0);\n      }\n      break;\n    case ctUnion:\n      switch(pft2)\n      {\n        case pftEvenOdd: \n        case pftNonZero: \n          return (edge.WindCnt2 == 0);\n        case pftPositive: \n          return (edge.WindCnt2 <= 0);\n        default: \n          return (edge.WindCnt2 >= 0);\n      }\n      break;\n    case ctDifference:\n      if (edge.PolyTyp == ptSubject)\n        switch(pft2)\n        {\n          case pftEvenOdd: \n          case pftNonZero: \n            return (edge.WindCnt2 == 0);\n          case pftPositive: \n            return (edge.WindCnt2 <= 0);\n          default: \n            return (edge.WindCnt2 >= 0);\n        }\n      else\n        switch(pft2)\n        {\n          case pftEvenOdd: \n          case pftNonZero: \n            return (edge.WindCnt2 != 0);\n          case pftPositive: \n            return (edge.WindCnt2 > 0);\n          default: \n            return (edge.WindCnt2 < 0);\n        }\n      break;\n    case ctXor:\n      if (edge.WindDelta == 0) //XOr always contributing unless open\n        switch(pft2)\n        {\n          case pftEvenOdd: \n          case pftNonZero: \n            return (edge.WindCnt2 == 0);\n          case pftPositive: \n            return (edge.WindCnt2 <= 0);\n          default: \n            return (edge.WindCnt2 >= 0);\n        }\n      else \n        return true;\n      break;\n    default:\n      return true;\n  }\n}\n//------------------------------------------------------------------------------\n\nOutPt* Clipper::AddLocalMinPoly(TEdge *e1, TEdge *e2, const IntPoint 
&Pt)\n{\n  OutPt* result;\n  TEdge *e, *prevE;\n  if (IsHorizontal(*e2) || ( e1->Dx > e2->Dx ))\n  {\n    result = AddOutPt(e1, Pt);\n    e2->OutIdx = e1->OutIdx;\n    e1->Side = esLeft;\n    e2->Side = esRight;\n    e = e1;\n    if (e->PrevInAEL == e2)\n      prevE = e2->PrevInAEL; \n    else\n      prevE = e->PrevInAEL;\n  } else\n  {\n    result = AddOutPt(e2, Pt);\n    e1->OutIdx = e2->OutIdx;\n    e1->Side = esRight;\n    e2->Side = esLeft;\n    e = e2;\n    if (e->PrevInAEL == e1)\n        prevE = e1->PrevInAEL;\n    else\n        prevE = e->PrevInAEL;\n  }\n\n  if (prevE && prevE->OutIdx >= 0 && prevE->Top.Y < Pt.Y && e->Top.Y < Pt.Y) \n  {\n    cInt xPrev = TopX(*prevE, Pt.Y);\n    cInt xE = TopX(*e, Pt.Y);\n    if (xPrev == xE && (e->WindDelta != 0) && (prevE->WindDelta != 0) &&\n      SlopesEqual(IntPoint(xPrev, Pt.Y), prevE->Top, IntPoint(xE, Pt.Y), e->Top, m_UseFullRange))\n    {\n      OutPt* outPt = AddOutPt(prevE, Pt);\n      AddJoin(result, outPt, e->Top);\n    }\n  }\n  return result;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::AddLocalMaxPoly(TEdge *e1, TEdge *e2, const IntPoint &Pt)\n{\n  AddOutPt( e1, Pt );\n  if (e2->WindDelta == 0) AddOutPt(e2, Pt);\n  if( e1->OutIdx == e2->OutIdx )\n  {\n    e1->OutIdx = Unassigned;\n    e2->OutIdx = Unassigned;\n  }\n  else if (e1->OutIdx < e2->OutIdx) \n    AppendPolygon(e1, e2); \n  else \n    AppendPolygon(e2, e1);\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::AddEdgeToSEL(TEdge *edge)\n{\n  //SEL pointers in PEdge are reused to build a list of horizontal edges.\n  //However, we don't need to worry about order with horizontal edge processing.\n  if( !m_SortedEdges )\n  {\n    m_SortedEdges = edge;\n    edge->PrevInSEL = 0;\n    edge->NextInSEL = 0;\n  }\n  else\n  {\n    edge->NextInSEL = m_SortedEdges;\n    edge->PrevInSEL = 0;\n    m_SortedEdges->PrevInSEL = edge;\n    m_SortedEdges = 
edge;\n  }\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::PopEdgeFromSEL(TEdge *&edge)\n{\n  if (!m_SortedEdges) return false;\n  edge = m_SortedEdges;\n  DeleteFromSEL(m_SortedEdges);\n  return true;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::CopyAELToSEL()\n{\n  TEdge* e = m_ActiveEdges;\n  m_SortedEdges = e;\n  while ( e )\n  {\n    e->PrevInSEL = e->PrevInAEL;\n    e->NextInSEL = e->NextInAEL;\n    e = e->NextInAEL;\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::AddJoin(OutPt *op1, OutPt *op2, const IntPoint OffPt)\n{\n  Join* j = new Join;\n  j->OutPt1 = op1;\n  j->OutPt2 = op2;\n  j->OffPt = OffPt;\n  m_Joins.push_back(j);\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::ClearJoins()\n{\n  for (JoinList::size_type i = 0; i < m_Joins.size(); i++)\n    delete m_Joins[i];\n  m_Joins.resize(0);\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::ClearGhostJoins()\n{\n  for (JoinList::size_type i = 0; i < m_GhostJoins.size(); i++)\n    delete m_GhostJoins[i];\n  m_GhostJoins.resize(0);\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::AddGhostJoin(OutPt *op, const IntPoint OffPt)\n{\n  Join* j = new Join;\n  j->OutPt1 = op;\n  j->OutPt2 = 0;\n  j->OffPt = OffPt;\n  m_GhostJoins.push_back(j);\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::InsertLocalMinimaIntoAEL(const cInt botY)\n{\n  const LocalMinimum *lm;\n  while (PopLocalMinima(botY, lm))\n  {\n    TEdge* lb = lm->LeftBound;\n    TEdge* rb = lm->RightBound;\n    \n    OutPt *Op1 = 0;\n    if (!lb)\n    {\n      //nb: don't insert LB into either AEL or SEL\n      InsertEdgeIntoAEL(rb, 0);\n      SetWindingCount(*rb);\n      if 
(IsContributing(*rb))\n        Op1 = AddOutPt(rb, rb->Bot); \n    } \n    else if (!rb)\n    {\n      InsertEdgeIntoAEL(lb, 0);\n      SetWindingCount(*lb);\n      if (IsContributing(*lb))\n        Op1 = AddOutPt(lb, lb->Bot);\n      InsertScanbeam(lb->Top.Y);\n    }\n    else\n    {\n      InsertEdgeIntoAEL(lb, 0);\n      InsertEdgeIntoAEL(rb, lb);\n      SetWindingCount( *lb );\n      rb->WindCnt = lb->WindCnt;\n      rb->WindCnt2 = lb->WindCnt2;\n      if (IsContributing(*lb))\n        Op1 = AddLocalMinPoly(lb, rb, lb->Bot);      \n      InsertScanbeam(lb->Top.Y);\n    }\n\n     if (rb)\n     {\n\t\t if (IsHorizontal(*rb))\n\t\t {\n\t\t\t AddEdgeToSEL(rb);\n\t\t\t if (rb->NextInLML) \n\t\t\t\t InsertScanbeam(rb->NextInLML->Top.Y);\n\t\t }\n\t\t else InsertScanbeam( rb->Top.Y );\n     }\n\n    if (!lb || !rb) continue;\n\n    //if any output polygons share an edge, they'll need joining later ...\n    if (Op1 && IsHorizontal(*rb) && \n      m_GhostJoins.size() > 0 && (rb->WindDelta != 0))\n    {\n      for (JoinList::size_type i = 0; i < m_GhostJoins.size(); ++i)\n      {\n        Join* jr = m_GhostJoins[i];\n        //if the horizontal Rb and a 'ghost' horizontal overlap, then convert\n        //the 'ghost' join to a real join ready for later ...\n        if (HorzSegmentsOverlap(jr->OutPt1->Pt.X, jr->OffPt.X, rb->Bot.X, rb->Top.X))\n          AddJoin(jr->OutPt1, Op1, jr->OffPt);\n      }\n    }\n\n    if (lb->OutIdx >= 0 && lb->PrevInAEL && \n      lb->PrevInAEL->Curr.X == lb->Bot.X &&\n      lb->PrevInAEL->OutIdx >= 0 &&\n      SlopesEqual(lb->PrevInAEL->Bot, lb->PrevInAEL->Top, lb->Curr, lb->Top, m_UseFullRange) &&\n      (lb->WindDelta != 0) && (lb->PrevInAEL->WindDelta != 0))\n    {\n        OutPt *Op2 = AddOutPt(lb->PrevInAEL, lb->Bot);\n        AddJoin(Op1, Op2, lb->Top);\n    }\n\n    if(lb->NextInAEL != rb)\n    {\n\n      if (rb->OutIdx >= 0 && rb->PrevInAEL->OutIdx >= 0 &&\n        SlopesEqual(rb->PrevInAEL->Curr, rb->PrevInAEL->Top, rb->Curr, rb->Top, 
m_UseFullRange) &&\n        (rb->WindDelta != 0) && (rb->PrevInAEL->WindDelta != 0))\n      {\n          OutPt *Op2 = AddOutPt(rb->PrevInAEL, rb->Bot);\n          AddJoin(Op1, Op2, rb->Top);\n      }\n\n      TEdge* e = lb->NextInAEL;\n      if (e)\n      {\n        while( e != rb )\n        {\n          //nb: For calculating winding counts etc, IntersectEdges() assumes\n          //that param1 will be to the Right of param2 ABOVE the intersection ...\n          IntersectEdges(rb , e , lb->Curr); //order important here\n          e = e->NextInAEL;\n        }\n      }\n    }\n    \n  }\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::DeleteFromSEL(TEdge *e)\n{\n  TEdge* SelPrev = e->PrevInSEL;\n  TEdge* SelNext = e->NextInSEL;\n  if( !SelPrev &&  !SelNext && (e != m_SortedEdges) ) return; //already deleted\n  if( SelPrev ) SelPrev->NextInSEL = SelNext;\n  else m_SortedEdges = SelNext;\n  if( SelNext ) SelNext->PrevInSEL = SelPrev;\n  e->NextInSEL = 0;\n  e->PrevInSEL = 0;\n}\n//------------------------------------------------------------------------------\n\n#ifdef use_xyz\nvoid Clipper::SetZ(IntPoint& pt, TEdge& e1, TEdge& e2)\n{\n  if (pt.Z != 0 || !m_ZFill) return;\n  else if (pt == e1.Bot) pt.Z = e1.Bot.Z;\n  else if (pt == e1.Top) pt.Z = e1.Top.Z;\n  else if (pt == e2.Bot) pt.Z = e2.Bot.Z;\n  else if (pt == e2.Top) pt.Z = e2.Top.Z;\n  else (*m_ZFill)(e1.Bot, e1.Top, e2.Bot, e2.Top, pt); \n}\n//------------------------------------------------------------------------------\n#endif\n\nvoid Clipper::IntersectEdges(TEdge *e1, TEdge *e2, IntPoint &Pt)\n{\n  bool e1Contributing = ( e1->OutIdx >= 0 );\n  bool e2Contributing = ( e2->OutIdx >= 0 );\n\n#ifdef use_xyz\n        SetZ(Pt, *e1, *e2);\n#endif\n\n#ifdef use_lines\n  //if either edge is on an OPEN path ...\n  if (e1->WindDelta == 0 || e2->WindDelta == 0)\n  {\n    //ignore subject-subject open path intersections UNLESS they\n    //are both open paths, AND they 
are both 'contributing maximas' ...\n\tif (e1->WindDelta == 0 && e2->WindDelta == 0) return;\n\n    //if intersecting a subj line with a subj poly ...\n    else if (e1->PolyTyp == e2->PolyTyp && \n      e1->WindDelta != e2->WindDelta && m_ClipType == ctUnion)\n    {\n      if (e1->WindDelta == 0)\n      {\n        if (e2Contributing)\n        {\n          AddOutPt(e1, Pt);\n          if (e1Contributing) e1->OutIdx = Unassigned;\n        }\n      }\n      else\n      {\n        if (e1Contributing)\n        {\n          AddOutPt(e2, Pt);\n          if (e2Contributing) e2->OutIdx = Unassigned;\n        }\n      }\n    }\n    else if (e1->PolyTyp != e2->PolyTyp)\n    {\n      //toggle subj open path OutIdx on/off when Abs(clip.WndCnt) == 1 ...\n      if ((e1->WindDelta == 0) && abs(e2->WindCnt) == 1 && \n        (m_ClipType != ctUnion || e2->WindCnt2 == 0))\n      {\n        AddOutPt(e1, Pt);\n        if (e1Contributing) e1->OutIdx = Unassigned;\n      }\n      else if ((e2->WindDelta == 0) && (abs(e1->WindCnt) == 1) && \n        (m_ClipType != ctUnion || e1->WindCnt2 == 0))\n      {\n        AddOutPt(e2, Pt);\n        if (e2Contributing) e2->OutIdx = Unassigned;\n      }\n    }\n    return;\n  }\n#endif\n\n  //update winding counts...\n  //assumes that e1 will be to the Right of e2 ABOVE the intersection\n  if ( e1->PolyTyp == e2->PolyTyp )\n  {\n    if ( IsEvenOddFillType( *e1) )\n    {\n      int oldE1WindCnt = e1->WindCnt;\n      e1->WindCnt = e2->WindCnt;\n      e2->WindCnt = oldE1WindCnt;\n    } else\n    {\n      if (e1->WindCnt + e2->WindDelta == 0 ) e1->WindCnt = -e1->WindCnt;\n      else e1->WindCnt += e2->WindDelta;\n      if ( e2->WindCnt - e1->WindDelta == 0 ) e2->WindCnt = -e2->WindCnt;\n      else e2->WindCnt -= e1->WindDelta;\n    }\n  } else\n  {\n    if (!IsEvenOddFillType(*e2)) e1->WindCnt2 += e2->WindDelta;\n    else e1->WindCnt2 = ( e1->WindCnt2 == 0 ) ? 
1 : 0;\n    if (!IsEvenOddFillType(*e1)) e2->WindCnt2 -= e1->WindDelta;\n    else e2->WindCnt2 = ( e2->WindCnt2 == 0 ) ? 1 : 0;\n  }\n\n  PolyFillType e1FillType, e2FillType, e1FillType2, e2FillType2;\n  if (e1->PolyTyp == ptSubject)\n  {\n    e1FillType = m_SubjFillType;\n    e1FillType2 = m_ClipFillType;\n  } else\n  {\n    e1FillType = m_ClipFillType;\n    e1FillType2 = m_SubjFillType;\n  }\n  if (e2->PolyTyp == ptSubject)\n  {\n    e2FillType = m_SubjFillType;\n    e2FillType2 = m_ClipFillType;\n  } else\n  {\n    e2FillType = m_ClipFillType;\n    e2FillType2 = m_SubjFillType;\n  }\n\n  cInt e1Wc, e2Wc;\n  switch (e1FillType)\n  {\n    case pftPositive: e1Wc = e1->WindCnt; break;\n    case pftNegative: e1Wc = -e1->WindCnt; break;\n    default: e1Wc = Abs(e1->WindCnt);\n  }\n  switch(e2FillType)\n  {\n    case pftPositive: e2Wc = e2->WindCnt; break;\n    case pftNegative: e2Wc = -e2->WindCnt; break;\n    default: e2Wc = Abs(e2->WindCnt);\n  }\n\n  if ( e1Contributing && e2Contributing )\n  {\n    if ((e1Wc != 0 && e1Wc != 1) || (e2Wc != 0 && e2Wc != 1) ||\n      (e1->PolyTyp != e2->PolyTyp && m_ClipType != ctXor) )\n    {\n      AddLocalMaxPoly(e1, e2, Pt); \n    }\n    else\n    {\n      AddOutPt(e1, Pt);\n      AddOutPt(e2, Pt);\n      SwapSides( *e1 , *e2 );\n      SwapPolyIndexes( *e1 , *e2 );\n    }\n  }\n  else if ( e1Contributing )\n  {\n    if (e2Wc == 0 || e2Wc == 1) \n    {\n      AddOutPt(e1, Pt);\n      SwapSides(*e1, *e2);\n      SwapPolyIndexes(*e1, *e2);\n    }\n  }\n  else if ( e2Contributing )\n  {\n    if (e1Wc == 0 || e1Wc == 1) \n    {\n      AddOutPt(e2, Pt);\n      SwapSides(*e1, *e2);\n      SwapPolyIndexes(*e1, *e2);\n    }\n  } \n  else if ( (e1Wc == 0 || e1Wc == 1) && (e2Wc == 0 || e2Wc == 1))\n  {\n    //neither edge is currently contributing ...\n\n    cInt e1Wc2, e2Wc2;\n    switch (e1FillType2)\n    {\n      case pftPositive: e1Wc2 = e1->WindCnt2; break;\n      case pftNegative : e1Wc2 = -e1->WindCnt2; break;\n      default: e1Wc2 = 
Abs(e1->WindCnt2);\n    }\n    switch (e2FillType2)\n    {\n      case pftPositive: e2Wc2 = e2->WindCnt2; break;\n      case pftNegative: e2Wc2 = -e2->WindCnt2; break;\n      default: e2Wc2 = Abs(e2->WindCnt2);\n    }\n\n    if (e1->PolyTyp != e2->PolyTyp)\n    {\n      AddLocalMinPoly(e1, e2, Pt);\n    }\n    else if (e1Wc == 1 && e2Wc == 1)\n      switch( m_ClipType ) {\n        case ctIntersection:\n          if (e1Wc2 > 0 && e2Wc2 > 0)\n            AddLocalMinPoly(e1, e2, Pt);\n          break;\n        case ctUnion:\n          if ( e1Wc2 <= 0 && e2Wc2 <= 0 )\n            AddLocalMinPoly(e1, e2, Pt);\n          break;\n        case ctDifference:\n          if (((e1->PolyTyp == ptClip) && (e1Wc2 > 0) && (e2Wc2 > 0)) ||\n              ((e1->PolyTyp == ptSubject) && (e1Wc2 <= 0) && (e2Wc2 <= 0)))\n                AddLocalMinPoly(e1, e2, Pt);\n          break;\n        case ctXor:\n          AddLocalMinPoly(e1, e2, Pt);\n      }\n    else\n      SwapSides( *e1, *e2 );\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::SetHoleState(TEdge *e, OutRec *outrec)\n{\n  TEdge *e2 = e->PrevInAEL;\n  TEdge *eTmp = 0;\n  while (e2)\n  {\n    if (e2->OutIdx >= 0 && e2->WindDelta != 0)\n    {\n      if (!eTmp) eTmp = e2;\n      else if (eTmp->OutIdx == e2->OutIdx) eTmp = 0;        \n    }\n    e2 = e2->PrevInAEL;\n  }\n  if (!eTmp)\n  {\n    outrec->FirstLeft = 0;\n    outrec->IsHole = false;\n  }\n  else\n  {\n    outrec->FirstLeft = m_PolyOuts[eTmp->OutIdx];\n    outrec->IsHole = !outrec->FirstLeft->IsHole;\n  }\n}\n//------------------------------------------------------------------------------\n\nOutRec* GetLowermostRec(OutRec *outRec1, OutRec *outRec2)\n{\n  //work out which polygon fragment has the correct hole state ...\n  if (!outRec1->BottomPt) \n    outRec1->BottomPt = GetBottomPt(outRec1->Pts);\n  if (!outRec2->BottomPt) \n    outRec2->BottomPt = GetBottomPt(outRec2->Pts);\n  OutPt *OutPt1 = outRec1->BottomPt;\n 
 OutPt *OutPt2 = outRec2->BottomPt;\n  if (OutPt1->Pt.Y > OutPt2->Pt.Y) return outRec1;\n  else if (OutPt1->Pt.Y < OutPt2->Pt.Y) return outRec2;\n  else if (OutPt1->Pt.X < OutPt2->Pt.X) return outRec1;\n  else if (OutPt1->Pt.X > OutPt2->Pt.X) return outRec2;\n  else if (OutPt1->Next == OutPt1) return outRec2;\n  else if (OutPt2->Next == OutPt2) return outRec1;\n  else if (FirstIsBottomPt(OutPt1, OutPt2)) return outRec1;\n  else return outRec2;\n}\n//------------------------------------------------------------------------------\n\nbool OutRec1RightOfOutRec2(OutRec* outRec1, OutRec* outRec2)\n{\n  do\n  {\n    outRec1 = outRec1->FirstLeft;\n    if (outRec1 == outRec2) return true;\n  } while (outRec1);\n  return false;\n}\n//------------------------------------------------------------------------------\n\nOutRec* Clipper::GetOutRec(int Idx)\n{\n  OutRec* outrec = m_PolyOuts[Idx];\n  while (outrec != m_PolyOuts[outrec->Idx])\n    outrec = m_PolyOuts[outrec->Idx];\n  return outrec;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::AppendPolygon(TEdge *e1, TEdge *e2)\n{\n  //get the start and ends of both output polygons ...\n  OutRec *outRec1 = m_PolyOuts[e1->OutIdx];\n  OutRec *outRec2 = m_PolyOuts[e2->OutIdx];\n\n  OutRec *holeStateRec;\n  if (OutRec1RightOfOutRec2(outRec1, outRec2))\n    holeStateRec = outRec2;\n  else if (OutRec1RightOfOutRec2(outRec2, outRec1))\n    holeStateRec = outRec1;\n  else \n    holeStateRec = GetLowermostRec(outRec1, outRec2);\n\n  //get the start and ends of both output polygons and\n  //join e2 poly onto e1 poly and delete pointers to e2 ...\n\n  OutPt* p1_lft = outRec1->Pts;\n  OutPt* p1_rt = p1_lft->Prev;\n  OutPt* p2_lft = outRec2->Pts;\n  OutPt* p2_rt = p2_lft->Prev;\n\n  //join e2 poly onto e1 poly and delete pointers to e2 ...\n  if(  e1->Side == esLeft )\n  {\n    if(  e2->Side == esLeft )\n    {\n      //z y x a b c\n      ReversePolyPtLinks(p2_lft);\n      p2_lft->Next = 
p1_lft;\n      p1_lft->Prev = p2_lft;\n      p1_rt->Next = p2_rt;\n      p2_rt->Prev = p1_rt;\n      outRec1->Pts = p2_rt;\n    } else\n    {\n      //x y z a b c\n      p2_rt->Next = p1_lft;\n      p1_lft->Prev = p2_rt;\n      p2_lft->Prev = p1_rt;\n      p1_rt->Next = p2_lft;\n      outRec1->Pts = p2_lft;\n    }\n  } else\n  {\n    if(  e2->Side == esRight )\n    {\n      //a b c z y x\n      ReversePolyPtLinks(p2_lft);\n      p1_rt->Next = p2_rt;\n      p2_rt->Prev = p1_rt;\n      p2_lft->Next = p1_lft;\n      p1_lft->Prev = p2_lft;\n    } else\n    {\n      //a b c x y z\n      p1_rt->Next = p2_lft;\n      p2_lft->Prev = p1_rt;\n      p1_lft->Prev = p2_rt;\n      p2_rt->Next = p1_lft;\n    }\n  }\n\n  outRec1->BottomPt = 0;\n  if (holeStateRec == outRec2)\n  {\n    if (outRec2->FirstLeft != outRec1)\n      outRec1->FirstLeft = outRec2->FirstLeft;\n    outRec1->IsHole = outRec2->IsHole;\n  }\n  outRec2->Pts = 0;\n  outRec2->BottomPt = 0;\n  outRec2->FirstLeft = outRec1;\n\n  int OKIdx = e1->OutIdx;\n  int ObsoleteIdx = e2->OutIdx;\n\n  e1->OutIdx = Unassigned; //nb: safe because we only get here via AddLocalMaxPoly\n  e2->OutIdx = Unassigned;\n\n  TEdge* e = m_ActiveEdges;\n  while( e )\n  {\n    if( e->OutIdx == ObsoleteIdx )\n    {\n      e->OutIdx = OKIdx;\n      e->Side = e1->Side;\n      break;\n    }\n    e = e->NextInAEL;\n  }\n\n  outRec2->Idx = outRec1->Idx;\n}\n//------------------------------------------------------------------------------\n\nOutPt* Clipper::AddOutPt(TEdge *e, const IntPoint &pt)\n{\n  if(  e->OutIdx < 0 )\n  {\n    OutRec *outRec = CreateOutRec();\n    outRec->IsOpen = (e->WindDelta == 0);\n    OutPt* newOp = new OutPt;\n    outRec->Pts = newOp;\n    newOp->Idx = outRec->Idx;\n    newOp->Pt = pt;\n    newOp->Next = newOp;\n    newOp->Prev = newOp;\n    if (!outRec->IsOpen)\n      SetHoleState(e, outRec);\n    e->OutIdx = outRec->Idx;\n    return newOp;\n  } else\n  {\n    OutRec *outRec = m_PolyOuts[e->OutIdx];\n    //OutRec.Pts is 
the 'Left-most' point & OutRec.Pts.Prev is the 'Right-most'\n    OutPt* op = outRec->Pts;\n\n\tbool ToFront = (e->Side == esLeft);\n\tif (ToFront && (pt == op->Pt)) return op;\n    else if (!ToFront && (pt == op->Prev->Pt)) return op->Prev;\n\n    OutPt* newOp = new OutPt;\n    newOp->Idx = outRec->Idx;\n    newOp->Pt = pt;\n    newOp->Next = op;\n    newOp->Prev = op->Prev;\n    newOp->Prev->Next = newOp;\n    op->Prev = newOp;\n    if (ToFront) outRec->Pts = newOp;\n    return newOp;\n  }\n}\n//------------------------------------------------------------------------------\n\nOutPt* Clipper::GetLastOutPt(TEdge *e)\n{\n\tOutRec *outRec = m_PolyOuts[e->OutIdx];\n\tif (e->Side == esLeft)\n\t\treturn outRec->Pts;\n\telse\n\t\treturn outRec->Pts->Prev;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::ProcessHorizontals()\n{\n  TEdge* horzEdge;\n  while (PopEdgeFromSEL(horzEdge))\n    ProcessHorizontal(horzEdge);\n}\n//------------------------------------------------------------------------------\n\ninline bool IsMinima(TEdge *e)\n{\n  return e  && (e->Prev->NextInLML != e) && (e->Next->NextInLML != e);\n}\n//------------------------------------------------------------------------------\n\ninline bool IsMaxima(TEdge *e, const cInt Y)\n{\n  return e && e->Top.Y == Y && !e->NextInLML;\n}\n//------------------------------------------------------------------------------\n\ninline bool IsIntermediate(TEdge *e, const cInt Y)\n{\n  return e->Top.Y == Y && e->NextInLML;\n}\n//------------------------------------------------------------------------------\n\nTEdge *GetMaximaPair(TEdge *e)\n{\n  if ((e->Next->Top == e->Top) && !e->Next->NextInLML)\n    return e->Next;\n  else if ((e->Prev->Top == e->Top) && !e->Prev->NextInLML)\n    return e->Prev;\n  else return 0;\n}\n//------------------------------------------------------------------------------\n\nTEdge *GetMaximaPairEx(TEdge *e)\n{\n  //as GetMaximaPair() but returns 0 if 
MaxPair isn't in AEL (unless it's horizontal)\n  TEdge* result = GetMaximaPair(e);\n  if (result && (result->OutIdx == Skip ||\n    (result->NextInAEL == result->PrevInAEL && !IsHorizontal(*result)))) return 0;\n  return result;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::SwapPositionsInSEL(TEdge *Edge1, TEdge *Edge2)\n{\n  if(  !( Edge1->NextInSEL ) &&  !( Edge1->PrevInSEL ) ) return;\n  if(  !( Edge2->NextInSEL ) &&  !( Edge2->PrevInSEL ) ) return;\n\n  if(  Edge1->NextInSEL == Edge2 )\n  {\n    TEdge* Next = Edge2->NextInSEL;\n    if( Next ) Next->PrevInSEL = Edge1;\n    TEdge* Prev = Edge1->PrevInSEL;\n    if( Prev ) Prev->NextInSEL = Edge2;\n    Edge2->PrevInSEL = Prev;\n    Edge2->NextInSEL = Edge1;\n    Edge1->PrevInSEL = Edge2;\n    Edge1->NextInSEL = Next;\n  }\n  else if(  Edge2->NextInSEL == Edge1 )\n  {\n    TEdge* Next = Edge1->NextInSEL;\n    if( Next ) Next->PrevInSEL = Edge2;\n    TEdge* Prev = Edge2->PrevInSEL;\n    if( Prev ) Prev->NextInSEL = Edge1;\n    Edge1->PrevInSEL = Prev;\n    Edge1->NextInSEL = Edge2;\n    Edge2->PrevInSEL = Edge1;\n    Edge2->NextInSEL = Next;\n  }\n  else\n  {\n    TEdge* Next = Edge1->NextInSEL;\n    TEdge* Prev = Edge1->PrevInSEL;\n    Edge1->NextInSEL = Edge2->NextInSEL;\n    if( Edge1->NextInSEL ) Edge1->NextInSEL->PrevInSEL = Edge1;\n    Edge1->PrevInSEL = Edge2->PrevInSEL;\n    if( Edge1->PrevInSEL ) Edge1->PrevInSEL->NextInSEL = Edge1;\n    Edge2->NextInSEL = Next;\n    if( Edge2->NextInSEL ) Edge2->NextInSEL->PrevInSEL = Edge2;\n    Edge2->PrevInSEL = Prev;\n    if( Edge2->PrevInSEL ) Edge2->PrevInSEL->NextInSEL = Edge2;\n  }\n\n  if( !Edge1->PrevInSEL ) m_SortedEdges = Edge1;\n  else if( !Edge2->PrevInSEL ) m_SortedEdges = Edge2;\n}\n//------------------------------------------------------------------------------\n\nTEdge* GetNextInAEL(TEdge *e, Direction dir)\n{\n  return dir == dLeftToRight ? 
e->NextInAEL : e->PrevInAEL;\n}\n//------------------------------------------------------------------------------\n\nvoid GetHorzDirection(TEdge& HorzEdge, Direction& Dir, cInt& Left, cInt& Right)\n{\n  if (HorzEdge.Bot.X < HorzEdge.Top.X)\n  {\n    Left = HorzEdge.Bot.X;\n    Right = HorzEdge.Top.X;\n    Dir = dLeftToRight;\n  } else\n  {\n    Left = HorzEdge.Top.X;\n    Right = HorzEdge.Bot.X;\n    Dir = dRightToLeft;\n  }\n}\n//------------------------------------------------------------------------\n\n/*******************************************************************************\n* Notes: Horizontal edges (HEs) at scanline intersections (ie at the Top or    *\n* Bottom of a scanbeam) are processed as if layered. The order in which HEs    *\n* are processed doesn't matter. HEs intersect with other HE Bot.Xs only [#]    *\n* (or they could intersect with Top.Xs only, ie EITHER Bot.Xs OR Top.Xs),      *\n* and with other non-horizontal edges [*]. Once these intersections are        *\n* processed, intermediate HEs then 'promote' the Edge above (NextInLML) into   *\n* the AEL. These 'promoted' edges may in turn intersect [%] with other HEs.    
*\n*******************************************************************************/\n\nvoid Clipper::ProcessHorizontal(TEdge *horzEdge)\n{\n  Direction dir;\n  cInt horzLeft, horzRight;\n  bool IsOpen = (horzEdge->WindDelta == 0);\n\n  GetHorzDirection(*horzEdge, dir, horzLeft, horzRight);\n\n  TEdge* eLastHorz = horzEdge, *eMaxPair = 0;\n  while (eLastHorz->NextInLML && IsHorizontal(*eLastHorz->NextInLML)) \n    eLastHorz = eLastHorz->NextInLML;\n  if (!eLastHorz->NextInLML)\n    eMaxPair = GetMaximaPair(eLastHorz);\n\n  MaximaList::const_iterator maxIt;\n  MaximaList::const_reverse_iterator maxRit;\n  if (m_Maxima.size() > 0)\n  {\n      //get the first maxima in range (X) ...\n      if (dir == dLeftToRight)\n      {\n          maxIt = m_Maxima.begin();\n          while (maxIt != m_Maxima.end() && *maxIt <= horzEdge->Bot.X) maxIt++;\n          if (maxIt != m_Maxima.end() && *maxIt >= eLastHorz->Top.X)\n              maxIt = m_Maxima.end();\n      }\n      else\n      {\n          maxRit = m_Maxima.rbegin();\n          while (maxRit != m_Maxima.rend() && *maxRit > horzEdge->Bot.X) maxRit++;\n          if (maxRit != m_Maxima.rend() && *maxRit <= eLastHorz->Top.X)\n              maxRit = m_Maxima.rend();\n      }\n  }\n\n  OutPt* op1 = 0;\n\n  for (;;) //loop through consec. horizontal edges\n  {\n\t\t  \n    bool IsLastHorz = (horzEdge == eLastHorz);\n    TEdge* e = GetNextInAEL(horzEdge, dir);\n    while(e)\n    {\n\n        //this code block inserts extra coords into horizontal edges (in output\n        //polygons) wherever maxima touch these horizontal edges. 
This helps\n        //'simplifying' polygons (ie if the Simplify property is set).\n        if (m_Maxima.size() > 0)\n        {\n            if (dir == dLeftToRight)\n            {\n                while (maxIt != m_Maxima.end() && *maxIt < e->Curr.X) \n                {\n                  if (horzEdge->OutIdx >= 0 && !IsOpen)\n                    AddOutPt(horzEdge, IntPoint(*maxIt, horzEdge->Bot.Y));\n                  maxIt++;\n                }\n            }\n            else\n            {\n                while (maxRit != m_Maxima.rend() && *maxRit > e->Curr.X)\n                {\n                  if (horzEdge->OutIdx >= 0 && !IsOpen)\n                    AddOutPt(horzEdge, IntPoint(*maxRit, horzEdge->Bot.Y));\n                  maxRit++;\n                }\n            }\n        };\n\n        if ((dir == dLeftToRight && e->Curr.X > horzRight) ||\n\t\t\t(dir == dRightToLeft && e->Curr.X < horzLeft)) break;\n\n\t\t//Also break if we've got to the end of an intermediate horizontal edge ...\n\t\t//nb: Smaller Dx's are to the right of larger Dx's ABOVE the horizontal.\n\t\tif (e->Curr.X == horzEdge->Top.X && horzEdge->NextInLML && \n\t\t\te->Dx < horzEdge->NextInLML->Dx) break;\n\n    if (horzEdge->OutIdx >= 0 && !IsOpen)  //note: may be done multiple times\n\t\t{\n#ifdef use_xyz\n\t\t\tif (dir == dLeftToRight) SetZ(e->Curr, *horzEdge, *e);\n\t\t\telse SetZ(e->Curr, *e, *horzEdge);\n#endif      \n\t\t\top1 = AddOutPt(horzEdge, e->Curr);\n\t\t\tTEdge* eNextHorz = m_SortedEdges;\n\t\t\twhile (eNextHorz)\n\t\t\t{\n\t\t\t\tif (eNextHorz->OutIdx >= 0 &&\n\t\t\t\t\tHorzSegmentsOverlap(horzEdge->Bot.X,\n\t\t\t\t\thorzEdge->Top.X, eNextHorz->Bot.X, eNextHorz->Top.X))\n\t\t\t\t{\n                    OutPt* op2 = GetLastOutPt(eNextHorz);\n                    AddJoin(op2, op1, eNextHorz->Top);\n\t\t\t\t}\n\t\t\t\teNextHorz = eNextHorz->NextInSEL;\n\t\t\t}\n\t\t\tAddGhostJoin(op1, horzEdge->Bot);\n\t\t}\n\t\t\n\t\t//OK, so far we're still in range of the horizontal Edge  
but make sure\n        //we're at the last of consec. horizontals when matching with eMaxPair\n        if(e == eMaxPair && IsLastHorz)\n        {\n          if (horzEdge->OutIdx >= 0)\n            AddLocalMaxPoly(horzEdge, eMaxPair, horzEdge->Top);\n          DeleteFromAEL(horzEdge);\n          DeleteFromAEL(eMaxPair);\n          return;\n        }\n        \n\t\tif(dir == dLeftToRight)\n        {\n          IntPoint Pt = IntPoint(e->Curr.X, horzEdge->Curr.Y);\n          IntersectEdges(horzEdge, e, Pt);\n        }\n        else\n        {\n          IntPoint Pt = IntPoint(e->Curr.X, horzEdge->Curr.Y);\n          IntersectEdges( e, horzEdge, Pt);\n        }\n        TEdge* eNext = GetNextInAEL(e, dir);\n        SwapPositionsInAEL( horzEdge, e );\n        e = eNext;\n    } //end while(e)\n\n\t//Break out of loop if HorzEdge.NextInLML is not also horizontal ...\n\tif (!horzEdge->NextInLML || !IsHorizontal(*horzEdge->NextInLML)) break;\n\n\tUpdateEdgeIntoAEL(horzEdge);\n    if (horzEdge->OutIdx >= 0) AddOutPt(horzEdge, horzEdge->Bot);\n    GetHorzDirection(*horzEdge, dir, horzLeft, horzRight);\n\n  } //end for (;;)\n\n  if (horzEdge->OutIdx >= 0 && !op1)\n  {\n      op1 = GetLastOutPt(horzEdge);\n      TEdge* eNextHorz = m_SortedEdges;\n      while (eNextHorz)\n      {\n          if (eNextHorz->OutIdx >= 0 &&\n              HorzSegmentsOverlap(horzEdge->Bot.X,\n              horzEdge->Top.X, eNextHorz->Bot.X, eNextHorz->Top.X))\n          {\n              OutPt* op2 = GetLastOutPt(eNextHorz);\n              AddJoin(op2, op1, eNextHorz->Top);\n          }\n          eNextHorz = eNextHorz->NextInSEL;\n      }\n      AddGhostJoin(op1, horzEdge->Top);\n  }\n\n  if (horzEdge->NextInLML)\n  {\n    if(horzEdge->OutIdx >= 0)\n    {\n      op1 = AddOutPt( horzEdge, horzEdge->Top);\n      UpdateEdgeIntoAEL(horzEdge);\n      if (horzEdge->WindDelta == 0) return;\n      //nb: HorzEdge is no longer horizontal here\n      TEdge* ePrev = horzEdge->PrevInAEL;\n      TEdge* eNext = 
horzEdge->NextInAEL;\n      if (ePrev && ePrev->Curr.X == horzEdge->Bot.X &&\n        ePrev->Curr.Y == horzEdge->Bot.Y && ePrev->WindDelta != 0 &&\n        (ePrev->OutIdx >= 0 && ePrev->Curr.Y > ePrev->Top.Y &&\n        SlopesEqual(*horzEdge, *ePrev, m_UseFullRange)))\n      {\n        OutPt* op2 = AddOutPt(ePrev, horzEdge->Bot);\n        AddJoin(op1, op2, horzEdge->Top);\n      }\n      else if (eNext && eNext->Curr.X == horzEdge->Bot.X &&\n        eNext->Curr.Y == horzEdge->Bot.Y && eNext->WindDelta != 0 &&\n        eNext->OutIdx >= 0 && eNext->Curr.Y > eNext->Top.Y &&\n        SlopesEqual(*horzEdge, *eNext, m_UseFullRange))\n      {\n        OutPt* op2 = AddOutPt(eNext, horzEdge->Bot);\n        AddJoin(op1, op2, horzEdge->Top);\n      }\n    }\n    else\n      UpdateEdgeIntoAEL(horzEdge); \n  }\n  else\n  {\n    if (horzEdge->OutIdx >= 0) AddOutPt(horzEdge, horzEdge->Top);\n    DeleteFromAEL(horzEdge);\n  }\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::ProcessIntersections(const cInt topY)\n{\n  if( !m_ActiveEdges ) return true;\n  try {\n    BuildIntersectList(topY);\n    size_t IlSize = m_IntersectList.size();\n    if (IlSize == 0) return true;\n    if (IlSize == 1 || FixupIntersectionOrder()) ProcessIntersectList();\n    else return false;\n  }\n  catch(...) 
\n  {\n    m_SortedEdges = 0;\n    DisposeIntersectNodes();\n    throw clipperException(\"ProcessIntersections error\");\n  }\n  m_SortedEdges = 0;\n  return true;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::DisposeIntersectNodes()\n{\n  for (size_t i = 0; i < m_IntersectList.size(); ++i )\n    delete m_IntersectList[i];\n  m_IntersectList.clear();\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::BuildIntersectList(const cInt topY)\n{\n  if ( !m_ActiveEdges ) return;\n\n  //prepare for sorting ...\n  TEdge* e = m_ActiveEdges;\n  m_SortedEdges = e;\n  while( e )\n  {\n    e->PrevInSEL = e->PrevInAEL;\n    e->NextInSEL = e->NextInAEL;\n    e->Curr.X = TopX( *e, topY );\n    e = e->NextInAEL;\n  }\n\n  //bubblesort ...\n  bool isModified;\n  do\n  {\n    isModified = false;\n    e = m_SortedEdges;\n    while( e->NextInSEL )\n    {\n      TEdge *eNext = e->NextInSEL;\n      IntPoint Pt;\n      if(e->Curr.X > eNext->Curr.X)\n      {\n        IntersectPoint(*e, *eNext, Pt);\n        if (Pt.Y < topY) Pt = IntPoint(TopX(*e, topY), topY);\n        IntersectNode * newNode = new IntersectNode;\n        newNode->Edge1 = e;\n        newNode->Edge2 = eNext;\n        newNode->Pt = Pt;\n        m_IntersectList.push_back(newNode);\n\n        SwapPositionsInSEL(e, eNext);\n        isModified = true;\n      }\n      else\n        e = eNext;\n    }\n    if( e->PrevInSEL ) e->PrevInSEL->NextInSEL = 0;\n    else break;\n  }\n  while ( isModified );\n  m_SortedEdges = 0; //important\n}\n//------------------------------------------------------------------------------\n\n\nvoid Clipper::ProcessIntersectList()\n{\n  for (size_t i = 0; i < m_IntersectList.size(); ++i)\n  {\n    IntersectNode* iNode = m_IntersectList[i];\n    {\n      IntersectEdges( iNode->Edge1, iNode->Edge2, iNode->Pt);\n      SwapPositionsInAEL( iNode->Edge1 , iNode->Edge2 );\n    }\n    delete iNode;\n  }\n  
m_IntersectList.clear();\n}\n//------------------------------------------------------------------------------\n\nbool IntersectListSort(IntersectNode* node1, IntersectNode* node2)\n{\n  return node2->Pt.Y < node1->Pt.Y;\n}\n//------------------------------------------------------------------------------\n\ninline bool EdgesAdjacent(const IntersectNode &inode)\n{\n  return (inode.Edge1->NextInSEL == inode.Edge2) ||\n    (inode.Edge1->PrevInSEL == inode.Edge2);\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::FixupIntersectionOrder()\n{\n  //pre-condition: intersections are sorted Bottom-most first.\n  //Now it's crucial that intersections are made only between adjacent edges,\n  //so to ensure this the order of intersections may need adjusting ...\n  CopyAELToSEL();\n  std::sort(m_IntersectList.begin(), m_IntersectList.end(), IntersectListSort);\n  size_t cnt = m_IntersectList.size();\n  for (size_t i = 0; i < cnt; ++i) \n  {\n    if (!EdgesAdjacent(*m_IntersectList[i]))\n    {\n      size_t j = i + 1;\n      while (j < cnt && !EdgesAdjacent(*m_IntersectList[j])) j++;\n      if (j == cnt)  return false;\n      std::swap(m_IntersectList[i], m_IntersectList[j]);\n    }\n    SwapPositionsInSEL(m_IntersectList[i]->Edge1, m_IntersectList[i]->Edge2);\n  }\n  return true;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::DoMaxima(TEdge *e)\n{\n  TEdge* eMaxPair = GetMaximaPairEx(e);\n  if (!eMaxPair)\n  {\n    if (e->OutIdx >= 0)\n      AddOutPt(e, e->Top);\n    DeleteFromAEL(e);\n    return;\n  }\n\n  TEdge* eNext = e->NextInAEL;\n  while(eNext && eNext != eMaxPair)\n  {\n    IntersectEdges(e, eNext, e->Top);\n    SwapPositionsInAEL(e, eNext);\n    eNext = e->NextInAEL;\n  }\n\n  if(e->OutIdx == Unassigned && eMaxPair->OutIdx == Unassigned)\n  {\n    DeleteFromAEL(e);\n    DeleteFromAEL(eMaxPair);\n  }\n  else if( e->OutIdx >= 0 && eMaxPair->OutIdx >= 0 )\n  {\n    if 
(e->OutIdx >= 0) AddLocalMaxPoly(e, eMaxPair, e->Top);\n    DeleteFromAEL(e);\n    DeleteFromAEL(eMaxPair);\n  }\n#ifdef use_lines\n  else if (e->WindDelta == 0)\n  {\n    if (e->OutIdx >= 0) \n    {\n      AddOutPt(e, e->Top);\n      e->OutIdx = Unassigned;\n    }\n    DeleteFromAEL(e);\n\n    if (eMaxPair->OutIdx >= 0)\n    {\n      AddOutPt(eMaxPair, e->Top);\n      eMaxPair->OutIdx = Unassigned;\n    }\n    DeleteFromAEL(eMaxPair);\n  } \n#endif\n  else throw clipperException(\"DoMaxima error\");\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::ProcessEdgesAtTopOfScanbeam(const cInt topY)\n{\n  TEdge* e = m_ActiveEdges;\n  while( e )\n  {\n    //1. process maxima, treating them as if they're 'bent' horizontal edges,\n    //   but exclude maxima with horizontal edges. nb: e can't be a horizontal.\n    bool IsMaximaEdge = IsMaxima(e, topY);\n\n    if(IsMaximaEdge)\n    {\n      TEdge* eMaxPair = GetMaximaPairEx(e);\n      IsMaximaEdge = (!eMaxPair || !IsHorizontal(*eMaxPair));\n    }\n\n    if(IsMaximaEdge)\n    {\n      if (m_StrictSimple) m_Maxima.push_back(e->Top.X);\n      TEdge* ePrev = e->PrevInAEL;\n      DoMaxima(e);\n      if( !ePrev ) e = m_ActiveEdges;\n      else e = ePrev->NextInAEL;\n    }\n    else\n    {\n      //2. promote horizontal edges, otherwise update Curr.X and Curr.Y ...\n      if (IsIntermediate(e, topY) && IsHorizontal(*e->NextInLML))\n      {\n        UpdateEdgeIntoAEL(e);\n        if (e->OutIdx >= 0)\n          AddOutPt(e, e->Bot);\n        AddEdgeToSEL(e);\n      } \n      else\n      {\n        e->Curr.X = TopX( *e, topY );\n        e->Curr.Y = topY;\n#ifdef use_xyz\n\t\te->Curr.Z = topY == e->Top.Y ? e->Top.Z : (topY == e->Bot.Y ? 
e->Bot.Z : 0);\n#endif\n\t  }\n\n      //When StrictlySimple and 'e' is being touched by another edge, then\n      //make sure both edges have a vertex here ...\n      if (m_StrictSimple)\n      {  \n        TEdge* ePrev = e->PrevInAEL;\n        if ((e->OutIdx >= 0) && (e->WindDelta != 0) && ePrev && (ePrev->OutIdx >= 0) &&\n          (ePrev->Curr.X == e->Curr.X) && (ePrev->WindDelta != 0))\n        {\n          IntPoint pt = e->Curr;\n#ifdef use_xyz\n          SetZ(pt, *ePrev, *e);\n#endif\n          OutPt* op = AddOutPt(ePrev, pt);\n          OutPt* op2 = AddOutPt(e, pt);\n          AddJoin(op, op2, pt); //StrictlySimple (type-3) join\n        }\n      }\n\n      e = e->NextInAEL;\n    }\n  }\n\n  //3. Process horizontals at the Top of the scanbeam ...\n  m_Maxima.sort();\n  ProcessHorizontals();\n  m_Maxima.clear();\n\n  //4. Promote intermediate vertices ...\n  e = m_ActiveEdges;\n  while(e)\n  {\n    if(IsIntermediate(e, topY))\n    {\n      OutPt* op = 0;\n      if( e->OutIdx >= 0 ) \n        op = AddOutPt(e, e->Top);\n      UpdateEdgeIntoAEL(e);\n\n      //if output polygons share an edge, they'll need joining later ...\n      TEdge* ePrev = e->PrevInAEL;\n      TEdge* eNext = e->NextInAEL;\n      if (ePrev && ePrev->Curr.X == e->Bot.X &&\n        ePrev->Curr.Y == e->Bot.Y && op &&\n        ePrev->OutIdx >= 0 && ePrev->Curr.Y > ePrev->Top.Y &&\n        SlopesEqual(e->Curr, e->Top, ePrev->Curr, ePrev->Top, m_UseFullRange) &&\n        (e->WindDelta != 0) && (ePrev->WindDelta != 0))\n      {\n        OutPt* op2 = AddOutPt(ePrev, e->Bot);\n        AddJoin(op, op2, e->Top);\n      }\n      else if (eNext && eNext->Curr.X == e->Bot.X &&\n        eNext->Curr.Y == e->Bot.Y && op &&\n        eNext->OutIdx >= 0 && eNext->Curr.Y > eNext->Top.Y &&\n        SlopesEqual(e->Curr, e->Top, eNext->Curr, eNext->Top, m_UseFullRange) &&\n        (e->WindDelta != 0) && (eNext->WindDelta != 0))\n      {\n        OutPt* op2 = AddOutPt(eNext, e->Bot);\n        AddJoin(op, op2, 
e->Top);\n      }\n    }\n    e = e->NextInAEL;\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::FixupOutPolyline(OutRec &outrec)\n{\n  OutPt *pp = outrec.Pts;\n  OutPt *lastPP = pp->Prev;\n  while (pp != lastPP)\n  {\n    pp = pp->Next;\n    if (pp->Pt == pp->Prev->Pt)\n    {\n      if (pp == lastPP) lastPP = pp->Prev;\n      OutPt *tmpPP = pp->Prev;\n      tmpPP->Next = pp->Next;\n      pp->Next->Prev = tmpPP;\n      delete pp;\n      pp = tmpPP;\n    }\n  }\n\n  if (pp == pp->Prev)\n  {\n    DisposeOutPts(pp);\n    outrec.Pts = 0;\n    return;\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::FixupOutPolygon(OutRec &outrec)\n{\n    //FixupOutPolygon() - removes duplicate points and simplifies consecutive\n    //parallel edges by removing the middle vertex.\n    OutPt *lastOK = 0;\n    outrec.BottomPt = 0;\n    OutPt *pp = outrec.Pts;\n    bool preserveCol = m_PreserveCollinear || m_StrictSimple;\n\n    for (;;)\n    {\n        if (pp->Prev == pp || pp->Prev == pp->Next)\n        {\n            DisposeOutPts(pp);\n            outrec.Pts = 0;\n            return;\n        }\n\n        //test for duplicate points and collinear edges ...\n        if ((pp->Pt == pp->Next->Pt) || (pp->Pt == pp->Prev->Pt) ||\n            (SlopesEqual(pp->Prev->Pt, pp->Pt, pp->Next->Pt, m_UseFullRange) &&\n            (!preserveCol || !Pt2IsBetweenPt1AndPt3(pp->Prev->Pt, pp->Pt, pp->Next->Pt))))\n        {\n            lastOK = 0;\n            OutPt *tmp = pp;\n            pp->Prev->Next = pp->Next;\n            pp->Next->Prev = pp->Prev;\n            pp = pp->Prev;\n            delete tmp;\n        }\n        else if (pp == lastOK) break;\n        else\n        {\n            if (!lastOK) lastOK = pp;\n            pp = pp->Next;\n        }\n    }\n    outrec.Pts = pp;\n}\n//------------------------------------------------------------------------------\n\nint 
PointCount(OutPt *Pts)\n{\n    if (!Pts) return 0;\n    int result = 0;\n    OutPt* p = Pts;\n    do\n    {\n        result++;\n        p = p->Next;\n    }\n    while (p != Pts);\n    return result;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::BuildResult(Paths &polys)\n{\n  polys.reserve(m_PolyOuts.size());\n  for (PolyOutList::size_type i = 0; i < m_PolyOuts.size(); ++i)\n  {\n    if (!m_PolyOuts[i]->Pts) continue;\n    Path pg;\n    OutPt* p = m_PolyOuts[i]->Pts->Prev;\n    int cnt = PointCount(p);\n    if (cnt < 2) continue;\n    pg.reserve(cnt);\n    for (int i = 0; i < cnt; ++i)\n    {\n      pg.push_back(p->Pt);\n      p = p->Prev;\n    }\n    polys.push_back(pg);\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::BuildResult2(PolyTree& polytree)\n{\n    polytree.Clear();\n    polytree.AllNodes.reserve(m_PolyOuts.size());\n    //add each output polygon/contour to polytree ...\n    for (PolyOutList::size_type i = 0; i < m_PolyOuts.size(); i++)\n    {\n        OutRec* outRec = m_PolyOuts[i];\n        int cnt = PointCount(outRec->Pts);\n        if ((outRec->IsOpen && cnt < 2) || (!outRec->IsOpen && cnt < 3)) continue;\n        FixHoleLinkage(*outRec);\n        PolyNode* pn = new PolyNode();\n        //nb: polytree takes ownership of all the PolyNodes\n        polytree.AllNodes.push_back(pn);\n        outRec->PolyNd = pn;\n        pn->Parent = 0;\n        pn->Index = 0;\n        pn->Contour.reserve(cnt);\n        OutPt *op = outRec->Pts->Prev;\n        for (int j = 0; j < cnt; j++)\n        {\n            pn->Contour.push_back(op->Pt);\n            op = op->Prev;\n        }\n    }\n\n    //fixup PolyNode links etc ...\n    polytree.Childs.reserve(m_PolyOuts.size());\n    for (PolyOutList::size_type i = 0; i < m_PolyOuts.size(); i++)\n    {\n        OutRec* outRec = m_PolyOuts[i];\n        if (!outRec->PolyNd) continue;\n        if (outRec->IsOpen) \n  
      {\n          outRec->PolyNd->m_IsOpen = true;\n          polytree.AddChild(*outRec->PolyNd);\n        }\n        else if (outRec->FirstLeft && outRec->FirstLeft->PolyNd) \n          outRec->FirstLeft->PolyNd->AddChild(*outRec->PolyNd);\n        else\n          polytree.AddChild(*outRec->PolyNd);\n    }\n}\n//------------------------------------------------------------------------------\n\nvoid SwapIntersectNodes(IntersectNode &int1, IntersectNode &int2)\n{\n  //just swap the contents (because fIntersectNodes is a single-linked-list)\n  IntersectNode inode = int1; //gets a copy of Int1\n  int1.Edge1 = int2.Edge1;\n  int1.Edge2 = int2.Edge2;\n  int1.Pt = int2.Pt;\n  int2.Edge1 = inode.Edge1;\n  int2.Edge2 = inode.Edge2;\n  int2.Pt = inode.Pt;\n}\n//------------------------------------------------------------------------------\n\ninline bool E2InsertsBeforeE1(TEdge &e1, TEdge &e2)\n{\n  if (e2.Curr.X == e1.Curr.X) \n  {\n    if (e2.Top.Y > e1.Top.Y)\n      return e2.Top.X < TopX(e1, e2.Top.Y); \n      else return e1.Top.X > TopX(e2, e1.Top.Y);\n  } \n  else return e2.Curr.X < e1.Curr.X;\n}\n//------------------------------------------------------------------------------\n\nbool GetOverlap(const cInt a1, const cInt a2, const cInt b1, const cInt b2, \n    cInt& Left, cInt& Right)\n{\n  if (a1 < a2)\n  {\n    if (b1 < b2) {Left = std::max(a1,b1); Right = std::min(a2,b2);}\n    else {Left = std::max(a1,b2); Right = std::min(a2,b1);}\n  } \n  else\n  {\n    if (b1 < b2) {Left = std::max(a2,b1); Right = std::min(a1,b2);}\n    else {Left = std::max(a2,b2); Right = std::min(a1,b1);}\n  }\n  return Left < Right;\n}\n//------------------------------------------------------------------------------\n\ninline void UpdateOutPtIdxs(OutRec& outrec)\n{  \n  OutPt* op = outrec.Pts;\n  do\n  {\n    op->Idx = outrec.Idx;\n    op = op->Prev;\n  }\n  while(op != outrec.Pts);\n}\n//------------------------------------------------------------------------------\n\nvoid 
Clipper::InsertEdgeIntoAEL(TEdge *edge, TEdge* startEdge)\n{\n  if(!m_ActiveEdges)\n  {\n    edge->PrevInAEL = 0;\n    edge->NextInAEL = 0;\n    m_ActiveEdges = edge;\n  }\n  else if(!startEdge && E2InsertsBeforeE1(*m_ActiveEdges, *edge))\n  {\n      edge->PrevInAEL = 0;\n      edge->NextInAEL = m_ActiveEdges;\n      m_ActiveEdges->PrevInAEL = edge;\n      m_ActiveEdges = edge;\n  } \n  else\n  {\n    if(!startEdge) startEdge = m_ActiveEdges;\n    while(startEdge->NextInAEL  && \n      !E2InsertsBeforeE1(*startEdge->NextInAEL , *edge))\n        startEdge = startEdge->NextInAEL;\n    edge->NextInAEL = startEdge->NextInAEL;\n    if(startEdge->NextInAEL) startEdge->NextInAEL->PrevInAEL = edge;\n    edge->PrevInAEL = startEdge;\n    startEdge->NextInAEL = edge;\n  }\n}\n//----------------------------------------------------------------------\n\nOutPt* DupOutPt(OutPt* outPt, bool InsertAfter)\n{\n  OutPt* result = new OutPt;\n  result->Pt = outPt->Pt;\n  result->Idx = outPt->Idx;\n  if (InsertAfter)\n  {\n    result->Next = outPt->Next;\n    result->Prev = outPt;\n    outPt->Next->Prev = result;\n    outPt->Next = result;\n  } \n  else\n  {\n    result->Prev = outPt->Prev;\n    result->Next = outPt;\n    outPt->Prev->Next = result;\n    outPt->Prev = result;\n  }\n  return result;\n}\n//------------------------------------------------------------------------------\n\nbool JoinHorz(OutPt* op1, OutPt* op1b, OutPt* op2, OutPt* op2b,\n  const IntPoint Pt, bool DiscardLeft)\n{\n  Direction Dir1 = (op1->Pt.X > op1b->Pt.X ? dRightToLeft : dLeftToRight);\n  Direction Dir2 = (op2->Pt.X > op2b->Pt.X ? dRightToLeft : dLeftToRight);\n  if (Dir1 == Dir2) return false;\n\n  //When DiscardLeft, we want Op1b to be on the Left of Op1, otherwise we\n  //want Op1b to be on the Right. 
(And likewise with Op2 and Op2b.)\n  //So, to facilitate this while inserting Op1b and Op2b ...\n  //when DiscardLeft, make sure we're AT or RIGHT of Pt before adding Op1b,\n  //otherwise make sure we're AT or LEFT of Pt. (Likewise with Op2b.)\n  if (Dir1 == dLeftToRight) \n  {\n    while (op1->Next->Pt.X <= Pt.X && \n      op1->Next->Pt.X >= op1->Pt.X && op1->Next->Pt.Y == Pt.Y)  \n        op1 = op1->Next;\n    if (DiscardLeft && (op1->Pt.X != Pt.X)) op1 = op1->Next;\n    op1b = DupOutPt(op1, !DiscardLeft);\n    if (op1b->Pt != Pt) \n    {\n      op1 = op1b;\n      op1->Pt = Pt;\n      op1b = DupOutPt(op1, !DiscardLeft);\n    }\n  } \n  else\n  {\n    while (op1->Next->Pt.X >= Pt.X && \n      op1->Next->Pt.X <= op1->Pt.X && op1->Next->Pt.Y == Pt.Y) \n        op1 = op1->Next;\n    if (!DiscardLeft && (op1->Pt.X != Pt.X)) op1 = op1->Next;\n    op1b = DupOutPt(op1, DiscardLeft);\n    if (op1b->Pt != Pt)\n    {\n      op1 = op1b;\n      op1->Pt = Pt;\n      op1b = DupOutPt(op1, DiscardLeft);\n    }\n  }\n\n  if (Dir2 == dLeftToRight)\n  {\n    while (op2->Next->Pt.X <= Pt.X && \n      op2->Next->Pt.X >= op2->Pt.X && op2->Next->Pt.Y == Pt.Y)\n        op2 = op2->Next;\n    if (DiscardLeft && (op2->Pt.X != Pt.X)) op2 = op2->Next;\n    op2b = DupOutPt(op2, !DiscardLeft);\n    if (op2b->Pt != Pt)\n    {\n      op2 = op2b;\n      op2->Pt = Pt;\n      op2b = DupOutPt(op2, !DiscardLeft);\n    };\n  } else\n  {\n    while (op2->Next->Pt.X >= Pt.X && \n      op2->Next->Pt.X <= op2->Pt.X && op2->Next->Pt.Y == Pt.Y) \n        op2 = op2->Next;\n    if (!DiscardLeft && (op2->Pt.X != Pt.X)) op2 = op2->Next;\n    op2b = DupOutPt(op2, DiscardLeft);\n    if (op2b->Pt != Pt)\n    {\n      op2 = op2b;\n      op2->Pt = Pt;\n      op2b = DupOutPt(op2, DiscardLeft);\n    };\n  };\n\n  if ((Dir1 == dLeftToRight) == DiscardLeft)\n  {\n    op1->Prev = op2;\n    op2->Next = op1;\n    op1b->Next = op2b;\n    op2b->Prev = op1b;\n  }\n  else\n  {\n    op1->Next = op2;\n    op2->Prev = op1;\n    
op1b->Prev = op2b;\n    op2b->Next = op1b;\n  }\n  return true;\n}\n//------------------------------------------------------------------------------\n\nbool Clipper::JoinPoints(Join *j, OutRec* outRec1, OutRec* outRec2)\n{\n  OutPt *op1 = j->OutPt1, *op1b;\n  OutPt *op2 = j->OutPt2, *op2b;\n\n  //There are 3 kinds of joins for output polygons ...\n  //1. Horizontal joins where Join.OutPt1 & Join.OutPt2 are vertices anywhere\n  //along (horizontal) collinear edges (& Join.OffPt is on the same horizontal).\n  //2. Non-horizontal joins where Join.OutPt1 & Join.OutPt2 are at the same\n  //location at the Bottom of the overlapping segment (& Join.OffPt is above).\n  //3. StrictSimple joins where edges touch but are not collinear and where\n  //Join.OutPt1, Join.OutPt2 & Join.OffPt all share the same point.\n  bool isHorizontal = (j->OutPt1->Pt.Y == j->OffPt.Y);\n\n  if (isHorizontal  && (j->OffPt == j->OutPt1->Pt) &&\n  (j->OffPt == j->OutPt2->Pt))\n  {\n    //Strictly Simple join ...\n    if (outRec1 != outRec2) return false;\n    op1b = j->OutPt1->Next;\n    while (op1b != op1 && (op1b->Pt == j->OffPt)) \n      op1b = op1b->Next;\n    bool reverse1 = (op1b->Pt.Y > j->OffPt.Y);\n    op2b = j->OutPt2->Next;\n    while (op2b != op2 && (op2b->Pt == j->OffPt)) \n      op2b = op2b->Next;\n    bool reverse2 = (op2b->Pt.Y > j->OffPt.Y);\n    if (reverse1 == reverse2) return false;\n    if (reverse1)\n    {\n      op1b = DupOutPt(op1, false);\n      op2b = DupOutPt(op2, true);\n      op1->Prev = op2;\n      op2->Next = op1;\n      op1b->Next = op2b;\n      op2b->Prev = op1b;\n      j->OutPt1 = op1;\n      j->OutPt2 = op1b;\n      return true;\n    } else\n    {\n      op1b = DupOutPt(op1, true);\n      op2b = DupOutPt(op2, false);\n      op1->Next = op2;\n      op2->Prev = op1;\n      op1b->Prev = op2b;\n      op2b->Next = op1b;\n      j->OutPt1 = op1;\n      j->OutPt2 = op1b;\n      return true;\n    }\n  } \n  else if (isHorizontal)\n  {\n    //treat horizontal joins 
differently to non-horizontal joins since with\n    //them we're not yet sure where the overlapping is. OutPt1.Pt & OutPt2.Pt\n    //may be anywhere along the horizontal edge.\n    op1b = op1;\n    while (op1->Prev->Pt.Y == op1->Pt.Y && op1->Prev != op1b && op1->Prev != op2)\n      op1 = op1->Prev;\n    while (op1b->Next->Pt.Y == op1b->Pt.Y && op1b->Next != op1 && op1b->Next != op2)\n      op1b = op1b->Next;\n    if (op1b->Next == op1 || op1b->Next == op2) return false; //a flat 'polygon'\n\n    op2b = op2;\n    while (op2->Prev->Pt.Y == op2->Pt.Y && op2->Prev != op2b && op2->Prev != op1b)\n      op2 = op2->Prev;\n    while (op2b->Next->Pt.Y == op2b->Pt.Y && op2b->Next != op2 && op2b->Next != op1)\n      op2b = op2b->Next;\n    if (op2b->Next == op2 || op2b->Next == op1) return false; //a flat 'polygon'\n\n    cInt Left, Right;\n    //Op1 --> Op1b & Op2 --> Op2b are the extremities of the horizontal edges\n    if (!GetOverlap(op1->Pt.X, op1b->Pt.X, op2->Pt.X, op2b->Pt.X, Left, Right))\n      return false;\n\n    //DiscardLeftSide: when overlapping edges are joined, a spike will be created\n    //which needs to be cleaned up. However, we don't want Op1 or Op2 caught up\n    //on the discard Side as either may still be needed for other joins ...\n    IntPoint Pt;\n    bool DiscardLeftSide;\n    if (op1->Pt.X >= Left && op1->Pt.X <= Right) \n    {\n      Pt = op1->Pt; DiscardLeftSide = (op1->Pt.X > op1b->Pt.X);\n    } \n    else if (op2->Pt.X >= Left&& op2->Pt.X <= Right) \n    {\n      Pt = op2->Pt; DiscardLeftSide = (op2->Pt.X > op2b->Pt.X);\n    } \n    else if (op1b->Pt.X >= Left && op1b->Pt.X <= Right)\n    {\n      Pt = op1b->Pt; DiscardLeftSide = op1b->Pt.X > op1->Pt.X;\n    } \n    else\n    {\n      Pt = op2b->Pt; DiscardLeftSide = (op2b->Pt.X > op2->Pt.X);\n    }\n    j->OutPt1 = op1; j->OutPt2 = op2;\n    return JoinHorz(op1, op1b, op2, op2b, Pt, DiscardLeftSide);\n  } else\n  {\n    //nb: For non-horizontal joins ...\n    //    1. 
Jr.OutPt1.Pt.Y == Jr.OutPt2.Pt.Y\n    //    2. Jr.OutPt1.Pt.Y > Jr.OffPt.Y\n\n    //make sure the polygons are correctly oriented ...\n    op1b = op1->Next;\n    while ((op1b->Pt == op1->Pt) && (op1b != op1)) op1b = op1b->Next;\n    bool Reverse1 = ((op1b->Pt.Y > op1->Pt.Y) ||\n      !SlopesEqual(op1->Pt, op1b->Pt, j->OffPt, m_UseFullRange));\n    if (Reverse1)\n    {\n      op1b = op1->Prev;\n      while ((op1b->Pt == op1->Pt) && (op1b != op1)) op1b = op1b->Prev;\n      if ((op1b->Pt.Y > op1->Pt.Y) ||\n        !SlopesEqual(op1->Pt, op1b->Pt, j->OffPt, m_UseFullRange)) return false;\n    };\n    op2b = op2->Next;\n    while ((op2b->Pt == op2->Pt) && (op2b != op2))op2b = op2b->Next;\n    bool Reverse2 = ((op2b->Pt.Y > op2->Pt.Y) ||\n      !SlopesEqual(op2->Pt, op2b->Pt, j->OffPt, m_UseFullRange));\n    if (Reverse2)\n    {\n      op2b = op2->Prev;\n      while ((op2b->Pt == op2->Pt) && (op2b != op2)) op2b = op2b->Prev;\n      if ((op2b->Pt.Y > op2->Pt.Y) ||\n        !SlopesEqual(op2->Pt, op2b->Pt, j->OffPt, m_UseFullRange)) return false;\n    }\n\n    if ((op1b == op1) || (op2b == op2) || (op1b == op2b) ||\n      ((outRec1 == outRec2) && (Reverse1 == Reverse2))) return false;\n\n    if (Reverse1)\n    {\n      op1b = DupOutPt(op1, false);\n      op2b = DupOutPt(op2, true);\n      op1->Prev = op2;\n      op2->Next = op1;\n      op1b->Next = op2b;\n      op2b->Prev = op1b;\n      j->OutPt1 = op1;\n      j->OutPt2 = op1b;\n      return true;\n    } else\n    {\n      op1b = DupOutPt(op1, true);\n      op2b = DupOutPt(op2, false);\n      op1->Next = op2;\n      op2->Prev = op1;\n      op1b->Prev = op2b;\n      op2b->Next = op1b;\n      j->OutPt1 = op1;\n      j->OutPt2 = op1b;\n      return true;\n    }\n  }\n}\n//----------------------------------------------------------------------\n\nstatic OutRec* ParseFirstLeft(OutRec* FirstLeft)\n{\n  while (FirstLeft && !FirstLeft->Pts)\n    FirstLeft = FirstLeft->FirstLeft;\n  return 
FirstLeft;\n}\n//------------------------------------------------------------------------------\n\nvoid Clipper::FixupFirstLefts1(OutRec* OldOutRec, OutRec* NewOutRec)\n{ \n  //tests if NewOutRec contains the polygon before reassigning FirstLeft\n  for (PolyOutList::size_type i = 0; i < m_PolyOuts.size(); ++i)\n  {\n    OutRec* outRec = m_PolyOuts[i];\n    OutRec* firstLeft = ParseFirstLeft(outRec->FirstLeft);\n    if (outRec->Pts  && firstLeft == OldOutRec)\n    {\n      if (Poly2ContainsPoly1(outRec->Pts, NewOutRec->Pts))\n        outRec->FirstLeft = NewOutRec;\n    }\n  }\n}\n//----------------------------------------------------------------------\n\nvoid Clipper::FixupFirstLefts2(OutRec* InnerOutRec, OutRec* OuterOutRec)\n{\n  //A polygon has split into two such that one is now the inner of the other.\n  //It's possible that these polygons now wrap around other polygons, so check\n  //every polygon that's also contained by OuterOutRec's FirstLeft container\n  //(including 0) to see if they've become inner to the new inner polygon ...\n  OutRec* orfl = OuterOutRec->FirstLeft;\n  for (PolyOutList::size_type i = 0; i < m_PolyOuts.size(); ++i)\n  {\n    OutRec* outRec = m_PolyOuts[i];\n\n    if (!outRec->Pts || outRec == OuterOutRec || outRec == InnerOutRec)\n      continue;\n    OutRec* firstLeft = ParseFirstLeft(outRec->FirstLeft);\n    if (firstLeft != orfl && firstLeft != InnerOutRec && firstLeft != OuterOutRec)\n      continue;\n    if (Poly2ContainsPoly1(outRec->Pts, InnerOutRec->Pts))\n      outRec->FirstLeft = InnerOutRec;\n    else if (Poly2ContainsPoly1(outRec->Pts, OuterOutRec->Pts))\n      outRec->FirstLeft = OuterOutRec;\n    else if (outRec->FirstLeft == InnerOutRec || outRec->FirstLeft == OuterOutRec)\n      outRec->FirstLeft = orfl;\n  }\n}\n//----------------------------------------------------------------------\nvoid Clipper::FixupFirstLefts3(OutRec* OldOutRec, OutRec* NewOutRec)\n{\n  //reassigns FirstLeft WITHOUT testing if NewOutRec contains 
the polygon\n  for (PolyOutList::size_type i = 0; i < m_PolyOuts.size(); ++i)\n  {\n    OutRec* outRec = m_PolyOuts[i];\n    OutRec* firstLeft = ParseFirstLeft(outRec->FirstLeft);\n    if (outRec->Pts && firstLeft == OldOutRec)\n      outRec->FirstLeft = NewOutRec;\n  }\n}\n//----------------------------------------------------------------------\n\nvoid Clipper::JoinCommonEdges()\n{\n  for (JoinList::size_type i = 0; i < m_Joins.size(); i++)\n  {\n    Join* join = m_Joins[i];\n\n    OutRec *outRec1 = GetOutRec(join->OutPt1->Idx);\n    OutRec *outRec2 = GetOutRec(join->OutPt2->Idx);\n\n    if (!outRec1->Pts || !outRec2->Pts) continue;\n    if (outRec1->IsOpen || outRec2->IsOpen) continue;\n\n    //get the polygon fragment with the correct hole state (FirstLeft)\n    //before calling JoinPoints() ...\n    OutRec *holeStateRec;\n    if (outRec1 == outRec2) holeStateRec = outRec1;\n    else if (OutRec1RightOfOutRec2(outRec1, outRec2)) holeStateRec = outRec2;\n    else if (OutRec1RightOfOutRec2(outRec2, outRec1)) holeStateRec = outRec1;\n    else holeStateRec = GetLowermostRec(outRec1, outRec2);\n\n    if (!JoinPoints(join, outRec1, outRec2)) continue;\n\n    if (outRec1 == outRec2)\n    {\n      //instead of joining two polygons, we've just created a new one by\n      //splitting one polygon into two.\n      outRec1->Pts = join->OutPt1;\n      outRec1->BottomPt = 0;\n      outRec2 = CreateOutRec();\n      outRec2->Pts = join->OutPt2;\n\n      //update all OutRec2.Pts Idx's ...\n      UpdateOutPtIdxs(*outRec2);\n\n      if (Poly2ContainsPoly1(outRec2->Pts, outRec1->Pts))\n      {\n        //outRec1 contains outRec2 ...\n        outRec2->IsHole = !outRec1->IsHole;\n        outRec2->FirstLeft = outRec1;\n\n        if (m_UsingPolyTree) FixupFirstLefts2(outRec2, outRec1);\n\n        if ((outRec2->IsHole ^ m_ReverseOutput) == (Area(*outRec2) > 0))\n          ReversePolyPtLinks(outRec2->Pts);\n            \n      } else if (Poly2ContainsPoly1(outRec1->Pts, outRec2->Pts))\n    
  {\n        //outRec2 contains outRec1 ...\n        outRec2->IsHole = outRec1->IsHole;\n        outRec1->IsHole = !outRec2->IsHole;\n        outRec2->FirstLeft = outRec1->FirstLeft;\n        outRec1->FirstLeft = outRec2;\n\n        if (m_UsingPolyTree) FixupFirstLefts2(outRec1, outRec2);\n\n        if ((outRec1->IsHole ^ m_ReverseOutput) == (Area(*outRec1) > 0))\n          ReversePolyPtLinks(outRec1->Pts);\n      } \n      else\n      {\n        //the 2 polygons are completely separate ...\n        outRec2->IsHole = outRec1->IsHole;\n        outRec2->FirstLeft = outRec1->FirstLeft;\n\n        //fixup FirstLeft pointers that may need reassigning to OutRec2\n        if (m_UsingPolyTree) FixupFirstLefts1(outRec1, outRec2);\n      }\n     \n    } else\n    {\n      //joined 2 polygons together ...\n\n      outRec2->Pts = 0;\n      outRec2->BottomPt = 0;\n      outRec2->Idx = outRec1->Idx;\n\n      outRec1->IsHole = holeStateRec->IsHole;\n      if (holeStateRec == outRec2) \n        outRec1->FirstLeft = outRec2->FirstLeft;\n      outRec2->FirstLeft = outRec1;\n\n      if (m_UsingPolyTree) FixupFirstLefts3(outRec2, outRec1);\n    }\n  }\n}\n\n//------------------------------------------------------------------------------\n// ClipperOffset support functions ...\n//------------------------------------------------------------------------------\n\nDoublePoint GetUnitNormal(const IntPoint &pt1, const IntPoint &pt2)\n{\n  if(pt2.X == pt1.X && pt2.Y == pt1.Y) \n    return DoublePoint(0, 0);\n\n  double Dx = (double)(pt2.X - pt1.X);\n  double dy = (double)(pt2.Y - pt1.Y);\n  double f = 1 *1.0/ std::sqrt( Dx*Dx + dy*dy );\n  Dx *= f;\n  dy *= f;\n  return DoublePoint(dy, -Dx);\n}\n\n//------------------------------------------------------------------------------\n// ClipperOffset class\n//------------------------------------------------------------------------------\n\nClipperOffset::ClipperOffset(double miterLimit, double arcTolerance)\n{\n  this->MiterLimit = miterLimit;\n  
this->ArcTolerance = arcTolerance;\n  m_lowest.X = -1;\n}\n//------------------------------------------------------------------------------\n\nClipperOffset::~ClipperOffset()\n{\n  Clear();\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::Clear()\n{\n  for (int i = 0; i < m_polyNodes.ChildCount(); ++i)\n    delete m_polyNodes.Childs[i];\n  m_polyNodes.Childs.clear();\n  m_lowest.X = -1;\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::AddPath(const Path& path, JoinType joinType, EndType endType)\n{\n  int highI = (int)path.size() - 1;\n  if (highI < 0) return;\n  PolyNode* newNode = new PolyNode();\n  newNode->m_jointype = joinType;\n  newNode->m_endtype = endType;\n\n  //strip duplicate points from path and also get index to the lowest point ...\n  if (endType == etClosedLine || endType == etClosedPolygon)\n    while (highI > 0 && path[0] == path[highI]) highI--;\n  newNode->Contour.reserve(highI + 1);\n  newNode->Contour.push_back(path[0]);\n  int j = 0, k = 0;\n  for (int i = 1; i <= highI; i++)\n    if (newNode->Contour[j] != path[i])\n    {\n      j++;\n      newNode->Contour.push_back(path[i]);\n      if (path[i].Y > newNode->Contour[k].Y ||\n        (path[i].Y == newNode->Contour[k].Y &&\n        path[i].X < newNode->Contour[k].X)) k = j;\n    }\n  if (endType == etClosedPolygon && j < 2)\n  {\n    delete newNode;\n    return;\n  }\n  m_polyNodes.AddChild(*newNode);\n\n  //if this path's lowest pt is lower than all the others then update m_lowest\n  if (endType != etClosedPolygon) return;\n  if (m_lowest.X < 0)\n    m_lowest = IntPoint(m_polyNodes.ChildCount() - 1, k);\n  else\n  {\n    IntPoint ip = m_polyNodes.Childs[(int)m_lowest.X]->Contour[(int)m_lowest.Y];\n    if (newNode->Contour[k].Y > ip.Y ||\n      (newNode->Contour[k].Y == ip.Y &&\n      newNode->Contour[k].X < ip.X))\n      m_lowest = IntPoint(m_polyNodes.ChildCount() - 1, k);\n  
}\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::AddPaths(const Paths& paths, JoinType joinType, EndType endType)\n{\n  for (Paths::size_type i = 0; i < paths.size(); ++i)\n    AddPath(paths[i], joinType, endType);\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::FixOrientations()\n{\n  //fixup orientations of all closed paths if the orientation of the\n  //closed path with the lowermost vertex is wrong ...\n  if (m_lowest.X >= 0 && \n    !Orientation(m_polyNodes.Childs[(int)m_lowest.X]->Contour))\n  {\n    for (int i = 0; i < m_polyNodes.ChildCount(); ++i)\n    {\n      PolyNode& node = *m_polyNodes.Childs[i];\n      if (node.m_endtype == etClosedPolygon ||\n        (node.m_endtype == etClosedLine && Orientation(node.Contour)))\n          ReversePath(node.Contour);\n    }\n  } else\n  {\n    for (int i = 0; i < m_polyNodes.ChildCount(); ++i)\n    {\n      PolyNode& node = *m_polyNodes.Childs[i];\n      if (node.m_endtype == etClosedLine && !Orientation(node.Contour))\n        ReversePath(node.Contour);\n    }\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::Execute(Paths& solution, double delta)\n{\n  solution.clear();\n  FixOrientations();\n  DoOffset(delta);\n  \n  //now clean up 'corners' ...\n  Clipper clpr;\n  clpr.AddPaths(m_destPolys, ptSubject, true);\n  if (delta > 0)\n  {\n    clpr.Execute(ctUnion, solution, pftPositive, pftPositive);\n  }\n  else\n  {\n    IntRect r = clpr.GetBounds();\n    Path outer(4);\n    outer[0] = IntPoint(r.left - 10, r.bottom + 10);\n    outer[1] = IntPoint(r.right + 10, r.bottom + 10);\n    outer[2] = IntPoint(r.right + 10, r.top - 10);\n    outer[3] = IntPoint(r.left - 10, r.top - 10);\n\n    clpr.AddPath(outer, ptSubject, true);\n    clpr.ReverseSolution(true);\n    clpr.Execute(ctUnion, solution, pftNegative, pftNegative);\n    if 
(solution.size() > 0) solution.erase(solution.begin());\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::Execute(PolyTree& solution, double delta)\n{\n  solution.Clear();\n  FixOrientations();\n  DoOffset(delta);\n\n  //now clean up 'corners' ...\n  Clipper clpr;\n  clpr.AddPaths(m_destPolys, ptSubject, true);\n  if (delta > 0)\n  {\n    clpr.Execute(ctUnion, solution, pftPositive, pftPositive);\n  }\n  else\n  {\n    IntRect r = clpr.GetBounds();\n    Path outer(4);\n    outer[0] = IntPoint(r.left - 10, r.bottom + 10);\n    outer[1] = IntPoint(r.right + 10, r.bottom + 10);\n    outer[2] = IntPoint(r.right + 10, r.top - 10);\n    outer[3] = IntPoint(r.left - 10, r.top - 10);\n\n    clpr.AddPath(outer, ptSubject, true);\n    clpr.ReverseSolution(true);\n    clpr.Execute(ctUnion, solution, pftNegative, pftNegative);\n    //remove the outer PolyNode rectangle ...\n    if (solution.ChildCount() == 1 && solution.Childs[0]->ChildCount() > 0)\n    {\n      PolyNode* outerNode = solution.Childs[0];\n      solution.Childs.reserve(outerNode->ChildCount());\n      solution.Childs[0] = outerNode->Childs[0];\n      solution.Childs[0]->Parent = outerNode->Parent;\n      for (int i = 1; i < outerNode->ChildCount(); ++i)\n        solution.AddChild(*outerNode->Childs[i]);\n    }\n    else\n      solution.Clear();\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::DoOffset(double delta)\n{\n  m_destPolys.clear();\n  m_delta = delta;\n\n  //if Zero offset, just copy any CLOSED polygons to m_p and return ...\n  if (NEAR_ZERO(delta)) \n  {\n    m_destPolys.reserve(m_polyNodes.ChildCount());\n    for (int i = 0; i < m_polyNodes.ChildCount(); i++)\n    {\n      PolyNode& node = *m_polyNodes.Childs[i];\n      if (node.m_endtype == etClosedPolygon)\n        m_destPolys.push_back(node.Contour);\n    }\n    return;\n  }\n\n  //see offset_triginometry3.svg in the 
documentation folder ...\n  if (MiterLimit > 2) m_miterLim = 2/(MiterLimit * MiterLimit);\n  else m_miterLim = 0.5;\n\n  double y;\n  if (ArcTolerance <= 0.0) y = def_arc_tolerance;\n  else if (ArcTolerance > std::fabs(delta) * def_arc_tolerance) \n    y = std::fabs(delta) * def_arc_tolerance;\n  else y = ArcTolerance;\n  //see offset_triginometry2.svg in the documentation folder ...\n  double steps = pi / std::acos(1 - y / std::fabs(delta));\n  if (steps > std::fabs(delta) * pi) \n    steps = std::fabs(delta) * pi;  //ie excessive precision check\n  m_sin = std::sin(two_pi / steps);\n  m_cos = std::cos(two_pi / steps);\n  m_StepsPerRad = steps / two_pi;\n  if (delta < 0.0) m_sin = -m_sin;\n\n  m_destPolys.reserve(m_polyNodes.ChildCount() * 2);\n  for (int i = 0; i < m_polyNodes.ChildCount(); i++)\n  {\n    PolyNode& node = *m_polyNodes.Childs[i];\n    m_srcPoly = node.Contour;\n\n    int len = (int)m_srcPoly.size();\n    if (len == 0 || (delta <= 0 && (len < 3 || node.m_endtype != etClosedPolygon)))\n        continue;\n\n    m_destPoly.clear();\n    if (len == 1)\n    {\n      if (node.m_jointype == jtRound)\n      {\n        double X = 1.0, Y = 0.0;\n        for (cInt j = 1; j <= steps; j++)\n        {\n          m_destPoly.push_back(IntPoint(\n            Round(m_srcPoly[0].X + X * delta),\n            Round(m_srcPoly[0].Y + Y * delta)));\n          double X2 = X;\n          X = X * m_cos - m_sin * Y;\n          Y = X2 * m_sin + Y * m_cos;\n        }\n      }\n      else\n      {\n        double X = -1.0, Y = -1.0;\n        for (int j = 0; j < 4; ++j)\n        {\n          m_destPoly.push_back(IntPoint(\n            Round(m_srcPoly[0].X + X * delta),\n            Round(m_srcPoly[0].Y + Y * delta)));\n          if (X < 0) X = 1;\n          else if (Y < 0) Y = 1;\n          else X = -1;\n        }\n      }\n      m_destPolys.push_back(m_destPoly);\n      continue;\n    }\n    //build m_normals ...\n    m_normals.clear();\n    m_normals.reserve(len);\n    for (int 
j = 0; j < len - 1; ++j)\n      m_normals.push_back(GetUnitNormal(m_srcPoly[j], m_srcPoly[j + 1]));\n    if (node.m_endtype == etClosedLine || node.m_endtype == etClosedPolygon)\n      m_normals.push_back(GetUnitNormal(m_srcPoly[len - 1], m_srcPoly[0]));\n    else\n      m_normals.push_back(DoublePoint(m_normals[len - 2]));\n\n    if (node.m_endtype == etClosedPolygon)\n    {\n      int k = len - 1;\n      for (int j = 0; j < len; ++j)\n        OffsetPoint(j, k, node.m_jointype);\n      m_destPolys.push_back(m_destPoly);\n    }\n    else if (node.m_endtype == etClosedLine)\n    {\n      int k = len - 1;\n      for (int j = 0; j < len; ++j)\n        OffsetPoint(j, k, node.m_jointype);\n      m_destPolys.push_back(m_destPoly);\n      m_destPoly.clear();\n      //re-build m_normals ...\n      DoublePoint n = m_normals[len -1];\n      for (int j = len - 1; j > 0; j--)\n        m_normals[j] = DoublePoint(-m_normals[j - 1].X, -m_normals[j - 1].Y);\n      m_normals[0] = DoublePoint(-n.X, -n.Y);\n      k = 0;\n      for (int j = len - 1; j >= 0; j--)\n        OffsetPoint(j, k, node.m_jointype);\n      m_destPolys.push_back(m_destPoly);\n    }\n    else\n    {\n      int k = 0;\n      for (int j = 1; j < len - 1; ++j)\n        OffsetPoint(j, k, node.m_jointype);\n\n      IntPoint pt1;\n      if (node.m_endtype == etOpenButt)\n      {\n        int j = len - 1;\n        pt1 = IntPoint((cInt)Round(m_srcPoly[j].X + m_normals[j].X *\n          delta), (cInt)Round(m_srcPoly[j].Y + m_normals[j].Y * delta));\n        m_destPoly.push_back(pt1);\n        pt1 = IntPoint((cInt)Round(m_srcPoly[j].X - m_normals[j].X *\n          delta), (cInt)Round(m_srcPoly[j].Y - m_normals[j].Y * delta));\n        m_destPoly.push_back(pt1);\n      }\n      else\n      {\n        int j = len - 1;\n        k = len - 2;\n        m_sinA = 0;\n        m_normals[j] = DoublePoint(-m_normals[j].X, -m_normals[j].Y);\n        if (node.m_endtype == etOpenSquare)\n          DoSquare(j, k);\n        else\n          
DoRound(j, k);\n      }\n\n      //re-build m_normals ...\n      for (int j = len - 1; j > 0; j--)\n        m_normals[j] = DoublePoint(-m_normals[j - 1].X, -m_normals[j - 1].Y);\n      m_normals[0] = DoublePoint(-m_normals[1].X, -m_normals[1].Y);\n\n      k = len - 1;\n      for (int j = k - 1; j > 0; --j) OffsetPoint(j, k, node.m_jointype);\n\n      if (node.m_endtype == etOpenButt)\n      {\n        pt1 = IntPoint((cInt)Round(m_srcPoly[0].X - m_normals[0].X * delta),\n          (cInt)Round(m_srcPoly[0].Y - m_normals[0].Y * delta));\n        m_destPoly.push_back(pt1);\n        pt1 = IntPoint((cInt)Round(m_srcPoly[0].X + m_normals[0].X * delta),\n          (cInt)Round(m_srcPoly[0].Y + m_normals[0].Y * delta));\n        m_destPoly.push_back(pt1);\n      }\n      else\n      {\n        k = 1;\n        m_sinA = 0;\n        if (node.m_endtype == etOpenSquare)\n          DoSquare(0, 1);\n        else\n          DoRound(0, 1);\n      }\n      m_destPolys.push_back(m_destPoly);\n    }\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::OffsetPoint(int j, int& k, JoinType jointype)\n{\n  //cross product ...\n  m_sinA = (m_normals[k].X * m_normals[j].Y - m_normals[j].X * m_normals[k].Y);\n  if (std::fabs(m_sinA * m_delta) < 1.0) \n  {\n    //dot product ...\n    double cosA = (m_normals[k].X * m_normals[j].X + m_normals[j].Y * m_normals[k].Y ); \n    if (cosA > 0) // angle => 0 degrees\n    {\n      m_destPoly.push_back(IntPoint(Round(m_srcPoly[j].X + m_normals[k].X * m_delta),\n        Round(m_srcPoly[j].Y + m_normals[k].Y * m_delta)));\n      return; \n    }\n    //else angle => 180 degrees   \n  }\n  else if (m_sinA > 1.0) m_sinA = 1.0;\n  else if (m_sinA < -1.0) m_sinA = -1.0;\n\n  if (m_sinA * m_delta < 0)\n  {\n    m_destPoly.push_back(IntPoint(Round(m_srcPoly[j].X + m_normals[k].X * m_delta),\n      Round(m_srcPoly[j].Y + m_normals[k].Y * m_delta)));\n    m_destPoly.push_back(m_srcPoly[j]);\n    
m_destPoly.push_back(IntPoint(Round(m_srcPoly[j].X + m_normals[j].X * m_delta),\n      Round(m_srcPoly[j].Y + m_normals[j].Y * m_delta)));\n  }\n  else\n    switch (jointype)\n    {\n      case jtMiter:\n        {\n          double r = 1 + (m_normals[j].X * m_normals[k].X +\n            m_normals[j].Y * m_normals[k].Y);\n          if (r >= m_miterLim) DoMiter(j, k, r); else DoSquare(j, k);\n          break;\n        }\n      case jtSquare: DoSquare(j, k); break;\n      case jtRound: DoRound(j, k); break;\n    }\n  k = j;\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::DoSquare(int j, int k)\n{\n  double dx = std::tan(std::atan2(m_sinA,\n      m_normals[k].X * m_normals[j].X + m_normals[k].Y * m_normals[j].Y) / 4);\n  m_destPoly.push_back(IntPoint(\n      Round(m_srcPoly[j].X + m_delta * (m_normals[k].X - m_normals[k].Y * dx)),\n      Round(m_srcPoly[j].Y + m_delta * (m_normals[k].Y + m_normals[k].X * dx))));\n  m_destPoly.push_back(IntPoint(\n      Round(m_srcPoly[j].X + m_delta * (m_normals[j].X + m_normals[j].Y * dx)),\n      Round(m_srcPoly[j].Y + m_delta * (m_normals[j].Y - m_normals[j].X * dx))));\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::DoMiter(int j, int k, double r)\n{\n  double q = m_delta / r;\n  m_destPoly.push_back(IntPoint(Round(m_srcPoly[j].X + (m_normals[k].X + m_normals[j].X) * q),\n      Round(m_srcPoly[j].Y + (m_normals[k].Y + m_normals[j].Y) * q)));\n}\n//------------------------------------------------------------------------------\n\nvoid ClipperOffset::DoRound(int j, int k)\n{\n  double a = std::atan2(m_sinA,\n  m_normals[k].X * m_normals[j].X + m_normals[k].Y * m_normals[j].Y);\n  int steps = std::max((int)Round(m_StepsPerRad * std::fabs(a)), 1);\n\n  double X = m_normals[k].X, Y = m_normals[k].Y, X2;\n  for (int i = 0; i < steps; ++i)\n  {\n    m_destPoly.push_back(IntPoint(\n        Round(m_srcPoly[j].X + X * 
m_delta),\n        Round(m_srcPoly[j].Y + Y * m_delta)));\n    X2 = X;\n    X = X * m_cos - m_sin * Y;\n    Y = X2 * m_sin + Y * m_cos;\n  }\n  m_destPoly.push_back(IntPoint(\n  Round(m_srcPoly[j].X + m_normals[j].X * m_delta),\n  Round(m_srcPoly[j].Y + m_normals[j].Y * m_delta)));\n}\n\n//------------------------------------------------------------------------------\n// Miscellaneous public functions\n//------------------------------------------------------------------------------\n\nvoid Clipper::DoSimplePolygons()\n{\n  PolyOutList::size_type i = 0;\n  while (i < m_PolyOuts.size()) \n  {\n    OutRec* outrec = m_PolyOuts[i++];\n    OutPt* op = outrec->Pts;\n    if (!op || outrec->IsOpen) continue;\n    do //for each Pt in Polygon until duplicate found do ...\n    {\n      OutPt* op2 = op->Next;\n      while (op2 != outrec->Pts) \n      {\n        if ((op->Pt == op2->Pt) && op2->Next != op && op2->Prev != op) \n        {\n          //split the polygon into two ...\n          OutPt* op3 = op->Prev;\n          OutPt* op4 = op2->Prev;\n          op->Prev = op4;\n          op4->Next = op;\n          op2->Prev = op3;\n          op3->Next = op2;\n\n          outrec->Pts = op;\n          OutRec* outrec2 = CreateOutRec();\n          outrec2->Pts = op2;\n          UpdateOutPtIdxs(*outrec2);\n          if (Poly2ContainsPoly1(outrec2->Pts, outrec->Pts))\n          {\n            //OutRec2 is contained by OutRec1 ...\n            outrec2->IsHole = !outrec->IsHole;\n            outrec2->FirstLeft = outrec;\n            if (m_UsingPolyTree) FixupFirstLefts2(outrec2, outrec);\n          }\n          else\n            if (Poly2ContainsPoly1(outrec->Pts, outrec2->Pts))\n          {\n            //OutRec1 is contained by OutRec2 ...\n            outrec2->IsHole = outrec->IsHole;\n            outrec->IsHole = !outrec2->IsHole;\n            outrec2->FirstLeft = outrec->FirstLeft;\n            outrec->FirstLeft = outrec2;\n            if (m_UsingPolyTree) FixupFirstLefts2(outrec, 
outrec2);\n            }\n            else\n          {\n            //the 2 polygons are separate ...\n            outrec2->IsHole = outrec->IsHole;\n            outrec2->FirstLeft = outrec->FirstLeft;\n            if (m_UsingPolyTree) FixupFirstLefts1(outrec, outrec2);\n            }\n          op2 = op; //ie get ready for the Next iteration\n        }\n        op2 = op2->Next;\n      }\n      op = op->Next;\n    }\n    while (op != outrec->Pts);\n  }\n}\n//------------------------------------------------------------------------------\n\nvoid ReversePath(Path& p)\n{\n  std::reverse(p.begin(), p.end());\n}\n//------------------------------------------------------------------------------\n\nvoid ReversePaths(Paths& p)\n{\n  for (Paths::size_type i = 0; i < p.size(); ++i)\n    ReversePath(p[i]);\n}\n//------------------------------------------------------------------------------\n\nvoid SimplifyPolygon(const Path &in_poly, Paths &out_polys, PolyFillType fillType)\n{\n  Clipper c;\n  c.StrictlySimple(true);\n  c.AddPath(in_poly, ptSubject, true);\n  c.Execute(ctUnion, out_polys, fillType, fillType);\n}\n//------------------------------------------------------------------------------\n\nvoid SimplifyPolygons(const Paths &in_polys, Paths &out_polys, PolyFillType fillType)\n{\n  Clipper c;\n  c.StrictlySimple(true);\n  c.AddPaths(in_polys, ptSubject, true);\n  c.Execute(ctUnion, out_polys, fillType, fillType);\n}\n//------------------------------------------------------------------------------\n\nvoid SimplifyPolygons(Paths &polys, PolyFillType fillType)\n{\n  SimplifyPolygons(polys, polys, fillType);\n}\n//------------------------------------------------------------------------------\n\ninline double DistanceSqrd(const IntPoint& pt1, const IntPoint& pt2)\n{\n  double Dx = ((double)pt1.X - pt2.X);\n  double dy = ((double)pt1.Y - pt2.Y);\n  return (Dx*Dx + dy*dy);\n}\n//------------------------------------------------------------------------------\n\ndouble 
DistanceFromLineSqrd(\n  const IntPoint& pt, const IntPoint& ln1, const IntPoint& ln2)\n{\n  //The equation of a line in general form (Ax + By + C = 0)\n  //given 2 points (x1,y1) & (x2,y2) is ...\n  //(y1 - y2)x + (x2 - x1)y + (y2 - y1)x1 - (x2 - x1)y1 = 0\n  //A = (y1 - y2); B = (x2 - x1); C = (y2 - y1)x1 - (x2 - x1)y1\n  //perpendicular distance of point (x3,y3) = (Ax3 + By3 + C)/Sqrt(A^2 + B^2)\n  //see http://en.wikipedia.org/wiki/Perpendicular_distance\n  double A = double(ln1.Y - ln2.Y);\n  double B = double(ln2.X - ln1.X);\n  double C = A * ln1.X  + B * ln1.Y;\n  C = A * pt.X + B * pt.Y - C;\n  return (C * C) / (A * A + B * B);\n}\n//---------------------------------------------------------------------------\n\nbool SlopesNearCollinear(const IntPoint& pt1, \n    const IntPoint& pt2, const IntPoint& pt3, double distSqrd)\n{\n  //this function is more accurate when the point that's geometrically\n  //between the other 2 points is the one that's tested for distance.\n  //ie makes it more likely to pick up 'spikes' ...\n\tif (Abs(pt1.X - pt2.X) > Abs(pt1.Y - pt2.Y))\n\t{\n    if ((pt1.X > pt2.X) == (pt1.X < pt3.X))\n      return DistanceFromLineSqrd(pt1, pt2, pt3) < distSqrd;\n    else if ((pt2.X > pt1.X) == (pt2.X < pt3.X))\n      return DistanceFromLineSqrd(pt2, pt1, pt3) < distSqrd;\n\t\telse\n\t    return DistanceFromLineSqrd(pt3, pt1, pt2) < distSqrd;\n\t}\n\telse\n\t{\n    if ((pt1.Y > pt2.Y) == (pt1.Y < pt3.Y))\n      return DistanceFromLineSqrd(pt1, pt2, pt3) < distSqrd;\n    else if ((pt2.Y > pt1.Y) == (pt2.Y < pt3.Y))\n      return DistanceFromLineSqrd(pt2, pt1, pt3) < distSqrd;\n\t\telse\n      return DistanceFromLineSqrd(pt3, pt1, pt2) < distSqrd;\n\t}\n}\n//------------------------------------------------------------------------------\n\nbool PointsAreClose(IntPoint pt1, IntPoint pt2, double distSqrd)\n{\n    double Dx = (double)pt1.X - pt2.X;\n    double dy = (double)pt1.Y - pt2.Y;\n    return ((Dx * Dx) + (dy * dy) <= 
distSqrd);\n}\n//------------------------------------------------------------------------------\n\nOutPt* ExcludeOp(OutPt* op)\n{\n  OutPt* result = op->Prev;\n  result->Next = op->Next;\n  op->Next->Prev = result;\n  result->Idx = 0;\n  return result;\n}\n//------------------------------------------------------------------------------\n\nvoid CleanPolygon(const Path& in_poly, Path& out_poly, double distance)\n{\n  //distance = proximity in units/pixels below which vertices\n  //will be stripped. Default ~= sqrt(2).\n  \n  size_t size = in_poly.size();\n  \n  if (size == 0) \n  {\n    out_poly.clear();\n    return;\n  }\n\n  OutPt* outPts = new OutPt[size];\n  for (size_t i = 0; i < size; ++i)\n  {\n    outPts[i].Pt = in_poly[i];\n    outPts[i].Next = &outPts[(i + 1) % size];\n    outPts[i].Next->Prev = &outPts[i];\n    outPts[i].Idx = 0;\n  }\n\n  double distSqrd = distance * distance;\n  OutPt* op = &outPts[0];\n  while (op->Idx == 0 && op->Next != op->Prev) \n  {\n    if (PointsAreClose(op->Pt, op->Prev->Pt, distSqrd))\n    {\n      op = ExcludeOp(op);\n      size--;\n    } \n    else if (PointsAreClose(op->Prev->Pt, op->Next->Pt, distSqrd))\n    {\n      ExcludeOp(op->Next);\n      op = ExcludeOp(op);\n      size -= 2;\n    }\n    else if (SlopesNearCollinear(op->Prev->Pt, op->Pt, op->Next->Pt, distSqrd))\n    {\n      op = ExcludeOp(op);\n      size--;\n    }\n    else\n    {\n      op->Idx = 1;\n      op = op->Next;\n    }\n  }\n\n  if (size < 3) size = 0;\n  out_poly.resize(size);\n  for (size_t i = 0; i < size; ++i)\n  {\n    out_poly[i] = op->Pt;\n    op = op->Next;\n  }\n  delete [] outPts;\n}\n//------------------------------------------------------------------------------\n\nvoid CleanPolygon(Path& poly, double distance)\n{\n  CleanPolygon(poly, poly, distance);\n}\n//------------------------------------------------------------------------------\n\nvoid CleanPolygons(const Paths& in_polys, Paths& out_polys, double distance)\n{\n  
out_polys.resize(in_polys.size());\n  for (Paths::size_type i = 0; i < in_polys.size(); ++i)\n    CleanPolygon(in_polys[i], out_polys[i], distance);\n}\n//------------------------------------------------------------------------------\n\nvoid CleanPolygons(Paths& polys, double distance)\n{\n  CleanPolygons(polys, polys, distance);\n}\n//------------------------------------------------------------------------------\n\nvoid Minkowski(const Path& poly, const Path& path, \n  Paths& solution, bool isSum, bool isClosed)\n{\n  int delta = (isClosed ? 1 : 0);\n  size_t polyCnt = poly.size();\n  size_t pathCnt = path.size();\n  Paths pp;\n  pp.reserve(pathCnt);\n  if (isSum)\n    for (size_t i = 0; i < pathCnt; ++i)\n    {\n      Path p;\n      p.reserve(polyCnt);\n      for (size_t j = 0; j < poly.size(); ++j)\n        p.push_back(IntPoint(path[i].X + poly[j].X, path[i].Y + poly[j].Y));\n      pp.push_back(p);\n    }\n  else\n    for (size_t i = 0; i < pathCnt; ++i)\n    {\n      Path p;\n      p.reserve(polyCnt);\n      for (size_t j = 0; j < poly.size(); ++j)\n        p.push_back(IntPoint(path[i].X - poly[j].X, path[i].Y - poly[j].Y));\n      pp.push_back(p);\n    }\n\n  solution.clear();\n  solution.reserve((pathCnt + delta) * (polyCnt + 1));\n  for (size_t i = 0; i < pathCnt - 1 + delta; ++i)\n    for (size_t j = 0; j < polyCnt; ++j)\n    {\n      Path quad;\n      quad.reserve(4);\n      quad.push_back(pp[i % pathCnt][j % polyCnt]);\n      quad.push_back(pp[(i + 1) % pathCnt][j % polyCnt]);\n      quad.push_back(pp[(i + 1) % pathCnt][(j + 1) % polyCnt]);\n      quad.push_back(pp[i % pathCnt][(j + 1) % polyCnt]);\n      if (!Orientation(quad)) ReversePath(quad);\n      solution.push_back(quad);\n    }\n}\n//------------------------------------------------------------------------------\n\nvoid MinkowskiSum(const Path& pattern, const Path& path, Paths& solution, bool pathIsClosed)\n{\n  Minkowski(pattern, path, solution, true, pathIsClosed);\n  Clipper c;\n  
c.AddPaths(solution, ptSubject, true);\n  c.Execute(ctUnion, solution, pftNonZero, pftNonZero);\n}\n//------------------------------------------------------------------------------\n\nvoid TranslatePath(const Path& input, Path& output, const IntPoint delta)\n{\n  //precondition: input != output\n  output.resize(input.size());\n  for (size_t i = 0; i < input.size(); ++i)\n    output[i] = IntPoint(input[i].X + delta.X, input[i].Y + delta.Y);\n}\n//------------------------------------------------------------------------------\n\nvoid MinkowskiSum(const Path& pattern, const Paths& paths, Paths& solution, bool pathIsClosed)\n{\n  Clipper c;\n  for (size_t i = 0; i < paths.size(); ++i)\n  {\n    Paths tmp;\n    Minkowski(pattern, paths[i], tmp, true, pathIsClosed);\n    c.AddPaths(tmp, ptSubject, true);\n    if (pathIsClosed)\n    {\n      Path tmp2;\n      TranslatePath(paths[i], tmp2, pattern[0]);\n      c.AddPath(tmp2, ptClip, true);\n    }\n  }\n    c.Execute(ctUnion, solution, pftNonZero, pftNonZero);\n}\n//------------------------------------------------------------------------------\n\nvoid MinkowskiDiff(const Path& poly1, const Path& poly2, Paths& solution)\n{\n  Minkowski(poly1, poly2, solution, false, true);\n  Clipper c;\n  c.AddPaths(solution, ptSubject, true);\n  c.Execute(ctUnion, solution, pftNonZero, pftNonZero);\n}\n//------------------------------------------------------------------------------\n\nenum NodeType {ntAny, ntOpen, ntClosed};\n\nvoid AddPolyNodeToPaths(const PolyNode& polynode, NodeType nodetype, Paths& paths)\n{\n  bool match = true;\n  if (nodetype == ntClosed) match = !polynode.IsOpen();\n  else if (nodetype == ntOpen) return;\n\n  if (!polynode.Contour.empty() && match)\n    paths.push_back(polynode.Contour);\n  for (int i = 0; i < polynode.ChildCount(); ++i)\n    AddPolyNodeToPaths(*polynode.Childs[i], nodetype, paths);\n}\n//------------------------------------------------------------------------------\n\nvoid PolyTreeToPaths(const 
PolyTree& polytree, Paths& paths)\n{\n  paths.resize(0); \n  paths.reserve(polytree.Total());\n  AddPolyNodeToPaths(polytree, ntAny, paths);\n}\n//------------------------------------------------------------------------------\n\nvoid ClosedPathsFromPolyTree(const PolyTree& polytree, Paths& paths)\n{\n  paths.resize(0); \n  paths.reserve(polytree.Total());\n  AddPolyNodeToPaths(polytree, ntClosed, paths);\n}\n//------------------------------------------------------------------------------\n\nvoid OpenPathsFromPolyTree(PolyTree& polytree, Paths& paths)\n{\n  paths.resize(0); \n  paths.reserve(polytree.Total());\n  //Open paths are top level only, so ...\n  for (int i = 0; i < polytree.ChildCount(); ++i)\n    if (polytree.Childs[i]->IsOpen())\n      paths.push_back(polytree.Childs[i]->Contour);\n}\n//------------------------------------------------------------------------------\n\nstd::ostream& operator <<(std::ostream &s, const IntPoint &p)\n{\n  s << \"(\" << p.X << \",\" << p.Y << \")\";\n  return s;\n}\n//------------------------------------------------------------------------------\n\nstd::ostream& operator <<(std::ostream &s, const Path &p)\n{\n  if (p.empty()) return s;\n  Path::size_type last = p.size() -1;\n  for (Path::size_type i = 0; i < last; i++)\n    s << \"(\" << p[i].X << \",\" << p[i].Y << \"), \";\n  s << \"(\" << p[last].X << \",\" << p[last].Y << \")\\n\";\n  return s;\n}\n//------------------------------------------------------------------------------\n\nstd::ostream& operator <<(std::ostream &s, const Paths &p)\n{\n  for (Paths::size_type i = 0; i < p.size(); i++)\n    s << p[i];\n  s << \"\\n\";\n  return s;\n}\n//------------------------------------------------------------------------------\n\n} //ClipperLib namespace\n"
  },
  {
    "path": "dbnet/clipper/clipper.hpp",
    "content": "/*******************************************************************************\n*                                                                              *\n* Author    :  Angus Johnson                                                   *\n* Version   :  6.4.2                                                           *\n* Date      :  27 February 2017                                                *\n* Website   :  http://www.angusj.com                                           *\n* Copyright :  Angus Johnson 2010-2017                                         *\n*                                                                              *\n* License:                                                                     *\n* Use, modification & distribution is subject to Boost Software License Ver 1. *\n* http://www.boost.org/LICENSE_1_0.txt                                         *\n*                                                                              *\n* Attributions:                                                                *\n* The code in this library is an extension of Bala Vatti's clipping algorithm: *\n* \"A generic solution to polygon clipping\"                                     *\n* Communications of the ACM, Vol 35, Issue 7 (July 1992) pp 56-63.             *\n* http://portal.acm.org/citation.cfm?id=129906                                 *\n*                                                                              *\n* Computer graphics and geometric modeling: implementation and algorithms      *\n* By Max K. 
Agoston                                                            *\n* Springer; 1 edition (January 4, 2005)                                        *\n* http://books.google.com/books?q=vatti+clipping+agoston                       *\n*                                                                              *\n* See also:                                                                    *\n* \"Polygon Offsetting by Computing Winding Numbers\"                            *\n* Paper no. DETC2005-85513 pp. 565-575                                         *\n* ASME 2005 International Design Engineering Technical Conferences             *\n* and Computers and Information in Engineering Conference (IDETC/CIE2005)      *\n* September 24-28, 2005 , Long Beach, California, USA                          *\n* http://www.me.berkeley.edu/~mcmains/pubs/DAC05OffsetPolygon.pdf              *\n*                                                                              *\n*******************************************************************************/\n\n#ifndef clipper_hpp\n#define clipper_hpp\n\n#define CLIPPER_VERSION \"6.4.2\"\n\n//use_int32: When enabled 32bit ints are used instead of 64bit ints. This\n//improves performance but coordinate values are limited to the range +/- 46340\n//#define use_int32\n\n//use_xyz: adds a Z member to IntPoint. Adds a minor cost to performance.\n//#define use_xyz\n\n//use_lines: Enables line clipping. 
Adds a very minor cost to performance.\n#define use_lines\n  \n//use_deprecated: Enables temporary support for the obsolete functions\n//#define use_deprecated  \n\n#include <vector>\n#include <list>\n#include <set>\n#include <stdexcept>\n#include <cstring>\n#include <cstdlib>\n#include <ostream>\n#include <functional>\n#include <queue>\n\nnamespace ClipperLib {\n\nenum ClipType { ctIntersection, ctUnion, ctDifference, ctXor };\nenum PolyType { ptSubject, ptClip };\n//By far the most widely used winding rules for polygon filling are\n//EvenOdd & NonZero (GDI, GDI+, XLib, OpenGL, Cairo, AGG, Quartz, SVG, Gr32)\n//Other rules include Positive, Negative and ABS_GTR_EQ_TWO (only in OpenGL)\n//see http://glprogramming.com/red/chapter11.html\nenum PolyFillType { pftEvenOdd, pftNonZero, pftPositive, pftNegative };\n\n#ifdef use_int32\n  typedef int cInt;\n  static cInt const loRange = 0x7FFF;\n  static cInt const hiRange = 0x7FFF;\n#else\n  typedef signed long long cInt;\n  static cInt const loRange = 0x3FFFFFFF;\n  static cInt const hiRange = 0x3FFFFFFFFFFFFFFFLL;\n  typedef signed long long long64;     //used by Int128 class\n  typedef unsigned long long ulong64;\n\n#endif\n\nstruct IntPoint {\n  cInt X;\n  cInt Y;\n#ifdef use_xyz\n  cInt Z;\n  IntPoint(cInt x = 0, cInt y = 0, cInt z = 0): X(x), Y(y), Z(z) {};\n#else\n  IntPoint(cInt x = 0, cInt y = 0): X(x), Y(y) {};\n#endif\n\n  friend inline bool operator== (const IntPoint& a, const IntPoint& b)\n  {\n    return a.X == b.X && a.Y == b.Y;\n  }\n  friend inline bool operator!= (const IntPoint& a, const IntPoint& b)\n  {\n    return a.X != b.X  || a.Y != b.Y; \n  }\n};\n//------------------------------------------------------------------------------\n\ntypedef std::vector< IntPoint > Path;\ntypedef std::vector< Path > Paths;\n\ninline Path& operator <<(Path& poly, const IntPoint& p) {poly.push_back(p); return poly;}\ninline Paths& operator <<(Paths& polys, const Path& p) {polys.push_back(p); return 
polys;}\n\nstd::ostream& operator <<(std::ostream &s, const IntPoint &p);\nstd::ostream& operator <<(std::ostream &s, const Path &p);\nstd::ostream& operator <<(std::ostream &s, const Paths &p);\n\nstruct DoublePoint\n{\n  double X;\n  double Y;\n  DoublePoint(double x = 0, double y = 0) : X(x), Y(y) {}\n  DoublePoint(IntPoint ip) : X((double)ip.X), Y((double)ip.Y) {}\n};\n//------------------------------------------------------------------------------\n\n#ifdef use_xyz\ntypedef void (*ZFillCallback)(IntPoint& e1bot, IntPoint& e1top, IntPoint& e2bot, IntPoint& e2top, IntPoint& pt);\n#endif\n\nenum InitOptions {ioReverseSolution = 1, ioStrictlySimple = 2, ioPreserveCollinear = 4};\nenum JoinType {jtSquare, jtRound, jtMiter};\nenum EndType {etClosedPolygon, etClosedLine, etOpenButt, etOpenSquare, etOpenRound};\n\nclass PolyNode;\ntypedef std::vector< PolyNode* > PolyNodes;\n\nclass PolyNode \n{ \npublic:\n    PolyNode();\n    virtual ~PolyNode(){};\n    Path Contour;\n    PolyNodes Childs;\n    PolyNode* Parent;\n    PolyNode* GetNext() const;\n    bool IsHole() const;\n    bool IsOpen() const;\n    int ChildCount() const;\nprivate:\n    //PolyNode& operator =(PolyNode& other); \n    unsigned Index; //node index in Parent.Childs\n    bool m_IsOpen;\n    JoinType m_jointype;\n    EndType m_endtype;\n    PolyNode* GetNextSiblingUp() const;\n    void AddChild(PolyNode& child);\n    friend class Clipper; //to access Index\n    friend class ClipperOffset; \n};\n\nclass PolyTree: public PolyNode\n{ \npublic:\n    ~PolyTree(){ Clear(); };\n    PolyNode* GetFirst() const;\n    void Clear();\n    int Total() const;\nprivate:\n  //PolyTree& operator =(PolyTree& other);\n  PolyNodes AllNodes;\n    friend class Clipper; //to access AllNodes\n};\n\nbool Orientation(const Path &poly);\ndouble Area(const Path &poly);\nint PointInPolygon(const IntPoint &pt, const Path &path);\n\nvoid SimplifyPolygon(const Path &in_poly, Paths &out_polys, PolyFillType fillType = pftEvenOdd);\nvoid 
SimplifyPolygons(const Paths &in_polys, Paths &out_polys, PolyFillType fillType = pftEvenOdd);\nvoid SimplifyPolygons(Paths &polys, PolyFillType fillType = pftEvenOdd);\n\nvoid CleanPolygon(const Path& in_poly, Path& out_poly, double distance = 1.415);\nvoid CleanPolygon(Path& poly, double distance = 1.415);\nvoid CleanPolygons(const Paths& in_polys, Paths& out_polys, double distance = 1.415);\nvoid CleanPolygons(Paths& polys, double distance = 1.415);\n\nvoid MinkowskiSum(const Path& pattern, const Path& path, Paths& solution, bool pathIsClosed);\nvoid MinkowskiSum(const Path& pattern, const Paths& paths, Paths& solution, bool pathIsClosed);\nvoid MinkowskiDiff(const Path& poly1, const Path& poly2, Paths& solution);\n\nvoid PolyTreeToPaths(const PolyTree& polytree, Paths& paths);\nvoid ClosedPathsFromPolyTree(const PolyTree& polytree, Paths& paths);\nvoid OpenPathsFromPolyTree(PolyTree& polytree, Paths& paths);\n\nvoid ReversePath(Path& p);\nvoid ReversePaths(Paths& p);\n\nstruct IntRect { cInt left; cInt top; cInt right; cInt bottom; };\n\n//enums that are used internally ...\nenum EdgeSide { esLeft = 1, esRight = 2};\n\n//forward declarations (for stuff used internally) ...\nstruct TEdge;\nstruct IntersectNode;\nstruct LocalMinimum;\nstruct OutPt;\nstruct OutRec;\nstruct Join;\n\ntypedef std::vector < OutRec* > PolyOutList;\ntypedef std::vector < TEdge* > EdgeList;\ntypedef std::vector < Join* > JoinList;\ntypedef std::vector < IntersectNode* > IntersectList;\n\n//------------------------------------------------------------------------------\n\n//ClipperBase is the ancestor to the Clipper class. It should not be\n//instantiated directly. 
This class simply abstracts the conversion of sets of\n//polygon coordinates into edge objects that are stored in a LocalMinima list.\nclass ClipperBase\n{\npublic:\n  ClipperBase();\n  virtual ~ClipperBase();\n  virtual bool AddPath(const Path &pg, PolyType PolyTyp, bool Closed);\n  bool AddPaths(const Paths &ppg, PolyType PolyTyp, bool Closed);\n  virtual void Clear();\n  IntRect GetBounds();\n  bool PreserveCollinear() {return m_PreserveCollinear;};\n  void PreserveCollinear(bool value) {m_PreserveCollinear = value;};\nprotected:\n  void DisposeLocalMinimaList();\n  TEdge* AddBoundsToLML(TEdge *e, bool IsClosed);\n  virtual void Reset();\n  TEdge* ProcessBound(TEdge* E, bool IsClockwise);\n  void InsertScanbeam(const cInt Y);\n  bool PopScanbeam(cInt &Y);\n  bool LocalMinimaPending();\n  bool PopLocalMinima(cInt Y, const LocalMinimum *&locMin);\n  OutRec* CreateOutRec();\n  void DisposeAllOutRecs();\n  void DisposeOutRec(PolyOutList::size_type index);\n  void SwapPositionsInAEL(TEdge *edge1, TEdge *edge2);\n  void DeleteFromAEL(TEdge *e);\n  void UpdateEdgeIntoAEL(TEdge *&e);\n\n  typedef std::vector<LocalMinimum> MinimaList;\n  MinimaList::iterator m_CurrentLM;\n  MinimaList           m_MinimaList;\n\n  bool              m_UseFullRange;\n  EdgeList          m_edges;\n  bool              m_PreserveCollinear;\n  bool              m_HasOpenPaths;\n  PolyOutList       m_PolyOuts;\n  TEdge           *m_ActiveEdges;\n\n  typedef std::priority_queue<cInt> ScanbeamList;\n  ScanbeamList     m_Scanbeam;\n};\n//------------------------------------------------------------------------------\n\nclass Clipper : public virtual ClipperBase\n{\npublic:\n  Clipper(int initOptions = 0);\n  bool Execute(ClipType clipType,\n      Paths &solution,\n      PolyFillType fillType = pftEvenOdd);\n  bool Execute(ClipType clipType,\n      Paths &solution,\n      PolyFillType subjFillType,\n      PolyFillType clipFillType);\n  bool Execute(ClipType clipType,\n      PolyTree &polytree,\n      
PolyFillType fillType = pftEvenOdd);\n  bool Execute(ClipType clipType,\n      PolyTree &polytree,\n      PolyFillType subjFillType,\n      PolyFillType clipFillType);\n  bool ReverseSolution() { return m_ReverseOutput; };\n  void ReverseSolution(bool value) {m_ReverseOutput = value;};\n  bool StrictlySimple() {return m_StrictSimple;};\n  void StrictlySimple(bool value) {m_StrictSimple = value;};\n  //set the callback function for z value filling on intersections (otherwise Z is 0)\n#ifdef use_xyz\n  void ZFillFunction(ZFillCallback zFillFunc);\n#endif\nprotected:\n  virtual bool ExecuteInternal();\nprivate:\n  JoinList         m_Joins;\n  JoinList         m_GhostJoins;\n  IntersectList    m_IntersectList;\n  ClipType         m_ClipType;\n  typedef std::list<cInt> MaximaList;\n  MaximaList       m_Maxima;\n  TEdge           *m_SortedEdges;\n  bool             m_ExecuteLocked;\n  PolyFillType     m_ClipFillType;\n  PolyFillType     m_SubjFillType;\n  bool             m_ReverseOutput;\n  bool             m_UsingPolyTree; \n  bool             m_StrictSimple;\n#ifdef use_xyz\n  ZFillCallback   m_ZFill; //custom callback \n#endif\n  void SetWindingCount(TEdge& edge);\n  bool IsEvenOddFillType(const TEdge& edge) const;\n  bool IsEvenOddAltFillType(const TEdge& edge) const;\n  void InsertLocalMinimaIntoAEL(const cInt botY);\n  void InsertEdgeIntoAEL(TEdge *edge, TEdge* startEdge);\n  void AddEdgeToSEL(TEdge *edge);\n  bool PopEdgeFromSEL(TEdge *&edge);\n  void CopyAELToSEL();\n  void DeleteFromSEL(TEdge *e);\n  void SwapPositionsInSEL(TEdge *edge1, TEdge *edge2);\n  bool IsContributing(const TEdge& edge) const;\n  bool IsTopHorz(const cInt XPos);\n  void DoMaxima(TEdge *e);\n  void ProcessHorizontals();\n  void ProcessHorizontal(TEdge *horzEdge);\n  void AddLocalMaxPoly(TEdge *e1, TEdge *e2, const IntPoint &pt);\n  OutPt* AddLocalMinPoly(TEdge *e1, TEdge *e2, const IntPoint &pt);\n  OutRec* GetOutRec(int idx);\n  void AppendPolygon(TEdge *e1, TEdge *e2);\n  void 
IntersectEdges(TEdge *e1, TEdge *e2, IntPoint &pt);\n  OutPt* AddOutPt(TEdge *e, const IntPoint &pt);\n  OutPt* GetLastOutPt(TEdge *e);\n  bool ProcessIntersections(const cInt topY);\n  void BuildIntersectList(const cInt topY);\n  void ProcessIntersectList();\n  void ProcessEdgesAtTopOfScanbeam(const cInt topY);\n  void BuildResult(Paths& polys);\n  void BuildResult2(PolyTree& polytree);\n  void SetHoleState(TEdge *e, OutRec *outrec);\n  void DisposeIntersectNodes();\n  bool FixupIntersectionOrder();\n  void FixupOutPolygon(OutRec &outrec);\n  void FixupOutPolyline(OutRec &outrec);\n  bool IsHole(TEdge *e);\n  bool FindOwnerFromSplitRecs(OutRec &outRec, OutRec *&currOrfl);\n  void FixHoleLinkage(OutRec &outrec);\n  void AddJoin(OutPt *op1, OutPt *op2, const IntPoint offPt);\n  void ClearJoins();\n  void ClearGhostJoins();\n  void AddGhostJoin(OutPt *op, const IntPoint offPt);\n  bool JoinPoints(Join *j, OutRec* outRec1, OutRec* outRec2);\n  void JoinCommonEdges();\n  void DoSimplePolygons();\n  void FixupFirstLefts1(OutRec* OldOutRec, OutRec* NewOutRec);\n  void FixupFirstLefts2(OutRec* InnerOutRec, OutRec* OuterOutRec);\n  void FixupFirstLefts3(OutRec* OldOutRec, OutRec* NewOutRec);\n#ifdef use_xyz\n  void SetZ(IntPoint& pt, TEdge& e1, TEdge& e2);\n#endif\n};\n//------------------------------------------------------------------------------\n\nclass ClipperOffset \n{\npublic:\n  ClipperOffset(double miterLimit = 2.0, double roundPrecision = 0.25);\n  ~ClipperOffset();\n  void AddPath(const Path& path, JoinType joinType, EndType endType);\n  void AddPaths(const Paths& paths, JoinType joinType, EndType endType);\n  void Execute(Paths& solution, double delta);\n  void Execute(PolyTree& solution, double delta);\n  void Clear();\n  double MiterLimit;\n  double ArcTolerance;\nprivate:\n  Paths m_destPolys;\n  Path m_srcPoly;\n  Path m_destPoly;\n  std::vector<DoublePoint> m_normals;\n  double m_delta, m_sinA, m_sin, m_cos;\n  double m_miterLim, m_StepsPerRad;\n  IntPoint 
m_lowest;\n  PolyNode m_polyNodes;\n\n  void FixOrientations();\n  void DoOffset(double delta);\n  void OffsetPoint(int j, int& k, JoinType jointype);\n  void DoSquare(int j, int k);\n  void DoMiter(int j, int k, double r);\n  void DoRound(int j, int k);\n};\n//------------------------------------------------------------------------------\n\nclass clipperException : public std::exception\n{\n  public:\n    clipperException(const char* description): m_descr(description) {}\n    virtual ~clipperException() throw() {}\n    virtual const char* what() const throw() {return m_descr.c_str();}\n  private:\n    std::string m_descr;\n};\n//------------------------------------------------------------------------------\n\n} //ClipperLib namespace\n\n#endif //clipper_hpp\n\n\n"
  },
  {
    "path": "dbnet/common.hpp",
    "content": "#ifndef DBNET_COMMON_H_\n#define DBNET_COMMON_H_\n\n#include <iostream>\n#include <fstream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <opencv2/opencv.hpp>\n#include \"dirent.h\"\n#include \"NvInfer.h\"\n#include <chrono>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\nusing namespace nvinfer1;\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        Weights wt{ DataType::kFLOAT, nullptr, 0 };\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + 
\".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{ DataType::kFLOAT, scval, len };\n\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{ DataType::kFLOAT, shval, len };\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{ DataType::kFLOAT, pval, len };\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* convBnLeaky(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int g, std::string lname, std::string bnname, bool bias = true) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n    int p = ksize / 2;\n    IConvolutionLayer* conv1 = nullptr;\n    if (bias) {\n        conv1 = network->addConvolutionNd(input, outch, DimsHW{ ksize, ksize }, weightMap[lname + \".weight\"], weightMap[lname + \".bias\"]);\n    }\n    else {\n        conv1 = network->addConvolutionNd(input, outch, DimsHW{ ksize, ksize }, weightMap[lname + \".weight\"], emptywts);\n    }\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ s, s });\n    conv1->setPaddingNd(DimsHW{ p, p });\n    conv1->setNbGroups(g);\n    //IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn\", 1e-4);\n    IScaleLayer* bn1 = 
addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname.substr(0, lname.find_last_of(\".\")) + bnname, 1e-5);\n    auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    lr->setAlpha(0.1);\n    return lr;\n}\n\n\nIActivationLayer* basicBlock(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ 3, 3 }, weightMap[lname + \"conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ stride, stride });\n    conv1->setPaddingNd(DimsHW{ 1, 1 });\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"bn1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{ 3, 3 }, weightMap[lname + \"conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setPaddingNd(DimsHW{ 1, 1 });\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    if (inch != outch) {\n        IConvolutionLayer* conv3 = network->addConvolutionNd(input, outch, DimsHW{ 1, 1 }, weightMap[lname + \"downsample.0.weight\"], emptywts);\n        assert(conv3);\n        conv3->setStrideNd(DimsHW{ stride, stride });\n        IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn3->getOutput(0), *bn2->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    else {\n        ew1 = network->addElementWise(input, *bn2->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer* relu2 = network->addActivation(*ew1->getOutput(0), 
ActivationType::kRELU);\n    assert(relu2);\n    return relu2;\n}\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n    closedir(p_dir);\n    return 0;\n}\n\n#endif\n\n"
  },
  {
    "path": "dbnet/dbnet.cpp",
    "content": "#include <iostream>\n#include <chrono>\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include \"common.hpp\"\n#include <math.h>\n#include \"clipper.hpp\"\n\n#define USE_FP16  // comment out this if want to use FP32\n#define DEVICE 0  // GPU id\n#define EXPANDRATIO 1.5\n#define BOX_MINI_SIZE 5\n#define SCORE_THRESHOLD 0.3\n#define BOX_THRESHOLD 0.7\n\nstatic const int SHORT_INPUT = 640;\nstatic const int MAX_INPUT_SIZE = 1440; // 32x\nstatic const int MIN_INPUT_SIZE = 608;\nstatic const int OPT_INPUT_W = 1152;\nstatic const int OPT_INPUT_H = 640;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"out\";\nstatic Logger gLogger;\n\ncv::RotatedRect expandBox(cv::Point2f temp[], float ratio)\n{\n    ClipperLib::Path path = {\n        {ClipperLib::cInt(temp[0].x), ClipperLib::cInt(temp[0].y)},\n        {ClipperLib::cInt(temp[1].x), ClipperLib::cInt(temp[1].y)},\n        {ClipperLib::cInt(temp[2].x), ClipperLib::cInt(temp[2].y)},\n        {ClipperLib::cInt(temp[3].x), ClipperLib::cInt(temp[3].y)}};\n    double area = ClipperLib::Area(path);\n    double distance;\n    double length = 0.0;\n    for (int i = 0; i < 4; i++) {\n        length = length + sqrtf(powf((temp[i].x - temp[(i + 1) % 4].x), 2) +\n                                powf((temp[i].y - temp[(i + 1) % 4].y), 2));\n    }\n\n    distance = area * ratio / length;\n\n    ClipperLib::ClipperOffset offset;\n    offset.AddPath(path, ClipperLib::JoinType::jtRound,\n                   ClipperLib::EndType::etClosedPolygon);\n    ClipperLib::Paths paths;\n    offset.Execute(paths, distance);\n    \n    std::vector<cv::Point> contour;\n    for (int i = 0; i < paths[0].size(); i++) {\n        contour.emplace_back(paths[0][i].X, paths[0][i].Y);\n    }\n    offset.Clear();\n    return cv::minAreaRect(contour);\n}\n\nfloat paddimg(cv::Mat& In_Out_img, int shortsize = 960) {\n    int w = In_Out_img.cols;\n    int h = In_Out_img.rows;\n    float scale = 1.f;\n    if (w < h) 
{\n        scale = (float)shortsize / w;\n        h = scale * h;\n        w = shortsize;\n    }\n    else {\n        scale = (float)shortsize / h;\n        w = scale * w;\n        h = shortsize;\n    }\n\n    if (h % 32 != 0) {\n        h = (h / 32 + 1) * 32;\n    }\n    if (w % 32 != 0) {\n        w = (w / 32 + 1) * 32;\n    }\n\n    cv::resize(In_Out_img, In_Out_img, cv::Size(w, h));\n    return scale;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);\n    INetworkDefinition* network = builder->createNetworkV2(explicitBatch);\n    // Create input tensor of shape {1, 3, H, W} (H and W are dynamic) with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims4{ 1, 3, -1, -1 });\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"./DBNet.wts\");\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n\n    /* ------ Resnet18 backbone------ */\n    // Add convolution layer with 64 outputs and a 7x7 filter (stride 2, padding 3).\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{ 7, 7 }, weightMap[\"backbone.conv1.weight\"], emptywts);   \n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ 2, 2 });\n    conv1->setPaddingNd(DimsHW{ 3, 3 });\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"backbone.bn1\", 1e-5);\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{ 3, 3 });\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{ 2, 2 });\n    pool1->setPaddingNd(DimsHW{ 1, 1 });\n\n    IActivationLayer* relu2 = basicBlock(network, weightMap, *pool1->getOutput(0), 64, 64, 1, 
\"backbone.layer1.0.\");\n    IActivationLayer* relu3 = basicBlock(network, weightMap, *relu2->getOutput(0), 64, 64, 1, \"backbone.layer1.1.\"); // x2\n\n    IActivationLayer* relu4 = basicBlock(network, weightMap, *relu3->getOutput(0), 64, 128, 2, \"backbone.layer2.0.\");\n    IActivationLayer* relu5 = basicBlock(network, weightMap, *relu4->getOutput(0), 128, 128, 1, \"backbone.layer2.1.\"); // x3\n\n    IActivationLayer* relu6 = basicBlock(network, weightMap, *relu5->getOutput(0), 128, 256, 2, \"backbone.layer3.0.\");\n    IActivationLayer* relu7 = basicBlock(network, weightMap, *relu6->getOutput(0), 256, 256, 1, \"backbone.layer3.1.\"); //x4\n\n    IActivationLayer* relu8 = basicBlock(network, weightMap, *relu7->getOutput(0), 256, 512, 2, \"backbone.layer4.0.\");\n    IActivationLayer* relu9 = basicBlock(network, weightMap, *relu8->getOutput(0), 512, 512, 1, \"backbone.layer4.1.\"); //x5\n\n    /* ------- FPN  neck ------- */\n    ILayer* p5 = convBnLeaky(network, weightMap, *relu9->getOutput(0), 64, 1, 1, 1, \"neck.reduce_conv_c5.conv\", \".bn\"); // k=1 s = 1  p = k/2=1/2=0\n    ILayer* c4_1 = convBnLeaky(network, weightMap, *relu7->getOutput(0), 64, 1, 1, 1, \"neck.reduce_conv_c4.conv\", \".bn\");\n\n    float *deval = reinterpret_cast<float*>(malloc(sizeof(float) * 64 * 2 * 2));\n    for (int i = 0; i < 64 * 2 * 2; i++) {\n        deval[i] = 1.0;\n    }\n    Weights deconvwts1{ DataType::kFLOAT, deval, 64 * 2 * 2 };\n    IDeconvolutionLayer* p4_1 = network->addDeconvolutionNd(*p5->getOutput(0), 64, DimsHW{ 2, 2 }, deconvwts1, emptywts);\n    p4_1->setStrideNd(DimsHW{ 2, 2 });\n    p4_1->setNbGroups(64);\n    weightMap[\"deconv1\"] = deconvwts1;\n\n    IElementWiseLayer* p4_add = network->addElementWise(*p4_1->getOutput(0), *c4_1->getOutput(0), ElementWiseOperation::kSUM);\n    ILayer* p4 = convBnLeaky(network, weightMap, *p4_add->getOutput(0), 64, 3, 1, 1, \"neck.smooth_p4.conv\", \".bn\");  // smooth\n    ILayer* c3_1 = convBnLeaky(network, weightMap, 
*relu5->getOutput(0), 64, 1, 1, 1, \"neck.reduce_conv_c3.conv\", \".bn\");\n\n    Weights deconvwts2{ DataType::kFLOAT, deval, 64 * 2 * 2 };\n    IDeconvolutionLayer* p3_1 = network->addDeconvolutionNd(*p4->getOutput(0), 64, DimsHW{ 2, 2 }, deconvwts2, emptywts);\n    p3_1->setStrideNd(DimsHW{ 2, 2 });\n    p3_1->setNbGroups(64);\n\n    IElementWiseLayer* p3_add = network->addElementWise(*p3_1->getOutput(0), *c3_1->getOutput(0), ElementWiseOperation::kSUM);\n    ILayer* p3 = convBnLeaky(network, weightMap, *p3_add->getOutput(0), 64, 3, 1, 1, \"neck.smooth_p3.conv\", \".bn\");  // smooth\n    ILayer* c2_1 = convBnLeaky(network, weightMap, *relu3->getOutput(0), 64, 1, 1, 1, \"neck.reduce_conv_c2.conv\", \".bn\");\n\n    Weights deconvwts3{ DataType::kFLOAT, deval, 64 * 2 * 2 };\n    IDeconvolutionLayer* p2_1 = network->addDeconvolutionNd(*p3->getOutput(0), 64, DimsHW{ 2, 2 }, deconvwts3, emptywts);\n    p2_1->setStrideNd(DimsHW{ 2, 2 });\n    p2_1->setNbGroups(64);\n\n    IElementWiseLayer* p2_add = network->addElementWise(*p2_1->getOutput(0), *c2_1->getOutput(0), ElementWiseOperation::kSUM);\n    ILayer* p2 = convBnLeaky(network, weightMap, *p2_add->getOutput(0), 64, 3, 1, 1, \"neck.smooth_p2.conv\", \".bn\");  // smooth\n\n    Weights deconvwts4{ DataType::kFLOAT, deval, 64 * 2 * 2 };\n    IDeconvolutionLayer* p3_up_p2 = network->addDeconvolutionNd(*p3->getOutput(0), 64, DimsHW{ 2, 2 }, deconvwts4, emptywts);\n    p3_up_p2->setStrideNd(DimsHW{ 2, 2 });\n    p3_up_p2->setNbGroups(64);\n\n    float *deval2 = reinterpret_cast<float*>(malloc(sizeof(float) * 64 * 8 * 8));\n    for (int i = 0; i < 64 * 8 * 8; i++) {\n        deval2[i] = 1.0;\n    }\n    Weights deconvwts5{ DataType::kFLOAT, deval2, 64 * 8 * 8 };\n    IDeconvolutionLayer* p4_up_p2 = network->addDeconvolutionNd(*p4->getOutput(0), 64, DimsHW{ 8, 8 }, deconvwts5, emptywts);\n    p4_up_p2->setPaddingNd(DimsHW{ 2, 2 });\n    p4_up_p2->setStrideNd(DimsHW{ 4, 4 });\n    p4_up_p2->setNbGroups(64);\n    
weightMap[\"deconv2\"] = deconvwts5;\n\n    Weights deconvwts6{ DataType::kFLOAT, deval2, 64 * 8 * 8 };\n    IDeconvolutionLayer* p5_up_p2 = network->addDeconvolutionNd(*p5->getOutput(0), 64, DimsHW{ 8, 8 }, deconvwts6, emptywts);\n    p5_up_p2->setStrideNd(DimsHW{ 8, 8 });\n    p5_up_p2->setNbGroups(64);\n\n    // torch.cat([p2, p3, p4, p5], dim=1)\n    ITensor* inputTensors[] = { p2->getOutput(0), p3_up_p2->getOutput(0), p4_up_p2->getOutput(0), p5_up_p2->getOutput(0) };\n    IConcatenationLayer* neck_cat = network->addConcatenation(inputTensors, 4);\n\n    ILayer* neck_out = convBnLeaky(network, weightMap, *neck_cat->getOutput(0), 256, 3, 1, 1, \"neck.conv.0\", \".1\");  // smooth\n    assert(neck_out);\n    ILayer* binarize1 = convBnLeaky(network, weightMap, *neck_out->getOutput(0), 64, 3, 1, 1, \"head.binarize.0\", \".1\");  //  \n    Weights deconvwts7{ DataType::kFLOAT, deval, 64 * 2 * 2 };\n    IDeconvolutionLayer* binarizeup = network->addDeconvolutionNd(*binarize1->getOutput(0), 64, DimsHW{ 2, 2 }, deconvwts7, emptywts);\n    binarizeup->setStrideNd(DimsHW{ 2, 2 });\n    binarizeup->setNbGroups(64);\n    IScaleLayer* binarizebn1 = addBatchNorm2d(network, weightMap, *binarizeup->getOutput(0), \"head.binarize.4\", 1e-5);\n    IActivationLayer* binarizerelu1 = network->addActivation(*binarizebn1->getOutput(0), ActivationType::kRELU);\n    assert(binarizerelu1);\n\n    Weights deconvwts8{ DataType::kFLOAT, deval, 64 * 2 * 2 };\n    IDeconvolutionLayer* binarizeup2 = network->addDeconvolutionNd(*binarizerelu1->getOutput(0), 64, DimsHW{ 2, 2 }, deconvwts8, emptywts);\n    binarizeup2->setStrideNd(DimsHW{ 2, 2 });\n    binarizeup2->setNbGroups(64);\n\n    IConvolutionLayer* binarize3 = network->addConvolutionNd(*binarizeup2->getOutput(0), 1, DimsHW{ 3, 3 }, weightMap[\"head.binarize.7.weight\"], weightMap[\"head.binarize.7.bias\"]);\n    assert(binarize3);\n    binarize3->setStrideNd(DimsHW{ 1, 1 });\n    binarize3->setPaddingNd(DimsHW{ 1, 1 });\n    
IActivationLayer* binarize4 = network->addActivation(*binarize3->getOutput(0), ActivationType::kSIGMOID);\n    assert(binarize4);\n\n    //threshold_maps = self.thresh(x)\n    ILayer* thresh1 = convBnLeaky(network, weightMap, *neck_out->getOutput(0), 64, 3, 1, 1, \"head.thresh.0\", \".1\", false);  //  \n    Weights deconvwts9{ DataType::kFLOAT, deval, 64 * 2 * 2 };\n    IDeconvolutionLayer* threshup = network->addDeconvolutionNd(*thresh1->getOutput(0), 64, DimsHW{ 2, 2 }, deconvwts9, emptywts);\n    threshup->setStrideNd(DimsHW{ 2, 2 });\n    threshup->setNbGroups(64);\n    IConvolutionLayer* thresh2 = network->addConvolutionNd(*threshup->getOutput(0), 64, DimsHW{ 3, 3 }, weightMap[\"head.thresh.3.1.weight\"], weightMap[\"head.thresh.3.1.bias\"]);\n    assert(thresh2);\n    thresh2->setStrideNd(DimsHW{ 1, 1 });\n    thresh2->setPaddingNd(DimsHW{ 1, 1 });\n\n    IScaleLayer* threshbn1 = addBatchNorm2d(network, weightMap, *thresh2->getOutput(0), \"head.thresh.4\", 1e-5);\n    IActivationLayer* threshrelu1 = network->addActivation(*threshbn1->getOutput(0), ActivationType::kRELU);\n    assert(threshrelu1);\n\n    Weights deconvwts10{ DataType::kFLOAT, deval, 64 * 2 * 2 };\n    IDeconvolutionLayer* threshup2 = network->addDeconvolutionNd(*threshrelu1->getOutput(0), 64, DimsHW{ 2, 2 }, deconvwts10, emptywts);\n    threshup2->setStrideNd(DimsHW{ 2, 2 });\n    threshup2->setNbGroups(64);\n    IConvolutionLayer* thresh3 = network->addConvolutionNd(*threshup2->getOutput(0), 1, DimsHW{ 3, 3 }, weightMap[\"head.thresh.6.1.weight\"], weightMap[\"head.thresh.6.1.bias\"]);\n    assert(thresh3);\n    thresh3->setStrideNd(DimsHW{ 1, 1 });\n    thresh3->setPaddingNd(DimsHW{ 1, 1 });\n    IActivationLayer* thresh4 = network->addActivation(*thresh3->getOutput(0), ActivationType::kSIGMOID);\n    assert(thresh4);\n\n    ITensor* inputTensors2[] = { binarize4->getOutput(0), thresh4->getOutput(0) };\n    IConcatenationLayer* head_out = network->addConcatenation(inputTensors2, 2);\n\n    
// y = F.interpolate(y, size=(H, W)) \n    head_out->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*head_out->getOutput(0));\n\n    IOptimizationProfile* profile = builder->createOptimizationProfile();\n    profile->setDimensions(INPUT_BLOB_NAME, OptProfileSelector::kMIN, Dims4(1, 3, MIN_INPUT_SIZE, MIN_INPUT_SIZE));\n    profile->setDimensions(INPUT_BLOB_NAME, OptProfileSelector::kOPT, Dims4(1, 3, OPT_INPUT_H, OPT_INPUT_W));\n    profile->setDimensions(INPUT_BLOB_NAME, OptProfileSelector::kMAX, Dims4(1, 3, MAX_INPUT_SIZE, MAX_INPUT_SIZE));\n    config->addOptimizationProfile(profile);\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#ifdef USE_FP16\n    config->setFlag(BuilderFlag::kFP16);\n#endif\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    //ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int 
h_scale, int w_scale) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n    context.setBindingDimensions(inputIndex, Dims4(1, 3, h_scale, w_scale));\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], 3 * h_scale * w_scale * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], 2 * h_scale * w_scale * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, 3 * h_scale * w_scale * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueueV2(buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], h_scale * w_scale * 2 * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nbool get_mini_boxes(cv::RotatedRect& rotated_rect, cv::Point2f rect[],\n                    int min_size)\n{\n\n    cv::Point2f temp_rect[4];\n    rotated_rect.points(temp_rect);\n    for (int i = 0; i < 4; i++) {\n        for (int j = i + 1; j < 4; j++) {\n            if (temp_rect[i].x > temp_rect[j].x) {\n                cv::Point2f temp;\n                temp = temp_rect[i];\n                
temp_rect[i] = temp_rect[j];\n                temp_rect[j] = temp;\n            }\n        }\n    }\n    int index0 = 0;\n    int index1 = 1;\n    int index2 = 2;\n    int index3 = 3;\n    if (temp_rect[1].y > temp_rect[0].y) {\n        index0 = 0;\n        index3 = 1;\n    } else {\n        index0 = 1;\n        index3 = 0;\n    }\n    if (temp_rect[3].y > temp_rect[2].y) {\n        index1 = 2;\n        index2 = 3;\n    } else {\n        index1 = 3;\n        index2 = 2;\n    }   \n\n    // Points are returned clockwise starting from the top-left corner (image coordinates, y grows downward)\n    rect[0] = temp_rect[index0];  // Left top coordinate\n    rect[1] = temp_rect[index1];  // Right top coordinate\n    rect[2] = temp_rect[index2];  // Right bottom coordinate\n    rect[3] = temp_rect[index3];  // Left bottom coordinate\n\n    if (rotated_rect.size.width < min_size ||\n        rotated_rect.size.height < min_size) {\n        return false;\n    } else {\n        return true;\n    }\n}\n\nfloat get_box_score(float* map, cv::Point2f rect[], int width, int height,\n                    float threshold)\n{\n\n    int xmin = width - 1;\n    int ymin = height - 1;\n    int xmax = 0;\n    int ymax = 0;\n\n    for (int j = 0; j < 4; j++) {\n        if (rect[j].x < xmin) {\n            xmin = rect[j].x;\n        }\n        if (rect[j].y < ymin) {\n            ymin = rect[j].y;\n        }\n        if (rect[j].x > xmax) {\n            xmax = rect[j].x;\n        }\n        if (rect[j].y > ymax) {\n            ymax = rect[j].y;\n        }\n    }\n    float sum = 0;\n    int num = 0;\n    for (int i = ymin; i <= ymax; i++) {\n        for (int j = xmin; j <= xmax; j++) {\n            if (map[i * width + j] > threshold) {\n                sum = sum + map[i * width + j];\n                num++;\n            }\n        }\n    }\n\n    // No pixel exceeded the threshold: return 0 instead of dividing by zero\n    if (num == 0) {\n        return 0.f;\n    }\n    return sum / num;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{ nullptr };\n    size_t size{ 0 };\n\n    if (argc == 2 && 
std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{ nullptr };\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(\"DBNet.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    }\n    else if (argc == 3 && std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"DBNet.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    }\n    else {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./debnet -s  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./debnet -d ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // prepare input data ---------------------------\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(argv[2], file_names) < 0) {\n        std::cout << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // icdar2015.yaml Hyperparameter\n    std::vector<float> mean_value{ 0.406, 0.456, 0.485 };  // BGR\n    std::vector<float> std_value{ 0.225, 0.224, 0.229 };\n\n    int fcount = 
0;\n\n    for (auto f : file_names) {\n        fcount++;\n        std::cout << fcount << \"  \" << f << std::endl;\n        cv::Mat pr_img = cv::imread(std::string(argv[2]) + \"/\" + f);\n        cv::Mat src_img = pr_img.clone();\n        if (pr_img.empty()) continue;\n        float scale = paddimg(pr_img, SHORT_INPUT); // resize the image\n        std::cout << \"letterbox shape: \" << pr_img.cols << \", \" << pr_img.rows << std::endl;\n        if (pr_img.cols < MIN_INPUT_SIZE || pr_img.rows < MIN_INPUT_SIZE) continue;\n        float* data = new float[3 * pr_img.rows * pr_img.cols];\n\n        auto start = std::chrono::system_clock::now();\n        int i = 0;\n        for (int row = 0; row < pr_img.rows; ++row) {\n            uchar* uc_pixel = pr_img.data + row * pr_img.step;\n            for (int col = 0; col < pr_img.cols; ++col) {\n                data[i] = (uc_pixel[2] / 255.0 - mean_value[2]) / std_value[2];\n                data[i + pr_img.rows * pr_img.cols] = (uc_pixel[1] / 255.0 - mean_value[1]) / std_value[1];\n                data[i + 2 * pr_img.rows * pr_img.cols] = (uc_pixel[0] / 255.0 - mean_value[0]) / std_value[0];\n                uc_pixel += 3;\n                ++i;\n            }\n        }\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"pre time:\"<< std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n        float* prob = new float[pr_img.rows *pr_img.cols * 2];\n        // Run inference\n        start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, pr_img.rows, pr_img.cols);\n        end = std::chrono::system_clock::now();\n        std::cout << \"detect time:\"<< std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n        // prob shape is 2*640*640, get the first one\n        cv::Mat map = cv::Mat::zeros(cv::Size(pr_img.cols, pr_img.rows), CV_8UC1);\n        for (int h = 0; h < 
pr_img.rows; ++h) {\n            uchar *ptr = map.ptr(h);\n            for (int w = 0; w < pr_img.cols; ++w) {\n                ptr[w] = (prob[h * pr_img.cols + w] > 0.3) ? 255 : 0;\n            }\n        }\n\n        // Extracting minimum circumscribed rectangle\n        std::vector<std::vector<cv::Point>> contours;\n        std::vector<cv::Vec4i> hierarcy;\n        cv::findContours(map, contours, hierarcy, CV_RETR_LIST, CV_CHAIN_APPROX_SIMPLE);\n\n        std::vector<cv::Rect> boundRect(contours.size());\n        std::vector<cv::RotatedRect> box(contours.size());\n        cv::Point2f rect[4];\n        cv::Point2f order_rect[4];\n\n        for (int i = 0; i < contours.size(); i++) {\n            cv::RotatedRect rotated_rect = cv::minAreaRect(cv::Mat(contours[i]));\n            if (!get_mini_boxes(rotated_rect, rect, BOX_MINI_SIZE)) {\n                std::cout << \"box too small\" <<  std::endl;\n                continue;\n            }\n\n            // drop low score boxes\n            float score = get_box_score(prob, rect, pr_img.cols, pr_img.rows,\n                                        SCORE_THRESHOLD);\n            if (score < BOX_THRESHOLD) {\n                std::cout << \"score too low =  \" << score << \", threshold = \" << BOX_THRESHOLD <<  std::endl;\n                continue;\n            }\n\n            // Scaling the predict boxes depend on EXPANDRATIO\n            cv::RotatedRect expandbox = expandBox(rect, EXPANDRATIO);\n            expandbox.points(rect);\n            if (!get_mini_boxes(expandbox, rect, BOX_MINI_SIZE + 2)) {  \n                continue;\n            }\n\n            // Restore the coordinates to the original image\n            for (int k = 0; k < 4; k++) {\n                order_rect[k] = rect[k];\n                order_rect[k].x = int(order_rect[k].x / pr_img.cols * src_img.cols);\n                order_rect[k].y = int(order_rect[k].y / pr_img.rows * src_img.rows);\n            }\n            \n            
cv::rectangle(src_img, cv::Point(order_rect[0].x,order_rect[0].y), cv::Point(order_rect[2].x,order_rect[2].y), cv::Scalar(0, 0, 255), 2, 8);\n            //std::cout << \"After LT =  \" << order_rect[0] << \", After RD = \" << order_rect[2] <<  std::endl;            \n        }\n\n        cv::imwrite(\"_\" + f, src_img);\n        std::cout << \"write image done.\" << std::endl;\n        //cv::waitKey(0);\n\n        // Both buffers were allocated with new[], so release them with delete[]\n        delete[] prob;\n        delete[] data;\n    }\n\n    return 0;\n}"
  },
  {
    "path": "dbnet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "dbnet/utils.h",
    "content": "#ifndef __TRT_UTILS_H_\n#define __TRT_UTILS_H_\n\n#include <iostream>\n#include <vector>\n#include <algorithm>\n#include <cudnn.h>\n\n#ifndef CUDA_CHECK\n\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n\n#endif\n\nnamespace Tn\n{\n    class Profiler : public nvinfer1::IProfiler\n    {\n    public:\n        void printLayerTimes(int itrationsTimes)\n        {\n            float totalTime = 0;\n            for (size_t i = 0; i < mProfile.size(); i++)\n            {\n                printf(\"%-40.40s %4.3fms\\n\", mProfile[i].first.c_str(), mProfile[i].second / itrationsTimes);\n                totalTime += mProfile[i].second;\n            }\n            printf(\"Time over all layers: %4.3f\\n\", totalTime / itrationsTimes);\n        }\n    private:\n        typedef std::pair<std::string, float> Record;\n        std::vector<Record> mProfile;\n\n        virtual void reportLayerTime(const char* layerName, float ms)\n        {\n            auto record = std::find_if(mProfile.begin(), mProfile.end(), [&](const Record& r){ return r.first == layerName; });\n            if (record == mProfile.end())\n                mProfile.push_back(std::make_pair(layerName, ms));\n            else\n                record->second += ms;\n        }\n    };\n\n    //Logger for TensorRT info/warning/errors\n    class Logger : 
public nvinfer1::ILogger\n    {\n    public:\n\n        Logger(): Logger(Severity::kWARNING) {}\n\n        Logger(Severity severity): reportableSeverity(severity) {}\n\n        void log(Severity severity, const char* msg) override\n        {\n            // suppress messages with severity enum value greater than the reportable\n            if (severity > reportableSeverity) return;\n\n            switch (severity)\n            {\n                case Severity::kINTERNAL_ERROR: std::cerr << \"INTERNAL_ERROR: \"; break;\n                case Severity::kERROR: std::cerr << \"ERROR: \"; break;\n                case Severity::kWARNING: std::cerr << \"WARNING: \"; break;\n                case Severity::kINFO: std::cerr << \"INFO: \"; break;\n                default: std::cerr << \"UNKNOWN: \"; break;\n            }\n            std::cerr << msg << std::endl;\n        }\n\n        Severity reportableSeverity{Severity::kWARNING};\n    };\n\n    template<typename T> \n    void write(char*& buffer, const T& val)\n    {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> \n    void read(const char*& buffer, T& val)\n    {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n}\n\n#endif"
  },
  {
    "path": "densenet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\n# set the project name\nproject(densenet)\n\nadd_definitions(-std=c++11)\n\n# get main project dir to include common files\nget_filename_component(MAIN_DIR ../ ABSOLUTE)\n\n# When enabled the static version of the \n# CUDA runtime library will be used in CUDA_LIBRARIES\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\n\n# specify the C++ standard\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_CXX_STANDARD_REQUIRED True)\nset(CMAKE_BUILD_TYPE Debug)\n\n# include\n\n# include and link cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n\n# include and link tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu)\nlink_directories(/usr/lib/x86_64-linux-gnu)\n\n# add the executable\nadd_executable(densenet ${PROJECT_SOURCE_DIR}/densenet121.cpp)\n\ntarget_link_libraries(densenet nvinfer)\ntarget_link_libraries(densenet cudart)\n\nadd_definitions(-O2 -pthread)"
  },
  {
    "path": "densenet/README.md",
    "content": "# Densenet121\n\nThe Pytorch implementation is [makaveli10/densenet](https://github.com/makaveli10/torchtrtz/tree/main/densenet). Model from torchvision.\nThe tensorrt implemenation is taken from [makaveli10/cpptensorrtz](https://github.com/makaveli10/cpptensorrtz/).\n\n## How to Run\n\n1. generate densenet121.wts from pytorch\n\n```\ngit clone https://github.com/wang-xinyu/tensorrtx.git\ngit clone https://github.com/makaveli10/torchtrtz.git\n\n// go to torchtrtz/densenet\n// Enter these two commands to create densenet121.wts\npython models.py\npython gen_trtwts.py\n```\n\n2. build densenet and run\n\n```\n// put densenet121.wts into tensorrtx/densenet\n// go to tensorrtx/densenet\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./densenet -s  // serialize model to file i.e. 'densenet.engine'\nsudo ./densenet -d  // deserialize model and run inference\n```\n\n3. Verify output from [torch impl](https://github.com/makaveli10/torchtrtz/blob/main/densenet/README.md)\n\nTensorRT output[:5]:\n```\n    [-0.587389, -0.329202, -1.83404, -1.89935, -0.928404]\n```\n\n"
  },
  {
    "path": "densenet/densenet121.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <cmath>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* 
addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    std::cout << \"len \" << len << std::endl;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIConvolutionLayer* addDenseLayer(INetworkDefinition* network, ITensor* input, std::map<std::string, Weights>& weightMap, std::string lname, float eps)\n{\n    // add Batchnorm\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *input, lname + \".norm1\", eps);\n\n    // add relu\n    IActivationLayer* relu1 = network -> addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    // add conv\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network -> 
addConvolutionNd(*relu1->getOutput(0), 128, DimsHW{1, 1}, weightMap[lname + \".conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1 -> setStrideNd(DimsHW{1, 1});\n\n    // add Batchnorm\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv1 -> getOutput(0), lname + \".norm2\", eps);\n\n    // add relu\n    IActivationLayer* relu2 = network -> addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    // add conv\n    IConvolutionLayer* conv2 = network -> addConvolutionNd(*relu2->getOutput(0), 32, DimsHW{3, 3}, weightMap[lname + \".conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2 -> setStrideNd(DimsHW{1, 1});\n    conv2 -> setPaddingNd(DimsHW{1, 1});\n    return conv2;\n}\n\n\nIPoolingLayer* addTransition(INetworkDefinition* network, ITensor& input, std::map<std::string, Weights>& weightMap, int outch, std::string lname, float eps)\n{\n    // add batch norm\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap,input, lname + \".norm\", eps);\n\n    // add relu activation\n    IActivationLayer* relu1 = network -> addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    // add convolution layer\n    // empty weights for no bias\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network -> addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{1, 1}, weightMap[lname + \".conv.weight\"], emptywts);\n    assert(conv1);\n    conv1 -> setStrideNd(DimsHW{1, 1});\n\n    // add pooling\n    IPoolingLayer* pool1 = network->addPoolingNd(*conv1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});\n    assert(pool1);\n    pool1 -> setStrideNd(DimsHW{2, 2});\n    pool1 -> setPaddingNd(DimsHW{0,0});\n    return pool1;\n}\n\n\nIConcatenationLayer* addDenseBlock(INetworkDefinition* network, ITensor* input, std::map<std::string, Weights>& weightMap, int numDenseLayers, std::string lname, float eps)\n{\n    IConvolutionLayer* c{nullptr};\n    IConcatenationLayer* 
concat{nullptr};\n    ITensor* inputTensors[numDenseLayers+1];\n    inputTensors[0] = input;\n\n    c = addDenseLayer(network, input, weightMap, lname + \".denselayer\" + std::to_string(1), eps);\n    int i;\n    for(i=1; i<numDenseLayers; i++)\n    {\n        // inch += 32;\n        inputTensors[i] = c -> getOutput(0);\n        concat = network -> addConcatenation(inputTensors, i+1);\n        assert(concat);\n        c = addDenseLayer(network, concat->getOutput(0), weightMap, lname + \".denselayer\" + std::to_string(i+1), eps);\n    }\n    inputTensors[numDenseLayers] = c -> getOutput(0);\n    concat = network -> addConcatenation(inputTensors, numDenseLayers+1);\n    assert(concat);\n    return concat;\n}\n\n\n/**\n * Uses the TensorRT API to create the network engine.  \n**/\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)\n{\n    // Initialize NetworkDefinition\n    INetworkDefinition* network = builder -> createNetworkV2(0U);\n\n    auto data = network -> addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../densenet121.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    auto conv0 = network -> addConvolutionNd(*data, 64, DimsHW{7, 7}, weightMap[\"features.conv0.weight\"], emptywts);\n    assert(conv0);\n    conv0 -> setStrideNd(DimsHW{2, 2});\n    conv0 -> setPaddingNd(DimsHW{3, 3});\n\n    auto norm0 = addBatchNorm2d(network, weightMap, *conv0 -> getOutput(0), \"features.norm0\", 1e-5);\n\n    auto relu0 = network -> addActivation(*norm0 -> getOutput(0), ActivationType::kRELU);\n    assert(relu0);\n\n    auto pool0 = network -> addPoolingNd(*relu0 -> getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool0);\n    pool0 -> setStrideNd(DimsHW{2, 2});\n    pool0 -> setPaddingNd(DimsHW{1, 1});\n    \n    auto dense1 = addDenseBlock(network, pool0 -> getOutput(0), weightMap, 6, 
\"features.denseblock1\", 1e-5);\n    auto transition1 = addTransition(network, *dense1 -> getOutput(0), weightMap, 128, \"features.transition1\", 1e-5);\n\n    auto dense2 = addDenseBlock(network, transition1 -> getOutput(0), weightMap, 12, \"features.denseblock2\", 1e-5);\n    auto transition2 = addTransition(network, *dense2 -> getOutput(0), weightMap, 256, \"features.transition2\", 1e-5);\n\n    auto dense3 = addDenseBlock(network, transition2 -> getOutput(0), weightMap, 24, \"features.denseblock3\", 1e-5);\n    auto transition3 = addTransition(network, *dense3 -> getOutput(0), weightMap, 512, \"features.transition3\", 1e-5);\n\n    auto dense4 = addDenseBlock(network, transition3 -> getOutput(0), weightMap, 16, \"features.denseblock4\", 1e-5);\n\n    auto bn5 = addBatchNorm2d(network, weightMap, *dense4 -> getOutput(0), \"features.norm5\", 1e-5);\n    auto relu5 = network -> addActivation(*bn5 -> getOutput(0), ActivationType::kRELU);\n\n    // adaptive average pool => pytorch (F.adaptive_avg_pool2d(input, (1, 1)))\n    auto pool5 = network -> addPoolingNd(*relu5 -> getOutput(0), PoolingType::kAVERAGE, DimsHW{7,7});\n\n    auto fc1 = network -> addFullyConnected(*pool5 -> getOutput(0), 1000, weightMap[\"classifier.weight\"], weightMap[\"classifier.bias\"]);\n    assert(fc1);\n\n    // set ouput blob name\n    fc1 -> getOutput(0) -> setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n\n    // mark the output\n    network -> markOutput(*fc1 -> getOutput(0));\n\n    // set batchsize and workspace size\n    builder -> setMaxBatchSize(maxBatchSize);\n    config -> setMaxWorkspaceSize(1 << 28); // 256 MiB\n\n    // build engine\n    ICudaEngine* engine = builder -> buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n    \n    // destroy\n    network -> destroy();\n\n    // fere host mem\n    for(auto& mem: weightMap)\n    {\n        free((void*)(mem.second.values));\n    }\n\n    return 
engine;\n}\n\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)\n{\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\n/**\n * Performs inference on the given input and \n * writes the output from device to host memory.\n**/\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize)\n{\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    
context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    CHECK(cudaStreamSynchronize(stream));\n\n    // Release stream and buffers\n    CHECK(cudaStreamDestroy(stream));\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./densenet -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./densenet -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"densenet.engine\", std::ios::binary);\n        if (!p)\n        {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        // serialization succeeded\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"densenet.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        } else {\n            // without the plan file there is nothing to deserialize\n            std::cerr << \"could not open plan file densenet.engine\" << std::endl;\n            return -1;\n        }\n    } else {\n        return -1;\n    }\n\n\n    // Fill the input with ones as dummy data (no real image is loaded)\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n\n    
IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 100; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print histogram of the output distribution\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\n    {\n        std::cout << prob[i] << \", \";\n        if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "densenet/densenet121.py",
    "content": "import os\nimport sys\nimport struct\nimport argparse\n\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\n\nBATCH_SIZE = 1\nINPUT_H = 224\nINPUT_W = 224\nOUTPUT_SIZE = 1000\nINPUT_BLOB_NAME = \"data\"\nOUTPUT_BLOB_NAME = \"prob\"\nEPS = 1e-5\n\nWEIGHT_PATH = \"./densenet121.wts\"\nENGINE_PATH = \"./densenet121.engine\"\n\nTRT_LOGGER = trt.Logger(trt.Logger.INFO)\n\n\ndef load_weights(file):\n    print(f\"Loading weights: {file}\")\n\n    assert os.path.exists(file), 'Unable to load weight file.'\n\n    weight_map = {}\n    with open(file, \"r\") as f:\n        lines = [line.strip() for line in f]\n    count = int(lines[0])\n    assert count == len(lines) - 1\n    for i in range(1, count + 1):\n        splits = lines[i].split(\" \")\n        name = splits[0]\n        cur_count = int(splits[1])\n        assert cur_count + 2 == len(splits)\n        values = []\n        for j in range(2, len(splits)):\n            # hex string to bytes to float\n            values.append(struct.unpack(\">f\", bytes.fromhex(splits[j])))\n        weight_map[name] = np.array(values, dtype=np.float32)\n\n    return weight_map\n\n\ndef add_batch_norm_2d(network, weight_map, input, layer_name):\n    gamma = weight_map[layer_name + \".weight\"]\n    beta = weight_map[layer_name + \".bias\"]\n    mean = weight_map[layer_name + \".running_mean\"]\n    var = weight_map[layer_name + \".running_var\"]\n    var = np.sqrt(var + EPS)\n\n    scale = gamma / var\n    shift = -mean / var * gamma + beta\n    return network.add_scale(input=input,\n                             mode=trt.ScaleMode.CHANNEL,\n                             shift=shift,\n                             scale=scale)\n\n\ndef add_dense_layer(network, input, weight_map, lname):\n    bn1 = add_batch_norm_2d(network, weight_map, input, lname + \".norm1\")\n\n    relu1 = network.add_activation(bn1.get_output(0), type=trt.ActivationType.RELU)\n    assert relu1\n\n    conv1 
= network.add_convolution(input=relu1.get_output(0),\n                                    num_output_maps=128,\n                                    kernel_shape=(1, 1),\n                                    kernel=weight_map[lname + \".conv1.weight\"],\n                                    bias=trt.Weights())\n    assert conv1\n    conv1.stride = (1, 1)\n\n    bn2 = add_batch_norm_2d(network, weight_map, conv1.get_output(0), lname + \".norm2\")\n\n    relu2 = network.add_activation(bn2.get_output(0), type=trt.ActivationType.RELU)\n    assert relu2\n\n    conv2 = network.add_convolution(input=relu2.get_output(0),\n                                    num_output_maps=32,\n                                    kernel_shape=(3, 3),\n                                    kernel=weight_map[lname + \".conv2.weight\"],\n                                    bias=trt.Weights())\n    assert conv2\n    conv2.stride = (1, 1)\n    conv2.padding = (1, 1)\n\n    return conv2\n\n\ndef add_transition(network, input, weight_map, outch, lname):\n    bn1 = add_batch_norm_2d(network, weight_map, input, lname + \".norm\")\n\n    relu1 = network.add_activation(bn1.get_output(0), type=trt.ActivationType.RELU)\n    assert relu1\n\n    conv1 = network.add_convolution(input=relu1.get_output(0),\n                                    num_output_maps=outch,\n                                    kernel_shape=(1, 1),\n                                    kernel=weight_map[lname + \".conv.weight\"],\n                                    bias=trt.Weights())\n    assert conv1\n    conv1.stride = (1, 1)\n\n    pool1 = network.add_pooling(input=conv1.get_output(0),\n                                type=trt.PoolingType.AVERAGE,\n                                window_size=trt.DimsHW(2, 2))\n    assert pool1\n    pool1.stride_nd = (2, 2)\n    pool1.padding_nd = (0, 0)\n\n    return pool1\n\n\ndef add_dense_block(network, input, weight_map, num_dense_layers, lname):\n    input_tensors = [None for _ in 
range(num_dense_layers+1)]\n    input_tensors[0] = input\n    c = add_dense_layer(network, input, weight_map, lname + \".denselayer\" + str(1))\n    for i in range(1, num_dense_layers):\n        input_tensors[i] = c.get_output(0)\n        concat = network.add_concatenation(input_tensors[:i+1])\n        assert concat\n        c = add_dense_layer(network, concat.get_output(0), weight_map, lname + \".denselayer\" + str(i+1))\n\n    input_tensors[num_dense_layers] = c.get_output(0)\n    concat = network.add_concatenation(input_tensors)\n    assert concat\n\n    return concat\n\n\ndef create_engine(max_batch_size, builder, config, dt):\n    weight_map = load_weights(WEIGHT_PATH)\n    network = builder.create_network()\n\n    data = network.add_input(INPUT_BLOB_NAME, dt, (3, INPUT_H, INPUT_W))\n    assert data\n\n    conv0 = network.add_convolution(input=data,\n                                    num_output_maps=64,\n                                    kernel_shape=(7, 7),\n                                    kernel=weight_map[\"features.conv0.weight\"],\n                                    bias=trt.Weights())\n    assert conv0\n    conv0.stride = (2, 2)\n    conv0.padding = (3, 3)\n\n    bn0 = add_batch_norm_2d(network, weight_map, conv0.get_output(0), \"features.norm0\")\n\n    relu0 = network.add_activation(bn0.get_output(0), type=trt.ActivationType.RELU)\n    assert relu0\n\n    pool0 = network.add_pooling(input=relu0.get_output(0),\n                                type=trt.PoolingType.MAX,\n                                window_size=trt.DimsHW(3, 3))\n    assert pool0\n    pool0.stride_nd = (2, 2)\n    pool0.padding_nd = (1, 1)\n\n    dense1 = add_dense_block(network, pool0.get_output(0), weight_map, 6, \"features.denseblock1\")\n    transition1 = add_transition(network, dense1.get_output(0), weight_map, 128, \"features.transition1\")\n\n    dense2 = add_dense_block(network, transition1.get_output(0), weight_map, 12, \"features.denseblock2\")\n    transition2 = 
add_transition(network, dense2.get_output(0), weight_map, 256, \"features.transition2\")\n\n    dense3 = add_dense_block(network, transition2.get_output(0), weight_map, 24, \"features.denseblock3\")\n    transition3 = add_transition(network, dense3.get_output(0), weight_map, 512, \"features.transition3\")\n\n    dense4 = add_dense_block(network, transition3.get_output(0), weight_map, 16, \"features.denseblock4\")\n\n    bn5 = add_batch_norm_2d(network, weight_map, dense4.get_output(0), \"features.norm5\")\n    relu5 = network.add_activation(bn5.get_output(0), type=trt.ActivationType.RELU)\n\n    pool5 = network.add_pooling(relu5.get_output(0), type=trt.PoolingType.AVERAGE, window_size=trt.DimsHW(7, 7))\n\n    fc1 = network.add_fully_connected(input=pool5.get_output(0),\n                                      num_outputs=OUTPUT_SIZE,\n                                      kernel=weight_map[\"classifier.weight\"],\n                                      bias=weight_map[\"classifier.bias\"])\n    assert fc1\n\n    fc1.get_output(0).name = OUTPUT_BLOB_NAME\n    network.mark_output(fc1.get_output(0))\n\n    # Build Engine\n    builder.max_batch_size = max_batch_size\n    builder.max_workspace_size = 1 << 20\n    engine = builder.build_engine(network, config)\n\n    del network\n    del weight_map\n\n    return engine\n\n\ndef API_to_model(max_batch_size):\n    builder = trt.Builder(TRT_LOGGER)\n    config = builder.create_builder_config()\n    engine = create_engine(max_batch_size, builder, config, trt.float32)\n    assert engine\n    with open(ENGINE_PATH, \"wb\") as f:\n        f.write(engine.serialize())\n\n    del engine\n    del builder\n    del config\n\n\nclass HostDeviceMem(object):\n    def __init__(self, host_mem, device_mem):\n        self.host = host_mem\n        self.device = device_mem\n\n    def __str__(self):\n        return \"Host:\\n\" + str(self.host) + \"\\nDevice:\\n\" + str(self.device)\n\n    def __repr__(self):\n        return 
self.__str__()\n\n\ndef allocate_buffers(engine):\n    inputs = []\n    outputs = []\n    bindings = []\n    stream = cuda.Stream()\n    for binding in engine:\n        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n        dtype = trt.nptype(engine.get_binding_dtype(binding))\n        # Allocate host and device buffers\n        host_mem = cuda.pagelocked_empty(size, dtype)\n        device_mem = cuda.mem_alloc(host_mem.nbytes)\n        # Append the device buffer to device bindings.\n        bindings.append(int(device_mem))\n        # Append to the appropriate list.\n        if engine.binding_is_input(binding):\n            inputs.append(HostDeviceMem(host_mem, device_mem))\n        else:\n            outputs.append(HostDeviceMem(host_mem, device_mem))\n    return inputs, outputs, bindings, stream\n\n\ndef do_inference(context, bindings, inputs, outputs, stream, batch_size=1):\n    # Transfer input data to the GPU.\n    for inp in inputs:\n        cuda.memcpy_htod_async(inp.device, inp.host, stream)\n    # Run inference.\n    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)\n    # Transfer predictions back from the GPU.\n    for out in outputs:\n        cuda.memcpy_dtoh_async(out.host, out.device, stream)\n    # Synchronize the stream\n    stream.synchronize()\n    # Return only the host outputs.\n    return [out.host for out in outputs]\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"-s\", action='store_true')\n    parser.add_argument(\"-d\", action='store_true')\n    args = parser.parse_args()\n\n    if not (args.s ^ args.d):\n        print(\n            \"arguments not right!\\n\"\n            \"python densenet121.py -s   # serialize model to plan file\\n\"\n            \"python densenet121.py -d   # deserialize plan file and run inference\"\n        )\n        sys.exit(1)\n\n    if args.s:\n        API_to_model(BATCH_SIZE)\n    else:\n        runtime = 
trt.Runtime(TRT_LOGGER)\n        assert runtime\n\n        with open(ENGINE_PATH, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        assert engine\n\n        context = engine.create_execution_context()\n        assert context\n\n        data = np.ones((BATCH_SIZE * 3 * INPUT_H * INPUT_W), dtype=np.float32)\n        inputs, outputs, bindings, stream = allocate_buffers(engine)\n        # Copy into the page-locked host buffer rather than rebinding it to a pageable array,\n        # so that the asynchronous host-to-device copy reads pinned memory.\n        np.copyto(inputs[0].host, data)\n\n        trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream,\n                                   batch_size=BATCH_SIZE)\n\n        print(f'Output: \\n{trt_outputs[0][:10]}\\n{trt_outputs[0][-10:]}')\n"
  },
  {
    "path": "densenet/logging.h",
    "content": "/*\n * Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n        , mPrefix(other.mPrefix)\n        , mShouldLog(other.mShouldLog)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer 
consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n            {\n                ss << \" \";\n            }\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//!         
(\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H"
  },
  {
    "path": "detr/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(detr)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/data/app/TensorRT-8.4.3.1/include)\nlink_directories(/data/app/TensorRT-8.4.3.1/lib)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(detr ${PROJECT_SOURCE_DIR}/detr.cpp)\ntarget_link_libraries(detr nvinfer)\ntarget_link_libraries(detr cudart)\ntarget_link_libraries(detr ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "detr/README.md",
    "content": "# DETR\n\nThe Pytorch implementation is [facebookresearch/detr](https://github.com/facebookresearch/detr).\n\nFor details see [End-to-End Object Detection with Transformers](https://ai.facebook.com/research/publications/end-to-end-object-detection-with-transformers).\n\n## Test Environment\n\n- GTX2080Ti / Ubuntu16.04 / cuda10.2 / cudnn8.0.4 / TensorRT7.2.1 / OpenCV4.2\n- GTX2080Ti / win10 / cuda10.2 / cudnn8.0.4 / TensorRT7.2.1 / OpenCV4.2 / VS2017\n\n## How to Run\n\n1. generate .wts from pytorch with .pth\n\n```\n// git clone https://github.com/facebookresearch/detr.git\n// go to facebookresearch/detr\n// download https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth\n// download https://raw.githubusercontent.com/freedenS/TestImage/main/demo.jpg\n// copy tensorrtx/detr/gen_wts.py and demo.jpg into facebookresearch/detr\npython gen_wts.py\n// a file 'detr.wts' will be generated.\n```\n\n2. build tensorrtx/detr and run\n\n```\n// put detr.wts into tensorrtx/detr\n// go to tensorrtx/detr\n// update parameters in detr.cpp if your model is trained on custom dataset.The parameters are corresponding to config in detr.\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./detr -s [.wts] // serialize model to plan file\nsudo ./detr -d [.engine] [image folder] // deserialize and run inference, the images in [image folder] will be processed\n// For example\nsudo ./detr -s ../detr.wts detr.engine\nsudo ./detr -d detr.engine ../samples\n```\n\n3. check the images generated, as follows. 
_demo.jpg and so on.\n\n## Backbone\n\n#### R50\n\n```\n1. download the pretrained model\n  https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth\n2. export wts\n  set the first parameter of Backbone in gen_wts.py (line 23) to resnet50\n  set the path of the pretrained model (line 87 in gen_wts.py)\n3. set resnet_type in BuildResNet (line 546 in detr.cpp) to R50\n```\n\n#### R101\n\n```\n1. download the pretrained model\n  https://dl.fbaipublicfiles.com/detr/detr-r101-2c7b67e5.pth\n2. export wts\n  set the first parameter of Backbone in gen_wts.py (line 23) to resnet101\n  set the path of the pretrained model (line 87 in gen_wts.py)\n3. set resnet_type in BuildResNet (line 546 in detr.cpp) to R101\n```\n\n## NOTE\n\n- TensorRT uses a fixed input size. If the size of your data differs from the size the engine was built for, you need to resize your data and rescale the results accordingly.\n- Image preprocessing in C++ differs slightly from the Python version (OpenCV vs. PIL).\n\n## Quantization\n\n1. quantizationType: fp32, fp16, int8. See BuildDETRModel (detr.cpp line 613) for details.\n\n2. int8 is used the same way as in [tensorrtx/yolov5](../yolov5/README.md).\n\n\n## Latency\n\nAverage cost of doInference (in detr.cpp), measured from the second run onward, with batch=1 in the Ubuntu environment above.\n\n|      | fp32    | fp16    | int8   |\n| ---- | ------- | ------- | ------ |\n| R50  | 19.57ms | 9.424ms | 8.38ms |\n| R101 | 30.82ms | 12.4ms  | 9.59ms |\n\n"
  },
  {
    "path": "detr/backbone.hpp",
    "content": "#pragma once\n#include <map>\n#include \"common.hpp\"\n\nenum RESNETTYPE {\n    R18 = 0,\n    R34,\n    R50,\n    R101,\n    R152\n};\n\nconst std::map<RESNETTYPE, std::vector<int>> num_blocks_per_stage = {\n    {R18, {2, 2, 2, 2}},\n    {R34, {3, 4, 6, 3}},\n    {R50, {3, 4, 6, 3}},\n    {R101, {3, 4, 23, 3}},\n    {R152, {3, 8, 36, 3}}\n};\n\nIScaleLayer* addBatchNorm2d(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nITensor& input,\nconst std::string& lname,\nfloat eps = 1e-5\n) {\n    float *gamma = (float*)(weightMap[lname + \".weight\"].values);\n    float *beta = (float*)(weightMap[lname + \".bias\"].values);\n    float *mean = (float*)(weightMap[lname + \".running_mean\"].values);\n    float *var = (float*)(weightMap[lname + \".running_var\"].values);\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{ DataType::kFLOAT, scval, len };\n\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{ DataType::kFLOAT, shval, len };\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{ DataType::kFLOAT, pval, len };\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* BasicStem(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& input,\nint out_channels,\nint 
group_num = 1\n) {\n    // conv1\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n    IConvolutionLayer* conv1 = network->addConvolutionNd(\n        input,\n        out_channels,\n        DimsHW{ 7, 7 },\n        weightMap[lname + \".conv1.weight\"],\n        emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ 2, 2 });\n    conv1->setPaddingNd(DimsHW{ 3, 3 });\n    conv1->setNbGroups(group_num);\n\n    auto bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn1\");\n    assert(bn1);\n\n    auto r1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(r1);\n\n    auto max_pool2d = network->addPoolingNd(*r1->getOutput(0), PoolingType::kMAX, DimsHW{ 3, 3 });\n    max_pool2d->setStrideNd(DimsHW{ 2, 2 });\n    max_pool2d->setPaddingNd(DimsHW{ 1, 1 });\n    auto mp_dim = max_pool2d->getOutput(0)->getDimensions();\n    return max_pool2d;\n}\n\nITensor* BasicBlock(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& input,\nint in_channels,\nint out_channels,\nint stride = 1\n) {\n    // conv1\n    IConvolutionLayer* conv1 = network->addConvolutionNd(\n        input,\n        out_channels,\n        DimsHW{ 3, 3 },\n        weightMap[lname + \".conv1.weight\"],\n        weightMap[lname + \".conv1.bias\"]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ stride, stride });\n    conv1->setPaddingNd(DimsHW{ 1, 1 });\n\n    auto r1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    assert(r1);\n\n    // conv2\n    IConvolutionLayer* conv2 = network->addConvolutionNd(\n        *r1->getOutput(0),\n        out_channels, DimsHW{ 3, 3 },\n        weightMap[lname + \".conv2.weight\"],\n        weightMap[lname + \".conv2.bias\"]);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{ 1, 1 });\n    conv2->setPaddingNd(DimsHW{ 1, 1 });\n\n    // shortcut\n    ITensor* shortcut_value = nullptr;\n    if (in_channels != 
out_channels) {\n        auto shortcut = network->addConvolutionNd(\n            input,\n            out_channels,\n            DimsHW{ 1, 1 },\n            weightMap[lname + \".shortcut.weight\"],\n            weightMap[lname + \".shortcut.bias\"]);\n        assert(shortcut);\n        shortcut->setStrideNd(DimsHW{ stride, stride });\n        shortcut_value = shortcut->getOutput(0);\n    } else {\n        shortcut_value = &input;\n    }\n\n    // add\n    auto ew = network->addElementWise(*conv2->getOutput(0), *shortcut_value, ElementWiseOperation::kSUM);\n    assert(ew);\n\n    auto r3 = network->addActivation(*ew->getOutput(0), ActivationType::kRELU);\n    assert(r3);\n\n    return r3->getOutput(0);\n}\n\nITensor* BottleneckBlock(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& input,\nint in_channels,\nint bottleneck_channels,\nint out_channels,\nint stride = 1,\nint dilation = 1,\nint group_num = 1\n) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n    // conv1\n    IConvolutionLayer* conv1 = network->addConvolutionNd(\n        input,\n        bottleneck_channels,\n        DimsHW{ 1, 1 },\n        weightMap[lname + \".conv1.weight\"],\n        emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ 1, 1 });\n    conv1->setNbGroups(group_num);\n\n    auto bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn1\");\n    assert(bn1);\n\n    auto r1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(r1);\n\n    // conv2\n    IConvolutionLayer* conv2 = network->addConvolutionNd(\n        *r1->getOutput(0),\n        bottleneck_channels,\n        DimsHW{ 3, 3 },\n        weightMap[lname + \".conv2.weight\"],\n        emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{ stride, stride });\n    conv2->setPaddingNd(DimsHW{ 1 * dilation, 1 * dilation });\n    conv2->setDilationNd(DimsHW{ dilation, dilation });\n    
conv2->setNbGroups(group_num);\n\n    auto bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".bn2\");\n    assert(bn2);\n\n    auto r2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(r2);\n\n    // conv3\n    IConvolutionLayer* conv3 = network->addConvolutionNd(\n        *r2->getOutput(0),\n        out_channels,\n        DimsHW{ 1, 1 },\n        weightMap[lname + \".conv3.weight\"],\n        emptywts);\n    assert(conv3);\n    conv3->setStrideNd(DimsHW{ 1, 1 });\n    conv3->setNbGroups(group_num);\n\n    auto bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \".bn3\");\n    assert(bn3);\n\n    // shortcut\n    ITensor* shortcut_value = nullptr;\n    if (in_channels != out_channels) {\n        auto shortcut = network->addConvolutionNd(\n            input,\n            out_channels,\n            DimsHW{ 1, 1 },\n            weightMap[lname + \".downsample.0.weight\"],\n            emptywts);\n        assert(shortcut);\n        shortcut->setStrideNd(DimsHW{stride, stride});\n        shortcut->setNbGroups(group_num);\n\n        auto shortcut_bn = addBatchNorm2d(network, weightMap, *shortcut->getOutput(0), lname + \".downsample.1\");\n        assert(shortcut_bn);\n        shortcut_value = shortcut_bn->getOutput(0);\n    } else {\n        shortcut_value = &input;\n    }\n\n    // add\n    auto ew = network->addElementWise(*bn3->getOutput(0), *shortcut_value, ElementWiseOperation::kSUM);\n    assert(ew);\n\n    auto r3 = network->addActivation(*ew->getOutput(0), ActivationType::kRELU);\n    assert(r3);\n\n    return r3->getOutput(0);\n}\n\nITensor* MakeStage(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& input,\nint stage,\nRESNETTYPE resnet_type,\nint in_channels,\nint bottleneck_channels,\nint out_channels,\nint first_stride = 1,\nint dilation = 1\n) {\n    ITensor* out = &input;\n    for (int i = 0; i < stage; i++) {\n  
      std::string layerName = lname + \".\" + std::to_string(i);\n        int stride = i == 0 ? first_stride : 1;\n\n        if (resnet_type == R18 || resnet_type == R34)\n            out = BasicBlock(network, weightMap, layerName, *out, in_channels, out_channels, stride);\n        else\n            out = BottleneckBlock(\n                network,\n                weightMap,\n                layerName,\n                *out,\n                in_channels,\n                bottleneck_channels,\n                out_channels,\n                stride,\n                dilation);\n\n        in_channels = out_channels;\n    }\n    return out;\n}\n\nITensor* BuildResNet(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nITensor& input,\nRESNETTYPE resnet_type,\nint stem_out_channels,\nint bottleneck_channels,\nint res2_out_channels,\nint res5_dilation = 1\n) {\n    assert(res5_dilation == 1 || res5_dilation == 2);  // \"res5_dilation must be 1 or 2\"\n    if (resnet_type == R18 || resnet_type == R34) {\n        assert(res2_out_channels == 64);  // \"res2_out_channels must be 64 for R18/R34\")\n        assert(res5_dilation == 1);  // \"res5_dilation must be 1 for R18/R34\")\n    }\n\n    int out_channels = res2_out_channels;\n    ITensor* out = nullptr;\n    // stem\n    auto stem = BasicStem(network, weightMap, \"backbone.0.body\", input, stem_out_channels);\n    out = stem->getOutput(0);\n\n    // res\n    for (int i = 0; i < 4; i++) {\n        int dilation = (i == 3) ? res5_dilation : 1;\n        int first_stride = (i == 0 || (i == 3 && dilation == 2)) ? 
1 : 2;\n        out = MakeStage(\n            network,\n            weightMap,\n            \"backbone.0.body.layer\" + std::to_string(i + 1),\n            *out,\n            num_blocks_per_stage.at(resnet_type)[i],\n            resnet_type,\n            stem_out_channels,\n            bottleneck_channels,\n            out_channels,\n            first_stride,\n            dilation);\n        stem_out_channels = out_channels;\n        bottleneck_channels *= 2;\n        out_channels *= 2;\n    }\n    return out;\n}\n"
  },
  {
    "path": "detr/calibrator.hpp",
    "content": "#pragma once\n\n#include \"NvInfer.h\"\n#include <string>\n#include <vector>\n#include <iostream>\n#include <iterator>\n#include <fstream>\n#include <algorithm>\n#include \"common.hpp\"\n#include \"macros.h\"\n \n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2 {\n public:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h,\n    const char* img_dir, const char* calib_table_name,\n    const char* input_blob_name, bool read_cache = true);\n\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\n private:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize,\nint input_w, int input_h, const char* img_dir,\nconst char* calib_table_name, const char* input_blob_name,\nbool read_cache)\n    : batchsize_(batchsize)\n    , input_w_(input_w)\n    , input_h_(input_h)\n    , img_idx_(0)\n    , img_dir_(img_dir)\n    , calib_table_name_(calib_table_name)\n    , input_blob_name_(input_blob_name)\n    , read_cache_(read_cache) {\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, 
img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2() {\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT {\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT {\n    if (img_idx_ + batchsize_ > static_cast<int>(img_files_.size())) {\n        return false;\n    }\n\n    std::vector<float> input_imgs_(input_count_, 0);\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + img_files_[i]);\n        if (temp.empty()) {\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        // preprocessImg takes (img, newh, neww), so height comes first\n        preprocessImg(temp, input_h_, input_w_);\n        for (int c = 0; c < 3; c++) {\n            for (int h = 0; h < input_h_; h++) {\n                for (int w = 0; w < input_w_; w++) {\n                    input_imgs_[(i-img_idx_)*input_w_*input_h_*3 +\n                        c * input_h_ * input_w_ + h * input_w_ + w] = temp.at<cv::Vec3f>(h, w)[c];\n                }\n            }\n        }\n    }\n    img_idx_ += batchsize_;\n\n    CUDA_CHECK(cudaMemcpy(device_input_, input_imgs_.data(), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT {\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good()) {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? 
calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT {\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n"
  },
  {
    "path": "detr/common.hpp",
    "content": "#pragma once\n\n#include <dirent.h>\n#include <cuda_runtime_api.h>\n#include <fstream>\n#include <sstream>\n#include <iostream>\n#include <vector>\n#include <unordered_map>\n#include <algorithm>\n#include \"./logging.h\"\n#include <NvInfer.h>\n#include <opencv2/opencv.hpp>\n\nstatic Logger gLogger;\n\nusing namespace nvinfer1;\nvoid loadWeights(const std::string file, std::unordered_map<std::string, Weights>& weightMap) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        Weights wt{ DataType::kFLOAT, nullptr, 0 };\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n}\n\nint CalculateSize(Dims a) {\n    int res = 1;\n    for (int i = 0; i < a.nbDims; i++) {\n        res *= a.d[i];\n    }\n    return res;\n}\n\nstatic inline int read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            // std::string cur_file_name(p_dir_name);\n            // cur_file_name += 
\"/\";\n            // cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\nvoid preprocessImg(cv::Mat& img, int newh, int neww) {\n    // convert to rgb\n    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    cv::resize(img, img, cv::Size(neww, newh));\n    img.convertTo(img, CV_32FC3);\n    img /= 255;\n    img -= cv::Scalar(0.485, 0.456, 0.406);\n    img /= cv::Scalar(0.229, 0.224, 0.225);\n}\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)\\\n    {\\\n        cudaError_t error_code = callstr;\\\n        if (error_code != cudaSuccess) {\\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__;\\\n            assert(0);\\\n        }\\\n    }\n#endif  // CUDA_CHECK\n"
  },
  {
    "path": "detr/detr.cpp",
    "content": "#pragma once\n#include <iostream>\n#include <unordered_map>\n#include \"./logging.h\"\n#include \"backbone.hpp\"\n#include \"calibrator.hpp\"\n\n#define DEVICE 0\n#define BATCH_SIZE 1\n\n// 1 / math.sqrt(head_dim) https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/include/torch/nn/functional/activation.h#623\nstatic const float SCALING = 0.17677669529663687;\nstatic const int INPUT_H = 800;\nstatic const int INPUT_W = 1066;\nstatic const int NUM_CLASS = 92;  // include background\nstatic const float SCALING_ONE = 1.0;\nstatic const float SHIFT_ZERO = 0.0;\nstatic const float POWER_TWO = 2.0;\nstatic const float EPS = 0.00001;\nstatic const int D_MODEL = 256;\nstatic const int NHEAD = 8;\nstatic const int DIM_FEEDFORWARD = 2048;\nstatic const int NUM_ENCODE_LAYERS = 6;\nstatic const int NUM_DECODE_LAYERS = 6;\nstatic const int NUM_QUERIES = 100;\nstatic const float SCORE_THRESH = 0.5;\n\nconst char* INPUT_NODE_NAME = \"images\";\nconst std::vector<std::string> OUTPUT_NAMES = { \"scores\", \"boxes\"};\n\nITensor* PositionEmbeddingSine(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nITensor& input,\nint num_pos_feats = 64,\nint temperature = 10000\n) {\n    // refer to https://github.com/facebookresearch/detr/blob/master/models/position_encoding.py#12\n    // TODO: improve this implementation\n    auto mask_dim = input.getDimensions();\n    int h = mask_dim.d[1], w = mask_dim.d[2];\n    std::vector<std::vector<float>> y_embed(h);\n    for (int i = 0; i < h; i++)\n        y_embed[i] = std::vector<float>(w, i + 1);\n    std::vector<float> sub_embed(w, 0);\n    for (int i = 0; i < w; i++)\n        sub_embed[i] = i + 1;\n    std::vector<std::vector<float>> x_embed(h, sub_embed);\n\n    // normalize\n    float eps = 1e-6, scale = 2.0 * 3.1415926;\n    for (int i = 0; i < h; i++) {\n        for (int j = 0; j < w; j++) {\n            y_embed[i][j] = y_embed[i][j] / (h + eps) * scale;\n            x_embed[i][j] 
= x_embed[i][j] / (w + eps) * scale;\n        }\n    }\n\n    // dim_t\n    std::vector<float> dim_t(num_pos_feats, 0);\n    for (int i = 0; i < num_pos_feats; i++) {\n        dim_t[i] = pow(temperature, (2 * (i / 2) / static_cast<float>(num_pos_feats)));\n    }\n\n    // pos_x, pos_y\n    std::vector<std::vector<std::vector<float>>> pos_x(h,\n    std::vector<std::vector<float>>(w,\n    std::vector<float>(num_pos_feats, 0)));\n\n    std::vector<std::vector<std::vector<float>>> pos_y(h,\n    std::vector<std::vector<float>>(w,\n    std::vector<float>(num_pos_feats, 0)));\n    for (int i = 0; i < h; i++) {\n        for (int j = 0; j < w; j++) {\n            for (int k = 0; k < num_pos_feats; k++) {\n                float value_x = x_embed[i][j] / dim_t[k];\n                float value_y = y_embed[i][j] / dim_t[k];\n                if (k & 1) {\n                    pos_x[i][j][k] = std::cos(value_x);\n                    pos_y[i][j][k] = std::cos(value_y);\n                } else {\n                    pos_x[i][j][k] = std::sin(value_x);\n                    pos_y[i][j][k] = std::sin(value_y);\n                }\n            }\n        }\n    }\n\n    // pos\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * h * w * num_pos_feats * 2));\n    float *pNext = pval;\n    for (int i = 0; i < h; i++) {\n        for (int j = 0; j < w; j++) {\n            for (int k = 0; k < num_pos_feats; k++) {\n                *pNext = pos_y[i][j][k];\n                ++pNext;\n            }\n            for (int k = 0; k < num_pos_feats; k++) {\n                *pNext = pos_x[i][j][k];\n                ++pNext;\n            }\n        }\n    }\n    Weights pos_embed_weight{ DataType::kFLOAT, pval, h * w * num_pos_feats * 2 };\n    weightMap[\"pos\"] = pos_embed_weight;\n    auto pos_embed = network->addConstant(Dims4{ h * w, num_pos_feats * 2, 1, 1 }, pos_embed_weight);\n    assert(pos_embed);\n    return pos_embed->getOutput(0);\n}\n\nITensor* 
MultiHeadAttention(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& query,\nITensor& key,\nITensor& value,\nint embed_dim = 256,\nint num_heads = 8\n) {\n    int tgt_len = query.getDimensions().d[0];\n    int head_dim = embed_dim / num_heads;\n\n    // q\n    auto linear_q = network->addFullyConnected(\n        query,\n        embed_dim,\n        weightMap[lname + \".in_proj_weight_q\"],\n        weightMap[lname + \".in_proj_bias_q\"]);\n    assert(linear_q);\n\n    // k\n    auto linear_k = network->addFullyConnected(\n        key,\n        embed_dim,\n        weightMap[lname + \".in_proj_weight_k\"],\n        weightMap[lname + \".in_proj_bias_k\"]);\n    assert(linear_k);\n\n    // v\n    auto linear_v = network->addFullyConnected(\n        value,\n        embed_dim,\n        weightMap[lname + \".in_proj_weight_v\"],\n        weightMap[lname + \".in_proj_bias_v\"]);\n    assert(linear_v);\n\n    auto scaling_t = network->addConstant(Dims4{ 1, 1, 1, 1 }, Weights{ DataType::kFLOAT, &SCALING, 1 });\n    assert(scaling_t);\n    auto q_scaling = network->addElementWise(\n        *linear_q->getOutput(0),\n        *scaling_t->getOutput(0),\n        ElementWiseOperation::kPROD);\n    assert(q_scaling);\n\n    auto q_shuffle = network->addShuffle(*q_scaling->getOutput(0));\n    assert(q_shuffle);\n    q_shuffle->setName((lname + \".q_shuffle\").c_str());\n    q_shuffle->setReshapeDimensions(Dims3{ -1, num_heads, head_dim });\n    q_shuffle->setSecondTranspose(Permutation{1, 0, 2});\n\n    auto k_shuffle = network->addShuffle(*linear_k->getOutput(0));\n    assert(k_shuffle);\n    k_shuffle->setName((lname + \".k_shuffle\").c_str());\n    k_shuffle->setReshapeDimensions(Dims3{ -1, num_heads, head_dim });\n    k_shuffle->setSecondTranspose(Permutation{ 1, 0, 2 });\n\n    auto v_shuffle = network->addShuffle(*linear_v->getOutput(0));\n    assert(v_shuffle);\n    v_shuffle->setName((lname + 
\".v_shuffle\").c_str());\n    v_shuffle->setReshapeDimensions(Dims3{ -1, num_heads, head_dim });\n    v_shuffle->setSecondTranspose(Permutation{ 1, 0, 2 });\n#if NV_TENSORRT_MAJOR >= 8\n    auto q_product_k = network->addMatrixMultiply(*q_shuffle->getOutput(0), nvinfer1::MatrixOperation::kNONE, *k_shuffle->getOutput(0), nvinfer1::MatrixOperation::kTRANSPOSE);\n#else\n    auto q_product_k = network->addMatrixMultiply(*q_shuffle->getOutput(0), false, *k_shuffle->getOutput(0), true);\n#endif\n    assert(q_product_k);\n\n    // src_key_padding_mask are all false, so do nothing here\n    // see https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/include/torch/nn/functional/activation.h#826-#839\n\n    auto softmax = network->addSoftMax(*q_product_k->getOutput(0));\n    assert(softmax);\n    softmax->setAxes(4);\n#if NV_TENSORRT_MAJOR >= 8\n    auto attn_product_v = network->addMatrixMultiply(*softmax->getOutput(0), nvinfer1::MatrixOperation::kNONE, *v_shuffle->getOutput(0), nvinfer1::MatrixOperation::kNONE);\n#else\n    auto attn_product_v = network->addMatrixMultiply(*softmax->getOutput(0), false, *v_shuffle->getOutput(0), false);\n#endif\n    assert(attn_product_v);\n\n    auto attn_shuffle = network->addShuffle(*attn_product_v->getOutput(0));\n    assert(attn_shuffle);\n    attn_shuffle->setName((lname + \".attn_shuffle\").c_str());\n    attn_shuffle->setFirstTranspose(Permutation{ 1, 0, 2 });\n    attn_shuffle->setReshapeDimensions(Dims4{ tgt_len, -1, 1, 1 });\n\n    auto linear_attn = network->addFullyConnected(\n        *attn_shuffle->getOutput(0),\n        embed_dim,\n        weightMap[lname + \".out_proj.weight\"],\n        weightMap[lname + \".out_proj.bias\"]);\n    assert(linear_attn);\n\n    return linear_attn->getOutput(0);\n}\n\nITensor* LayerNorm(\nINetworkDefinition *network,\nITensor& input,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nint d_model = 256\n) {\n    // TODO: maybe a better implementation 
https://github.com/NVIDIA/TensorRT/blob/master/plugin/common/common.cuh#212\n    auto mean = network->addReduce(input, ReduceOperation::kAVG, 2, true);\n    assert(mean);\n\n    auto sub_mean = network->addElementWise(input, *mean->getOutput(0), ElementWiseOperation::kSUB);\n    assert(sub_mean);\n\n    // implement pow2 with scale\n    Weights scale{ DataType::kFLOAT, &SCALING_ONE, 1 };\n    Weights shift{ DataType::kFLOAT, &SHIFT_ZERO, 1 };\n    Weights power{ DataType::kFLOAT, &POWER_TWO, 1 };\n    auto pow2 = network->addScaleNd(*sub_mean->getOutput(0), ScaleMode::kUNIFORM, shift, scale, power, 0);\n    assert(pow2);\n\n    auto pow_mean = network->addReduce(*pow2->getOutput(0), ReduceOperation::kAVG, 2, true);\n    assert(pow_mean);\n\n    auto eps = network->addConstant(Dims4{ 1, 1, 1, 1 }, Weights{ DataType::kFLOAT, &EPS, 1 });\n    assert(eps);\n\n    auto add_eps = network->addElementWise(*pow_mean->getOutput(0), *eps->getOutput(0), ElementWiseOperation::kSUM);\n    assert(add_eps);\n\n    auto sqrt = network->addUnary(*add_eps->getOutput(0), UnaryOperation::kSQRT);\n    assert(sqrt);\n\n    auto div = network->addElementWise(*sub_mean->getOutput(0), *sqrt->getOutput(0), ElementWiseOperation::kDIV);\n    assert(div);\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * d_model));\n    for (int i = 0; i < d_model; i++) {\n        pval[i] = 1.0;\n    }\n    Weights norm1_power{ DataType::kFLOAT, pval, d_model };\n    weightMap[lname + \".power\"] = norm1_power;\n    auto affine = network->addScaleNd(\n        *div->getOutput(0),\n        ScaleMode::kCHANNEL,\n        weightMap[lname + \".bias\"],\n        weightMap[lname + \".weight\"],\n        norm1_power,\n        1);\n    assert(affine);\n    return affine->getOutput(0);\n}\n\nITensor* TransformerEncoderLayer(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& src,\nITensor& pos,\nint d_model = 256,\nint nhead = 8,\nint 
dim_feedforward = 2048\n) {\n    auto pos_embed = network->addElementWise(src, pos, ElementWiseOperation::kSUM);\n    assert(pos_embed);\n\n    ITensor* src2 = MultiHeadAttention(\n        network,\n        weightMap,\n        lname + \".self_attn\",\n        *pos_embed->getOutput(0),\n        *pos_embed->getOutput(0),\n        src,\n        d_model,\n        nhead);\n\n    auto shortcut1 = network->addElementWise(src, *src2, ElementWiseOperation::kSUM);\n    assert(shortcut1);\n\n    ITensor* norm1 = LayerNorm(network, *shortcut1->getOutput(0), weightMap, lname + \".norm1\");\n\n    auto linear1 = network->addFullyConnected(\n        *norm1,\n        dim_feedforward,\n        weightMap[lname + \".linear1.weight\"],\n        weightMap[lname + \".linear1.bias\"]);\n    assert(linear1);\n\n    auto relu = network->addActivation(*linear1->getOutput(0), ActivationType::kRELU);\n    assert(relu);\n\n    auto linear2 = network->addFullyConnected(\n        *relu->getOutput(0),\n        d_model,\n        weightMap[lname + \".linear2.weight\"],\n        weightMap[lname + \".linear2.bias\"]);\n    assert(linear2);\n\n    auto shortcut2 = network->addElementWise(*norm1, *linear2->getOutput(0), ElementWiseOperation::kSUM);\n    assert(shortcut2);\n\n    ITensor* norm2 = LayerNorm(network, *shortcut2->getOutput(0), weightMap, lname + \".norm2\");\n    return norm2;\n}\n\nITensor* TransformerEncoder(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& src,\nITensor& pos,\nint num_layers = 6\n) {\n    ITensor* out = &src;\n    for (int i = 0; i < num_layers; i++) {\n        std::string layer_name = lname + \".layers.\" + std::to_string(i);\n        out = TransformerEncoderLayer(network, weightMap, layer_name, *out, pos);\n    }\n    return out;\n}\n\nITensor* TransformerDecoderLayer(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& 
tgt,\nITensor& memory,\nITensor& pos,\nITensor& query_pos,\nint d_model = 256,\nint nhead = 8,\nint dim_feedforward = 2048\n) {\n    auto pos_embed = network->addElementWise(tgt, query_pos, ElementWiseOperation::kSUM);\n    assert(pos_embed);\n\n    ITensor* tgt2 = MultiHeadAttention(\n        network,\n        weightMap,\n        lname + \".self_attn\",\n        *pos_embed->getOutput(0),\n        *pos_embed->getOutput(0),\n        tgt);\n\n    auto shortcut1 = network->addElementWise(tgt, *tgt2, ElementWiseOperation::kSUM);\n    assert(shortcut1);\n    ITensor* norm1 = LayerNorm(network, *shortcut1->getOutput(0), weightMap, lname + \".norm1\");\n\n    auto query_embed = network->addElementWise(*norm1, query_pos, ElementWiseOperation::kSUM);\n    assert(query_embed);\n\n    auto key_embed = network->addElementWise(memory, pos, ElementWiseOperation::kSUM);\n    assert(key_embed);\n\n    ITensor* mha2 = MultiHeadAttention(\n        network,\n        weightMap,\n        lname + \".multihead_attn\",\n        *query_embed->getOutput(0),\n        *key_embed->getOutput(0),\n        memory);\n\n    auto shortcut2 = network->addElementWise(*norm1, *mha2, ElementWiseOperation::kSUM);\n    assert(shortcut2);\n\n    ITensor* norm2 = LayerNorm(network, *shortcut2->getOutput(0), weightMap, lname + \".norm2\");\n\n    auto linear1 = network->addFullyConnected(\n        *norm2,\n        dim_feedforward,\n        weightMap[lname + \".linear1.weight\"],\n        weightMap[lname + \".linear1.bias\"]);\n    assert(linear1);\n\n    auto relu = network->addActivation(*linear1->getOutput(0), ActivationType::kRELU);\n    assert(relu);\n\n    auto linear2 = network->addFullyConnected(\n        *relu->getOutput(0),\n        d_model,\n        weightMap[lname + \".linear2.weight\"],\n        weightMap[lname + \".linear2.bias\"]);\n    assert(linear2);\n\n    auto shortcut3 = network->addElementWise(*norm2, *linear2->getOutput(0), ElementWiseOperation::kSUM);\n    assert(shortcut3);\n\n    
ITensor* norm3 = LayerNorm(network, *shortcut3->getOutput(0), weightMap, lname + \".norm3\");\n\n    return norm3;\n}\n\nITensor* TransformerDecoder(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& tgt,\nITensor& memory,\nITensor& pos,\nITensor& query_pos,\nint num_layers = 6,\nint d_model = 256,\nint nhead = 8,\nint dim_feedforward = 2048\n) {\n    ITensor* out = &tgt;\n    for (int i = 0; i < num_layers; i++) {\n        std::string layer_name = lname + \".layers.\" + std::to_string(i);\n        out = TransformerDecoderLayer(\n            network,\n            weightMap,\n            layer_name,\n            *out,\n            memory,\n            pos,\n            query_pos,\n            d_model,\n            nhead,\n            dim_feedforward);\n    }\n    ITensor* norm = LayerNorm(network, *out, weightMap, lname + \".norm\", d_model);\n    return norm;\n}\n\nITensor* Transformer(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& src,\nITensor& pos_embed,\nint num_queries = 100,\nint num_encoder_layers = 6,\nint num_decoder_layers = 6,\nint d_model = 256,\nint nhead = 8,\nint dim_feedforward = 2048\n) {\n    auto memory = TransformerEncoder(network, weightMap, lname + \".encoder\", src, pos_embed, num_encoder_layers);\n\n    // construct tgt\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * num_queries * d_model));\n    for (int i = 0; i < num_queries * d_model; i++) {\n        pval[i] = 0.0;\n    }\n    Weights tgt_weight{ DataType::kFLOAT, pval, num_queries * d_model };\n    weightMap[lname + \".tgt_weight\"] = tgt_weight;\n    auto tgt = network->addConstant(Dims4{ num_queries, d_model, 1, 1 }, tgt_weight);\n    assert(tgt);\n    // construct query_pos\n    auto query_pos = network->addConstant(Dims4{ num_queries, d_model, 1, 1 }, weightMap[\"query_embed.weight\"]);\n    assert(query_pos);\n\n    
auto out = TransformerDecoder(\n        network,\n        weightMap,\n        lname + \".decoder\",\n        *tgt->getOutput(0),\n        *memory, pos_embed,\n        *query_pos->getOutput(0),\n        num_decoder_layers,\n        d_model,\n        nhead,\n        dim_feedforward);\n    return out;\n}\n\nITensor* MLP(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& src,\nint num_layers = 3,\nint hidden_dim = 256,\nint output_dim = 4\n) {\n    ITensor* out = &src;\n    for (int i = 0; i < num_layers; i++) {\n        std::string layer_name = lname + \".\" + std::to_string(i);\n        if (i != num_layers - 1) {\n            auto fc = network->addFullyConnected(\n                *out,\n                hidden_dim,\n                weightMap[layer_name + \".weight\"],\n                weightMap[layer_name + \".bias\"]);\n            assert(fc);\n            auto relu = network->addActivation(*fc->getOutput(0), ActivationType::kRELU);\n            assert(relu);\n            out = relu->getOutput(0);\n        } else {\n            auto fc = network->addFullyConnected(\n                *out,\n                output_dim,\n                weightMap[layer_name + \".weight\"],\n                weightMap[layer_name + \".bias\"]);\n            assert(fc);\n            out = fc->getOutput(0);\n        }\n    }\n    return out;\n}\n\nstd::vector<ITensor*> Predict(\nINetworkDefinition *network,\nstd::unordered_map<std::string, Weights>& weightMap,\nITensor& src\n) {\n    auto class_embed = network->addFullyConnected(\n        src,\n        NUM_CLASS,\n        weightMap[\"class_embed.weight\"],\n        weightMap[\"class_embed.bias\"]);\n    assert(class_embed);\n    auto class_softmax = network->addSoftMax(*class_embed->getOutput(0));\n    assert(class_softmax);\n    class_softmax->setAxes(2);\n    ITensor* bbox = MLP(network, weightMap, \"bbox_embed.layers\", src);\n    auto bbox_sig = 
network->addActivation(*bbox, ActivationType::kSIGMOID);\n    assert(bbox_sig);\n    std::vector<ITensor*> output = { class_softmax->getOutput(0), bbox_sig->getOutput(0) };\n    return output;\n}\n\nICudaEngine* createEngine_r50detr(\nunsigned int maxBatchSize,\nconst std::string& wtsfile,\nIBuilder* builder,\nIBuilderConfig* config,\nDataType dt,\nconst std::string& modelType = \"fp16\"\n) {\n    /*\n    description: after fuse bn\n    */\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_NODE_NAME\n    ITensor* data = network->addInput(INPUT_NODE_NAME, dt, Dims3{ 3, INPUT_H, INPUT_W });\n\n    // load weights\n    std::unordered_map<std::string, Weights> weightMap;\n    loadWeights(wtsfile, weightMap);\n\n    // backbone\n    auto features = BuildResNet(network, weightMap, *data, R50, 64, 64, 256);\n    ITensor* pos_embed = PositionEmbeddingSine(network, weightMap, *features, 128);\n    auto input_proj = network->addConvolutionNd(\n        *features,\n        D_MODEL,\n        DimsHW{ 1, 1 },\n        weightMap[\"input_proj.weight\"],\n        weightMap[\"input_proj.bias\"]);\n    assert(input_proj);\n    input_proj->setStrideNd(DimsHW{ 1, 1 });\n    auto flatten = network->addShuffle(*input_proj->getOutput(0));\n    assert(flatten);\n    flatten->setReshapeDimensions(Dims4{ input_proj->getOutput(0)->getDimensions().d[0], -1, 1, 1 });\n    flatten->setSecondTranspose(Permutation{ 1, 0, 2, 3 });\n\n    auto out1 = Transformer(\n        network,\n        weightMap,\n        \"transformer\",\n        *flatten->getOutput(0),\n        *pos_embed,\n        NUM_QUERIES,\n        NUM_ENCODE_LAYERS,\n        NUM_DECODE_LAYERS,\n        D_MODEL,\n        NHEAD,\n        DIM_FEEDFORWARD);\n    std::vector<ITensor*> results = Predict(network, weightMap, *out1);\n\n    // build output\n    for (int i = 0; i < results.size(); i++) {\n        network->markOutput(*results[i]);\n        
results[i]->setName(OUTPUT_NAMES[i].c_str());\n    }\n\n    // build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1ULL << 30);\n\n    if (modelType == \"fp32\") {\n        // fp32 needs no extra builder flag\n    } else if (modelType == \"fp16\") {\n        config->setFlag(BuilderFlag::kFP16);\n    } else if (modelType == \"int8\") {\n        std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n        assert(builder->platformHasFastInt8());\n        config->setFlag(BuilderFlag::kINT8);\n        Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(BATCH_SIZE, INPUT_W, INPUT_H, \"./coco_calib/\",\n        \"int8calib.table\", INPUT_NODE_NAME);\n        config->setInt8Calibrator(calibrator);\n    } else {\n        throw(\"unsupported model type\");\n    }\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // destroy network\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return engine;\n}\n\nvoid BuildDETRModel(unsigned int maxBatchSize, IHostMemory** modelStream,\nconst std::string& wtsfile, std::string modelType = \"fp32\") {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine_r50detr(maxBatchSize,\n        wtsfile, builder, config, DataType::kFLOAT, modelType);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, 
cudaStream_t& stream, std::vector<void*>& buffers,\nstd::vector<float>& input, std::vector<float*>& output) {\n    CUDA_CHECK(cudaMemcpyAsync(buffers[0], input.data(), input.size() * sizeof(float),\n    cudaMemcpyHostToDevice, stream));\n\n    context.enqueue(BATCH_SIZE, buffers.data(), stream, nullptr);\n\n    CUDA_CHECK(cudaMemcpyAsync(output[0], buffers[1], BATCH_SIZE * NUM_QUERIES * NUM_CLASS * sizeof(float),\n    cudaMemcpyDeviceToHost, stream));\n    CUDA_CHECK(cudaMemcpyAsync(output[1], buffers[2], BATCH_SIZE * NUM_QUERIES * 4 * sizeof(float),\n    cudaMemcpyDeviceToHost, stream));\n\n    cudaStreamSynchronize(stream);\n}\n\nbool parse_args(int argc, char** argv, std::string& wtsFile, std::string& engineFile, std::string& imgDir) {\n    if (argc < 4) return false;\n    if (std::string(argv[1]) == \"-s\") {\n        wtsFile = std::string(argv[2]);\n        engineFile = std::string(argv[3]);\n    } else if (std::string(argv[1]) == \"-d\") {\n        engineFile = std::string(argv[2]);\n        imgDir = std::string(argv[3]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(DEVICE);\n\n    std::string wtsFile = \"\";\n    std::string engineFile = \"\";\n\n    std::string imgDir;\n    if (!parse_args(argc, argv, wtsFile, engineFile, imgDir)) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./detr -s [.wts] [.engine] // serialize model to plan file\" << std::endl;\n        std::cerr << \"./detr -d [.engine] ../samples // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    if (!wtsFile.empty()) {\n        IHostMemory* modelStream{ nullptr };\n        BuildDETRModel(BATCH_SIZE, &modelStream, wtsFile, \"fp32\");\n        assert(modelStream != nullptr);\n        std::ofstream p(engineFile, std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return 
-1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    }\n\n    // deserialize the .engine and run inference\n    std::ifstream file(engineFile, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engineFile << \" error!\" << std::endl;\n        return -1;\n    }\n\n    std::string trtModelStream;\n    size_t modelSize{ 0 };\n    file.seekg(0, file.end);\n    modelSize = file.tellg();\n    file.seekg(0, file.beg);\n    trtModelStream.resize(modelSize);\n    assert(!trtModelStream.empty());\n    file.read(const_cast<char*>(trtModelStream.c_str()), modelSize);\n    file.close();\n\n    // build engine\n    std::cout << \"build engine\" << std::endl;\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream.c_str(), modelSize);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    runtime->destroy();\n\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    // prepare input file\n    std::vector<std::string> fileList;\n    if (read_files_in_dir(imgDir.c_str(), fileList) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // calculate input size\n    int input_size = CalculateSize(context->getBindingDimensions(0));\n\n    // prepare input data\n    std::vector<float> data(BATCH_SIZE * input_size, 0);\n    void *data_d, *scores_d, *boxes_d;\n    CUDA_CHECK(cudaMalloc(&data_d, BATCH_SIZE * input_size * sizeof(float)));\n    CUDA_CHECK(cudaMalloc(&scores_d, BATCH_SIZE * NUM_QUERIES * NUM_CLASS * sizeof(float)));\n    CUDA_CHECK(cudaMalloc(&boxes_d, BATCH_SIZE * NUM_QUERIES * 4 * sizeof(float)));\n\n    std::vector<float> scores_h(BATCH_SIZE * NUM_QUERIES * NUM_CLASS);\n    
std::vector<float> boxes_h(BATCH_SIZE * NUM_QUERIES * 4);\n\n    std::vector<void*> buffers = { data_d, scores_d, boxes_d };\n    std::vector<float*> outputs = {scores_h.data(), boxes_h.data()};\n\n    int fcount = 0;\n    int fileLen = fileList.size();\n    for (int f = 0; f < fileLen; f++) {\n        fcount++;\n        if (fcount < BATCH_SIZE && f + 1 != fileLen) continue;\n\n        for (int b = 0; b < fcount; b++) {\n            cv::Mat img = cv::imread(imgDir + \"/\" + fileList[f - fcount + 1 + b]);\n            if (img.empty()) continue;\n            preprocessImg(img, INPUT_H, INPUT_W);\n            assert(img.cols * img.rows * 3 == input_size);\n            for (int c = 0; c < 3; c++) {\n                for (int h = 0; h < img.rows; h++) {\n                    for (int w = 0; w < img.cols; w++) {\n                        data[b * input_size +\n                        c * img.rows * img.cols + h * img.cols + w] = img.at<cv::Vec3f>(h, w)[c];\n                    }\n                }\n            }\n        }\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n\n        doInference(*context, stream, buffers, data, outputs);\n\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n        for (int b = 0; b < fcount; b++) {\n            cv::Mat img = cv::imread(imgDir + \"/\" + fileList[f - fcount + 1 + b]);\n            for (int i = 0; i < scores_h.size(); i += NUM_CLASS) {\n                int label = -1;\n                float score = -1;\n                for (int j = i; j < i + NUM_CLASS; j++) {\n                    if (score < scores_h[j]) {\n                        label = j;\n                        score = scores_h[j];\n                    }\n                }\n                if (score > SCORE_THRESH && (label % NUM_CLASS != NUM_CLASS - 1)) {\n                    int ind = label / NUM_CLASS;\n        
            label = label % NUM_CLASS;\n                    float cx = boxes_h[ind * 4];\n                    float cy = boxes_h[ind * 4 + 1];\n                    float w = boxes_h[ind * 4 + 2];\n                    float h = boxes_h[ind * 4 + 3];\n                    float x1 = (cx - w / 2.0) * img.cols;\n                    float y1 = (cy - h / 2.0) * img.rows;\n                    float x2 = (cx + w / 2.0) * img.cols;\n                    float y2 = (cy + h / 2.0) * img.rows;\n                    cv::Rect r(x1, y1, x2 - x1, y2 - y1);\n                    cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n                    cv::putText(img, std::to_string(label), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                    cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n                }\n            }\n            cv::imwrite(\"_\" + fileList[f - fcount + 1 + b], img);\n        }\n        fcount = 0;\n    }\n\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(data_d));\n    CUDA_CHECK(cudaFree(scores_d));\n    CUDA_CHECK(cudaFree(boxes_d));\n    context->destroy();\n    engine->destroy();\n\n    return 0;\n}\n"
  },
  {
    "path": "detr/gen_wts.py",
    "content": "import cv2\n\nimport torch\nfrom models.transformer import Transformer\nfrom models.position_encoding import PositionEmbeddingSine\nfrom models.backbone import Backbone, Joiner\nfrom models.detr import DETR\nimport torchvision.transforms as T\nfrom PIL import Image\nimport struct\n\ndef box_cxcywh_to_xyxy(x):\n    x_c, y_c, w, h = x.unbind(-1)\n    b = [(x_c - 0.5 * w), (y_c - 0.5 * h),\n         (x_c + 0.5 * w), (y_c + 0.5 * h)]\n    return torch.stack(b, dim=-1)\n\ndef build_backbone():\n    N_steps = 256 // 2\n    position_embedding = PositionEmbeddingSine(N_steps, normalize=True)\n    train_backbone = True\n    return_interm_layers = False\n    backbone = Backbone('resnet50', train_backbone, return_interm_layers, False)\n    model = Joiner(backbone, position_embedding)\n    model.num_channels = backbone.num_channels\n    return model\n\ndef gen_wts(model, filename):\n    f = open(filename + '.wts', 'w')\n    f.write('{}\\n'.format(len(model.state_dict().keys()) + 72))\n    for k, v in model.state_dict().items():\n        if 'in_proj' in k:\n            dim = int(v.size(0) / 3)\n            q_weight = v[:dim].reshape(-1).cpu().numpy()\n            k_weight = v[dim:2*dim].reshape(-1).cpu().numpy()\n            v_weight = v[2*dim:].reshape(-1).cpu().numpy()\n            f.write('{} {} '.format(k + '_q', len(q_weight)))\n            for vv in q_weight:\n                f.write(' ')\n                f.write(struct.pack('>f', float(vv)).hex())\n            f.write('\\n')\n\n            f.write('{} {} '.format(k + '_k', len(k_weight)))\n            for vv in k_weight:\n                f.write(' ')\n                f.write(struct.pack('>f', float(vv)).hex())\n            f.write('\\n')\n\n            f.write('{} {} '.format(k + '_v', len(v_weight)))\n            for vv in v_weight:\n                f.write(' ')\n                f.write(struct.pack('>f', float(vv)).hex())\n            f.write('\\n')\n        else:\n            vr = 
v.reshape(-1).cpu().numpy()\n            f.write('{} {} '.format(k, len(vr)))\n            for vv in vr:\n                f.write(' ')\n                f.write(struct.pack('>f',float(vv)).hex())\n            f.write('\\n')\n    f.close()\n\ndef main():\n    num_classes = 91\n    device = torch.device('cuda')\n\n    backbone = build_backbone()\n\n    transformer = Transformer(\n        d_model=256,\n        dropout=0.1,\n        nhead=8,\n        dim_feedforward=2048,\n        num_encoder_layers=6,\n        num_decoder_layers=6,\n        normalize_before=False,\n        return_intermediate_dec=True,\n    )\n\n    model = DETR(\n        backbone,\n        transformer,\n        num_classes=num_classes,\n        num_queries=100,\n        aux_loss=True,\n    )\n    checkpoint = torch.load('./detr-r50-e632da11.pth')\n    model.load_state_dict(checkpoint['model'])\n    model.to(device)\n    model.eval()\n\n    gen_wts(model, \"detr\")\n\n    # test\n    # with torch.no_grad():\n    #     transform = T.Compose([T.Resize(800), T.ToTensor(), T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])\n    #     im = Image.open('./image/demo.jpg')\n    #     img = transform(im).unsqueeze(0)\n\n    #     img = img.to(device)\n    #     res = model(img)\n\n    #     logits = res['pred_logits']\n    #     pred_boxes = res['pred_boxes']\n    #     out_prob = logits.softmax(-1)[0, :, :-1]\n    #     keep = out_prob.max(-1).values > 0.5\n    #     label = out_prob[keep].argmax(dim=1)\n    #     out_bbox = pred_boxes[0, keep]\n    #     out_bbox = out_bbox.to(torch.device('cpu'))\n    #     out_bbox = box_cxcywh_to_xyxy(out_bbox)\n    #     out_bbox = out_bbox * torch.tensor([640, 480, 640, 480])\n    #     image = cv2.imread('./image/demo.jpg')\n    #     for ob in out_bbox:\n    #         x0 = int(ob[0].item())\n    #         y0 = int(ob[1].item())\n    #         x1 = int(ob[2].item())\n    #         y1 = int(ob[3].item())\n    #         cv2.rectangle(image, (x0, y0), (x1, y1), 
(0,0,255), 1)\n        \n    #     cv2.imwrite('res.jpg', image)\n\nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "detr/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\n#include \"macros.h\"\n\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the 
buffer and flushing the stream\n    virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) {\n        mShouldLog = shouldLog;\n    }\n\n private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer)  // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer)  // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger {\n public:\n    explicit Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n     public:\n        TestAtom(TestAtom&&) = default;\n\n     private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const {\n        return mReportableSeverity;\n    }\n\n private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "detr/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include \"NvInfer.h\"\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)a\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "docker/README.md",
    "content": "# Tutorials\n\n## Introduction\n\nThis folder contains the docker and docker-compose file to build the development environment without pain.\n\n## Prerequisites\n\n* OS: Linux or WSL2\n* docker\n* nvidia-container-toolkit\n* (Optional but **recommended**) docker-compose\n\n## Usage\n\n1. (With docker-compose) configure the `.env` file, change `DATA_DIR` to your mount point, such as your code or data folder, etc, comment the `volumes` in docker compose file if not necessariy needed\n\n2. Build image:\n```bash\ndocker compose -f docker-compose.yml build\n```\n\n3. Run a container at background:\n```bash\ndocker compose -f docker-compose.yml up -d\n```\n\n4. Attach to this container with your IDE and have fun!\n\n## HowTos\n\n### How to build and run with docker?\n\n``` bash\ndocker build -f docker/x86_64.dockerfile -v .\ndocker run -it --gpus all --privileged --net=host --ipc=host -v  /bin/bash\n```\n\n### How to build image with other TensorRT version?\n\nChange the `TAG` on top of the `.dockerfile`. Note: all images are officially owned by NVIDIA NGC, which requires a registration before pulling. 
For this repo, the most commonly used `TAG` values are:\n\n| Container Image | Container OS | Driver | CUDA | TensorRT | Torch | Recommended |\n| :----: | :----: | :----: | :----: | :----: | :----: | :----: |\n| 20.12-py3 | Ubuntu 20.04 | 455 | 11.2 | 7.2.2 | 1.8.0 | ❌ |\n| 24.01-py3 | Ubuntu 22.04 | 545 | 12.3 | 8.6.1 | 2.2.0 | ✅ |\n| 24.04-py3 | Ubuntu 22.04 | 545 | 12.4 | 8.6.3 | 2.3.0 | ✅ |\n| 24.09-py3 | Ubuntu 22.04 | 560 | 12.6 | 10.4.0 | 2.5.0 | ✅ |\n\nFor more details on the support matrix, see [HERE](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)\n\n### How to customize the opencv in the image?\n\nIf the prebuilt package from apt does not meet your requirements, please refer to the demo code in `.dockerfile` to build opencv from source.\n\n### How to solve failures when building the image?\n\nFor *443 timeout* or any similar network issues, a proxy may be required. To make your host proxy available in the Docker build environment, change the `build` node inside the docker-compose file like this:\n```YAML\n    build:\n      dockerfile: x86_64.dockerfile\n      args:\n        HTTP_PROXY: ${PROXY}\n        HTTPS_PROXY: ${PROXY}\n        ALL_PROXY: ${PROXY}\n        http_proxy: ${PROXY}\n        https_proxy: ${PROXY}\n        all_proxy: ${PROXY}\n```\nThen add `PROXY=\"http://xxx:xxx\"` to the `.env` file.\n\n## Note\n\nSupport for older versions, such as TensorRT **< 8**, may be deprecated in the future.\n"
  },
  {
    "path": "docker/tensorrtx-docker-compose.yml",
    "content": "services:\n  tensorrt:\n    image: tensortx:1.0.1\n    container_name: tensortx\n    environment:\n      - NVIDIA_VISIBLE_DEVICES=all\n    build:\n      dockerfile: x86_64.dockerfile\n    cap_add:\n      - CAP_SYS_ADMIN\n    security_opt:\n      - seccomp:unconfined\n    privileged: true\n    stdin_open: true\n    tty: true\n    shm_size: '8gb'\n    ulimits:\n      memlock:\n        soft: -1\n        hard: -1\n    devices:\n      - /dev:/dev:rw\n    volumes:\n      #### user ####\n      - ${HOME}:/workspace/localhome:rw\n      #### custom ####\n      - mount:/mnt:rw\n    deploy:\n      restart_policy:\n        condition: on-failure\n        max_attempts: 1\n        delay: 5s\n      resources:\n        reservations:\n          devices:\n            - driver: nvidia\n              capabilities: [gpu]\n              count: all\n\nvolumes:\n  mount:\n    driver: local\n    driver_opts:\n      type: none\n      o: bind\n      device: ${DATA_DIR}\n"
  },
  {
    "path": "docker/x86_64.dockerfile",
    "content": "ARG TAG=24.01-py3\n\nFROM nvcr.io/nvidia/tensorrt:${TAG} AS tensorrtx\n\nENV DEBIAN_FRONTEND noninteractive\n\n# basic tools\nRUN apt update && apt-get install -y --fix-missing --no-install-recommends \\\nsudo wget curl git ca-certificates ninja-build tzdata pkg-config \\\ngdb libglib2.0-dev libmount-dev locales \\\n&& rm -rf /var/lib/apt/lists/*\nRUN pip install --no-cache-dir yapf isort cmake-format pre-commit\n\n## fix a potential pre-commit error\nRUN locale-gen \"en_US.UTF-8\"\n\n## override older cmake\nRUN find /usr/local/share -type d -name \"cmake-*\" -exec rm -rf {} + \\\n&& curl -fsSL \"https://github.com/Kitware/CMake/releases/download/v3.30.0/cmake-3.30.0-linux-x86_64.sh\" \\\n-o cmake.sh && bash cmake.sh --skip-license --exclude-subdir --prefix=/usr/local && rm cmake.sh\n\nRUN apt update && apt-get install -y \\\nlibopencv-dev \\\n&& rm -rf /var/lib/apt/lists/*\n\n## a template to build opencv and opencv_contrib from source\n# RUN git clone -b 4.x https://github.com/opencv/opencv_contrib.git \\\n# && git clone -b 4.x https://github.com/opencv/opencv.git opencv \\\n# && cmake -S opencv -B opencv/build -G Ninja \\\n# -DBUILD_LIST=core,calib3d,imgproc,imgcodecs,highgui \\\n# -DOPENCV_EXTRA_MODULES_PATH=\"/workspace/opencv_contrib/modules\" \\\n# -DCMAKE_BUILD_TYPE=RELEASE \\\n# -DCMAKE_INSTALL_PREFIX=/usr/local \\\n# -DENABLE_FAST_MATH=ON \\\n# -DOPENCV_GENERATE_PKGCONFIG=ON \\\n# -DBUILD_opencv_python2=OFF \\\n# -DBUILD_opencv_python3=OFF \\\n# -DBUILD_JAVA=OFF \\\n# -DBUILD_DOCS=OFF \\\n# -DBUILD_PERF_TESTS=OFF \\\n# -DBUILD_TESTS=OFF \\\n# && ninja -C opencv/build install\n"
  },
  {
    "path": "efficient_ad/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.12)\nproject(EfficientAD-M)\n\nadd_definitions(-w)\nadd_definitions(-D API_EXPORTS)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE \"Debug\")\nset(CMAKE_CUDA_ARCHITECTURES 61 75 86 89)\nset(THREADS_PREFER_PTHREAD_FLAG ON)\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} /od\")\n\n### nvcc\nset(CMAKE_CUDA_COMPILER \"D:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe\")\nenable_language(CUDA)\n### cuda\ninclude_directories(\"D:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/include\")\nlink_directories(\"D:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/lib/x64\")\n### tensorrt\nset(TRT_DIR \"D:/Program Files/NVIDIA GPU Computing Toolkit/TensorRT-8.5.3.1/\")\ninclude_directories(${TRT_DIR}/include)\nlink_directories(${TRT_DIR}/lib)\n### opencv\nset(OpenCV_DIR \"E:/OpenCV/OpenCV_4.6.0/opencv/build\")\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n### dirent\ninclude_directories(\"E:/SDK/dirent-1.24/include\")\n\ninclude_directories(${PROJECT_SOURCE_DIR}/src/)\nfile(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/src/*.cpp ${PROJECT_SOURCE_DIR}/src/*.cu)\n\nadd_executable(efficientAD_det \"./efficientAD_det.cpp\" ${SRCS})\ntarget_link_libraries(efficientAD_det nvinfer\n                                      cudart\n                                      nvinfer_plugin\n                                      ${OpenCV_LIBS}\n                                      )\n"
  },
  {
    "path": "efficient_ad/README.md",
    "content": "# EfficientAd\n\nEfficientAd: Accurate Visual Anomaly Detection at Millisecond-Level Latencies.\n\nThe PyTorch implementation is [openvinotoolkit/anomalib](https://github.com/openvinotoolkit/anomalib).\n\n<p align=\"center\">\n<img src=\"https://github.com/wang-xinyu/tensorrtx/assets/15235574/061c90a7-fe59-48e0-a8d0-6bddc4296cf1\">\n</p>\n\n# Test Environment\n\nRTX3080 / Windows10 22H2 / cuda11.8 / cudnn8.9.7 / TensorRT8.5.3 / OpenCV4.6\n\n# How to Run\n\n1. train to generate weight files (`efficientAD_[category].pt`)\n\n   ```\n   // Please refer to Anomalib's tutorial for details:\n   // https://github.com/openvinotoolkit/anomalib?tab=readme-ov-file#-training\n   ```\n\n2. generate `.wts` from PyTorch with `.pt`\n\n   ```\n   cd ./datas/models/\n   // copy your `.pt` file to the current directory.\n   python gen_wts.py\n   // a file `efficientAD_[category].wts` will be generated.\n   ```\n\n3. build and run\n\n   ```\n   mkdir build\n   cd build\n   cmake ..\n   make\n   sudo ./efficientAD_det -s [.wts] [.engine] // serialize model to plan file\n   sudo ./efficientAD_det -d [.engine] [image folder] // deserialize and run inference, the images in [image folder] will be processed\n   ```\n\n# Latency\n\nAverage cost of `infer` (in `efficientAD_det.cpp`), measured from the second run onward, with batch=1 under the Windows environment above\n\n|               | FP32 |\n| :-----------: | :--: |\n| EfficientAD-M | 12ms |\n"
  },
  {
    "path": "efficient_ad/efficientAD_det.cpp",
    "content": "#include <cuda_runtime.h>\n\n#include <chrono>\n#include <cmath>\n#include <cstdint>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n\n#include \"config.h\"\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"utils.h\"\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n// const static int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\nconst static int kInputSize = 3 * 256 * 256;\nconst static int kOutputSize = 1 * 256 * 256;\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, float& gd, float& gw,\n                std::string& img_dir) {\n    if (argc != 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\") {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n    } else if (std::string(argv[1]) == \"-d\") {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nvoid prepare_infer_buffers(ICudaEngine* engine, float** gpu_input_buffer, float** gpu_output_buffer,\n                           float** cpu_output_buffer) {\n    // assert(engine->getNbIOTensors() == 2);\n    assert(engine->getNbBindings() == 2);\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    // nvinfer1::Dims outputDims = engine->getBindingDimensions(outputIndex);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n\n    // Create GPU in/output buffers on device\n    CUDA_CHECK(cudaMalloc((void**)gpu_input_buffer, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)gpu_output_buffer, kBatchSize * 1 * 
kOutputSize * sizeof(float)));  // 3 or 1 ??\n    // Create CPU output buffers on host\n    *cpu_output_buffer = new float[kBatchSize * kOutputSize];\n}\n\nvoid preprocessImg(cv::Mat& img, int newh, int neww) {\n    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    cv::resize(img, img, cv::Size(neww, newh));\n    img.convertTo(img, CV_32FC3);\n    // ImageNet normalize\n    img /= 255.0f;\n    img -= cv::Scalar(0.485, 0.456, 0.406);\n    img /= cv::Scalar(0.229, 0.224, 0.225);\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, std::vector<void*>& gpu_buffers,\n           std::vector<float>& cpu_input_data, std::vector<float>& cpu_output_data, int batchsize) {\n    // copy input data from host (CPU) to device (GPU)\n    CUDA_CHECK(cudaMemcpyAsync(gpu_buffers[0], cpu_input_data.data(), cpu_input_data.size() * sizeof(float),\n                               cudaMemcpyHostToDevice, stream));\n    // execute inference using context provided by engine\n    context.enqueue(batchsize, gpu_buffers.data(), stream, nullptr);\n    // copy output back from device (GPU) to host (CPU)\n    CUDA_CHECK(cudaMemcpyAsync(cpu_output_data.data(), gpu_buffers[1], batchsize * kOutputSize * sizeof(float),\n                               cudaMemcpyDeviceToHost, stream));\n    // synchronize the stream to prevent issues (block CUDA and wait for CUDA operations to be completed)\n    cudaStreamSynchronize(stream);\n}\n\nvoid serialize_engine(unsigned int max_batchsize, float& gd, float& gw, std::string& wts_name,\n                      std::string& engine_name) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = nullptr;\n    engine = build_efficientAD_engine(max_batchsize, builder, config, DataType::kFLOAT, gd, gw, wts_name);\n    assert(engine != nullptr);\n\n    // Serialize 
the engine\n    IHostMemory* serialized_engine = engine->serialize();\n    assert(serialized_engine != nullptr);\n\n    // Save engine to file\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cerr << \"Could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    // Close everything down\n    engine->destroy();\n    config->destroy();\n    serialized_engine->destroy();\n    builder->destroy();\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine != nullptr);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n\n    delete[] serialized_engine;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n\n    std::string wts_name = \"\";\n    std::string engine_name = \"\";\n    float gd = 1.0f, gw = 1.0f;\n    std::string img_dir;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, gd, gw, img_dir)) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./efficientad_det -s [.wts] [.engine]  // serialize model to plan file\" << std::endl;\n        std::cerr\n                << \"./efficientad_det -d [.engine] [../../datas/images/...]  
// deserialize plan file and run inference\"\n                << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(kBatchSize, gd, gw, wts_name, engine_name);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n\n    // create CUDA stream for simultaneous CUDA operations\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    // prepare cpu and gpu buffers\n    void *gpu_input_buffer, *gpu_output_buffer;\n    CUDA_CHECK(cudaMalloc(&gpu_input_buffer, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc(&gpu_output_buffer, kBatchSize * 1 * kOutputSize * sizeof(float)));  // 3 or 1 ??\n    std::vector<void*> gpu_buffers = {gpu_input_buffer, gpu_output_buffer};\n    std::vector<float> cpu_input_data(kBatchSize * kInputSize, 0);\n    std::vector<float> cpu_output_data(kBatchSize * kOutputSize, 0);\n\n    // read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    std::vector<cv::Mat> originImg_batch;\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            originImg_batch.push_back(img.clone());\n            preprocessImg(img, kInputW, kInputH);\n            assert(img.cols * img.rows * 3 == 3 * 256 * 256);\n            for (int c = 0; c < 3; 
c++) {\n                for (int h = 0; h < img.rows; h++) {\n                    for (int w = 0; w < img.cols; w++) {\n                        cpu_input_data[c * img.rows * img.cols + h * img.cols + w] = img.at<cv::Vec3f>(h, w)[c];\n                    }\n                }\n            }\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        // infer(*context, stream, (void**)gpu_buffers, cpu_input_data, cpu_output_buffer, kBatchSize);\n        infer(*context, stream, gpu_buffers, cpu_input_data, cpu_output_data,\n              kBatchSize);  // change to save into vec `cpu_output_data`\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n\n        // postProcess\n        cv::Mat img_1(256, 256, CV_8UC1);\n        for (int row = 0; row < 256; row++) {\n            for (int col = 0; col < 256; col++) {\n                float value = cpu_output_data[row * 256 + col];\n                if (value < 0)  // clip(0,1)\n                    value = 0;\n                else if (value > 1)\n                    value = 1;\n                img_1.at<uchar>(row, col) = static_cast<uchar>(value * 255);\n            }\n        }\n\n        cv::Mat HeatMap, colorMap;\n        // genHeatMap(img_batch[0], img_1, HeatMap);\n        cv::applyColorMap(img_1, colorMap, cv::COLORMAP_JET);\n        cv::resize(originImg_batch[i], originImg_batch[i], cv::Size(256, 256));\n        cv::cvtColor(originImg_batch[i], originImg_batch[i], cv::COLOR_RGB2BGR);\n        cv::addWeighted(originImg_batch[i], 0.5, colorMap, 0.5, 0, HeatMap);\n\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_output\" + img_name_batch[j], img_1);\n            
cv::imwrite(\"_heatmap\" + img_name_batch[j], HeatMap);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(gpu_buffers[0]));\n    CUDA_CHECK(cudaFree(gpu_buffers[1]));\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    return 0;\n}\n"
  },
  {
    "path": "efficient_ad/src/config.h",
    "content": "#pragma once\n\n/* --------------------------------------------------------\n * These configs are related to tensorrt model, if these are changed,\n * please re-compile and re-serialize the tensorrt model.\n * --------------------------------------------------------*/\n\n// For INT8, you need to prepare the calibration dataset, please refer to\n#define USE_FP32  // set USE_INT8 or USE_FP16 or USE_FP32\n\n// These are used to define input/output tensor names,\n// you can set them to whatever you want.\nconst static char* kInputTensorName = \"data\";\nconst static char* kOutputTensorName = \"prob\";\n\nconstexpr static int kBatchSize = 1;\n\n// input width and height must be divisible by 32\nconstexpr static int kInputH = 256;\nconstexpr static int kInputW = 256;\n\n/* --------------------------------------------------------\n * These configs are NOT related to tensorrt model, if these are changed,\n * please re-compile, but no need to re-serialize the tensorrt model.\n * --------------------------------------------------------*/\n\n// default GPU_id\nconst static int kGpuId = 0;\n\n// If your image size is larger than 4096 * 3112, please increase this value\nconst static int kMaxInputImageSize = 4096 * 3112;\n"
  },
  {
    "path": "efficient_ad/src/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n\n// CUDA_CHECK uses assert and std::cerr, so include their headers here\n#include <cassert>\n#include <iostream>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n"
  },
  {
    "path": "efficient_ad/src/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n 
   virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "efficient_ad/src/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "efficient_ad/src/model.cpp",
    "content": "#include \"model.h\"\n\n#include <cassert>\n#include <cmath>\n#include <cstring>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n\n#include \"config.h\"\n\nusing namespace nvinfer1;\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstatic std::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nvoid printNetworkLayers(INetworkDefinition* network) {\n    int numLayers = network->getNbLayers();\n    // std::cout << \"currently num of layers: \" << numLayers << std::endl;\n\n    auto dataTypeToString = [](DataType type) {\n        switch (type) {\n            case DataType::kFLOAT:\n                return \"kFLOAT\";\n            case DataType::kHALF:\n                return \"kHALF\";\n            case DataType::kINT8:\n                return \"kINT8\";\n            case DataType::kINT32:\n                return 
\"kINT32\";\n            case DataType::kBOOL:\n                return \"kBOOL\";\n            default:\n                return \"Unknown\";\n        }\n    };\n\n    for (int i = 0; i < numLayers; ++i) {\n        ILayer* layer = network->getLayer(i);\n        std::cout << \"--- Layer\" << i << \" = \" << layer->getName() << std::endl;\n        std::cout << \"input & output tensor type: \" << dataTypeToString(layer->getInput(0)->getType()) << \"\\t\"\n                  << dataTypeToString(layer->getOutput(0)->getType()) << std::endl;\n\n        // input\n        int inTensorNum = layer->getNbInputs();\n        for (int j = 0; j < inTensorNum; ++j) {\n            // std::cout << layer->getInput(j)->getDimensions().nbDims;\n            Dims dims_in = layer->getInput(j)->getDimensions();\n            std::cout << \"input shape[\" << j << \"]: (\";\n            for (int k = 0; k < dims_in.nbDims; ++k) {\n                std::cout << dims_in.d[k];\n                if (k < dims_in.nbDims - 1) {\n                    std::cout << \", \";\n                }\n            }\n            std::cout << \")\\t\";\n        }\n        std::cout << std::endl;\n\n        // output\n        int outTensorNum = layer->getNbOutputs();\n        for (int j = 0; j < outTensorNum; ++j) {\n            // std::cout << layer->getOutput(j)->getName();\n            Dims dims_out = layer->getOutput(j)->getDimensions();\n            std::cout << \"output shape: (\";\n            for (int k = 0; k < dims_out.nbDims; ++k) {\n                std::cout << dims_out.d[k];\n                if (k < dims_out.nbDims - 1) {\n                    std::cout << \", \";\n                }\n            }\n            std::cout << \")\";\n        }\n        std::cout << \"\\n\" << std::endl;\n    }\n}\n\nstatic IScaleLayer* NormalizeInput(INetworkDefinition* network, ITensor& input) {\n    float meanValues[3] = {-0.485f, -0.456f, -0.406f};\n    float stdValues[3] = {1.0f / 0.229f, 1.0f / 0.224f, 1.0f / 0.225f};\n    
Weights meanWeights{DataType::kFLOAT, meanValues, 3};\n    Weights stdWeights{DataType::kFLOAT, stdValues, 3};\n\n    IScaleLayer* NormaLayer = network->addScale(input, ScaleMode::kCHANNEL, meanWeights, stdWeights, Weights{});\n    assert(NormaLayer != nullptr);\n\n    return NormaLayer;\n}\n\nstatic IScaleLayer* NormalizeTeacherMap(INetworkDefinition* network, std::map<std::string, Weights>& weightMap,\n                                        ITensor& input) {\n    float* mean = (float*)weightMap[\"mean_std.mean\"].values;\n    float* std = (float*)weightMap[\"mean_std.std\"].values;\n    int len = weightMap[\"mean_std.mean\"].count;\n\n    // 1.scale\n    float* scaleVal = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scaleVal[i] = 1.0 / std[i];\n    }\n    Weights scale{DataType::kFLOAT, scaleVal, len};\n\n    // 2.shift\n    float* shiftVal = nullptr;\n    shiftVal = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shiftVal[i] = -mean[i];\n    }\n    Weights shift{DataType::kFLOAT, shiftVal, len};\n\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, Weights{}, Weights{});\n    assert(scale_1);\n    IScaleLayer* scale_2 = network->addScale(*scale_1->getOutput(0), ScaleMode::kCHANNEL, Weights{}, scale, Weights{});\n    assert(scale_2);\n\n    return scale_2;\n}\n\nstatic ILayer* NormalizeFinalMap(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\n                                 std::string name) {\n    float* qa = (float*)weightMap[\"quantiles.qa_\" + name].values;\n    float* qb = (float*)weightMap[\"quantiles.qb_\" + name].values;\n    int len = weightMap[\"quantiles.qa_\" + name].count;\n\n    Weights qbWeight_2{DataType::kFLOAT, qb, len};\n\n    // fmap_st - qa_st\n    float* shiftVal_1 = nullptr;\n    shiftVal_1 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 
0; i < len; i++) {\n        shiftVal_1[i] = -qa[i];\n    }\n    Weights qa_shiftWeight_1{DataType::kFLOAT, shiftVal_1, len};\n    IScaleLayer* mapNorm_subLayer_1 =\n            network->addScale(input, ScaleMode::kUNIFORM, qa_shiftWeight_1, Weights{}, Weights{});\n    assert(mapNorm_subLayer_1);\n\n    // qb_st - qa_st\n    float* shiftVal_2 = nullptr;\n    shiftVal_2 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shiftVal_2[i] = qb[i] - qa[i];\n    }\n\n    // (fmap_st - qa_st) / (qb_st - qa_st)\n    float* scaleVal_1 = nullptr;\n    scaleVal_1 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scaleVal_1[i] = 1.0f / shiftVal_2[i];\n    }\n    Weights scaleWeight_1{DataType::kFLOAT, scaleVal_1, len};\n    IScaleLayer* mapNorm_divLayer_1 = network->addScale(*mapNorm_subLayer_1->getOutput(0), ScaleMode::kUNIFORM,\n                                                        Weights{}, scaleWeight_1, Weights{});\n    assert(mapNorm_divLayer_1);\n\n    // ((fmap_st - qa_st) / (qb_st - qa_st)) * 0.1\n    float* scaleVal_2 = nullptr;\n    scaleVal_2 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scaleVal_2[i] = 0.1f;\n    }\n    Weights scaleWeight_2{DataType::kFLOAT, scaleVal_2, 1};\n    IScaleLayer* mapNorm_Layer = network->addScale(*mapNorm_divLayer_1->getOutput(0), ScaleMode::kUNIFORM, Weights{},\n                                                   scaleWeight_2, Weights{});\n    assert(mapNorm_Layer);\n\n    return mapNorm_Layer;\n}\n\nstatic ILayer* convRelu(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\n                        int outch, int ksize, int s, int p, int g, std::string lname, bool withRelu) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(\n            input, outch, DimsHW{ksize, ksize}, 
weightMap[lname + \".weight\"],\n            weightMap[lname + \".bias\"]);  // if without bias weights, the results won't match with torch version\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n    conv1->setNbGroups(g);\n    conv1->setName((lname).c_str());\n\n    if (!withRelu)\n        return conv1;\n\n    auto relu = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    assert(relu);\n\n    return relu;\n}\n\nstatic IResizeLayer* interpolate(INetworkDefinition* network, ITensor& input, Dims upsampleScale,\n                                 ResizeMode resizeMode) {\n    IResizeLayer* interpolateLayer = network->addResize(input);\n    assert(interpolateLayer);\n    interpolateLayer->setOutputDimensions(upsampleScale);\n    interpolateLayer->setResizeMode(resizeMode);\n\n    return interpolateLayer;\n}\n\nstatic ILayer* interpConvRelu(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\n                              int outch, int ksize, int s, int p, int g, std::string lname, int dim) {\n    IResizeLayer* interpolateLayer = network->addResize(input);\n    assert(interpolateLayer != nullptr);\n    interpolateLayer->setOutputDimensions(Dims3{input.getDimensions().d[0], dim, dim});\n    interpolateLayer->setResizeMode(ResizeMode::kLINEAR);\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*interpolateLayer->getOutput(0), outch, DimsHW{ksize, ksize},\n                                                         weightMap[lname + \".weight\"], weightMap[lname + \".bias\"]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n    conv1->setNbGroups(g);\n    conv1->setName((lname + \".conv\").c_str());\n\n    auto relu = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    assert(relu);\n\n    return relu;\n}\n\nstatic IPoolingLayer* avgPool2d(INetworkDefinition* network, ITensor& input, 
int kernelSize, int stride, int padding) {\n    IPoolingLayer* poolLayer = network->addPooling(input, PoolingType::kAVERAGE, DimsHW{kernelSize, kernelSize});\n    assert(poolLayer);\n    poolLayer->setStride(DimsHW{stride, stride});\n    poolLayer->setPadding(DimsHW{padding, padding});\n\n    return poolLayer;\n}\n\nstatic void slice(INetworkDefinition* network, ITensor& input, std::vector<ITensor*>& layer_vec) {\n    Dims inputDims = input.getDimensions();\n    ISliceLayer* slice1 = network->addSlice(input, Dims3{0, 0, 0},\n                                            Dims3{inputDims.d[0] / 2, inputDims.d[1], inputDims.d[2]}, Dims3{1, 1, 1});\n    assert(slice1);\n\n    ISliceLayer* slice2 = network->addSlice(input, Dims3{inputDims.d[0] / 2, 0, 0},\n                                            Dims3{inputDims.d[0] / 2, inputDims.d[1], inputDims.d[2]}, Dims3{1, 1, 1});\n    assert(slice2);\n\n    layer_vec.push_back(slice1->getOutput(0));\n    layer_vec.push_back(slice2->getOutput(0));\n}\n\nstatic IElementWiseLayer* mergeMap(INetworkDefinition* network, ITensor& input1, ITensor& input2) {\n    float* scaleVal = nullptr;\n    scaleVal = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    for (int i = 0; i < 1; i++) {\n        scaleVal[i] = 0.5f;\n    }\n    Weights scaleWeight{DataType::kFLOAT, scaleVal, 1};\n    IScaleLayer* mergeMapLayer1 = network->addScale(input1, ScaleMode::kUNIFORM, Weights{}, scaleWeight, Weights{});\n    assert(mergeMapLayer1);\n\n    IScaleLayer* mergeMapLayer2 = network->addScale(input2, ScaleMode::kUNIFORM, Weights{}, scaleWeight, Weights{});\n    assert(mergeMapLayer2);\n\n    IElementWiseLayer* mergedMapLayer = network->addElementWise(\n            *mergeMapLayer1->getOutput(0), *mergeMapLayer2->getOutput(0), ElementWiseOperation::kSUM);\n    assert(mergedMapLayer);\n\n    return mergedMapLayer;\n}\n\nICudaEngine* build_efficientAD_engine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt,\n           
                           float& gd, float& gw, std::string& wts_name) {\n    /* create network object */\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    /* create input tensor {3, kInputH, kInputW} */\n    ITensor* InputData = network->addInput(kInputTensorName, dt, Dims3{3, kInputH, kInputW});\n    assert(InputData);\n\n    /* create weight map */\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n    /* AE */\n    // auto BN1 = NormalizeInput(network, *InputData);\n    // encoder\n    auto enconv1 = convRelu(network, weightMap, *InputData, 32, 4, 2, 1, 1, \"ae.encoder.enconv1\", true);\n    auto enconv2 = convRelu(network, weightMap, *enconv1->getOutput(0), 32, 4, 2, 1, 1, \"ae.encoder.enconv2\", true);\n    auto enconv3 = convRelu(network, weightMap, *enconv2->getOutput(0), 64, 4, 2, 1, 1, \"ae.encoder.enconv3\", true);\n    auto enconv4 = convRelu(network, weightMap, *enconv3->getOutput(0), 64, 4, 2, 1, 1, \"ae.encoder.enconv4\", true);\n    auto enconv5 = convRelu(network, weightMap, *enconv4->getOutput(0), 64, 4, 2, 1, 1, \"ae.encoder.enconv5\", true);\n    auto enconv6 = convRelu(network, weightMap, *enconv5->getOutput(0), 64, 8, 1, 0, 1, \"ae.encoder.enconv6\", false);\n    // decoder\n    auto deconv1 = interpConvRelu(network, weightMap, *enconv6->getOutput(0), 64, 4, 1, 2, 1, \"ae.decoder.deconv1\", 3);\n    auto deconv2 = interpConvRelu(network, weightMap, *deconv1->getOutput(0), 64, 4, 1, 2, 1, \"ae.decoder.deconv2\", 8);\n    auto deconv3 = interpConvRelu(network, weightMap, *deconv2->getOutput(0), 64, 4, 1, 2, 1, \"ae.decoder.deconv3\", 15);\n    auto deconv4 = interpConvRelu(network, weightMap, *deconv3->getOutput(0), 64, 4, 1, 2, 1, \"ae.decoder.deconv4\", 32);\n    auto deconv5 = interpConvRelu(network, weightMap, *deconv4->getOutput(0), 64, 4, 1, 2, 1, \"ae.decoder.deconv5\", 63);\n    auto deconv6 =\n            interpConvRelu(network, weightMap, *deconv5->getOutput(0), 64, 4, 1, 2, 1, 
\"ae.decoder.deconv6\", 127);\n    auto deconv7 = interpConvRelu(network, weightMap, *deconv6->getOutput(0), 64, 3, 1, 1, 1, \"ae.decoder.deconv7\", 56);\n    auto deconv8 = convRelu(network, weightMap, *deconv7->getOutput(0), 384, 3, 1, 1, 1, \"ae.decoder.deconv8\", false);\n\n    /* PDN_medium_teacher */\n    // no BN added after the convolutional layer\n    auto teacher1 = convRelu(network, weightMap, *InputData, 256, 4, 1, 0, 1, \"teacher.conv1\", true);\n    auto avgPool1 = avgPool2d(network, *teacher1->getOutput(0), 2, 2, 0);\n    auto teacher2 = convRelu(network, weightMap, *avgPool1->getOutput(0), 512, 4, 1, 0, 1, \"teacher.conv2\", true);\n    auto avgPool2 = avgPool2d(network, *teacher2->getOutput(0), 2, 2, 0);\n    auto teacher3 = convRelu(network, weightMap, *avgPool2->getOutput(0), 512, 1, 1, 0, 1, \"teacher.conv3\", true);\n    auto teacher4 = convRelu(network, weightMap, *teacher3->getOutput(0), 512, 3, 1, 0, 1, \"teacher.conv4\", true);\n    auto teacher5 = convRelu(network, weightMap, *teacher4->getOutput(0), 384, 4, 1, 0, 1, \"teacher.conv5\", true);\n    auto teacher6 = convRelu(network, weightMap, *teacher5->getOutput(0), 384, 1, 1, 0, 1, \"teacher.conv6\", false);\n\n    /* PDN_medium_student */\n    auto student1 = convRelu(network, weightMap, *InputData, 256, 4, 1, 0, 1, \"student.conv1\", true);\n    auto avgPool3 = avgPool2d(network, *student1->getOutput(0), 2, 2, 0);\n    auto student2 = convRelu(network, weightMap, *avgPool3->getOutput(0), 512, 4, 1, 0, 1, \"student.conv2\", true);\n    auto avgPool4 = avgPool2d(network, *student2->getOutput(0), 2, 2, 0);\n    auto student3 = convRelu(network, weightMap, *avgPool4->getOutput(0), 512, 1, 1, 0, 1, \"student.conv3\", true);\n    auto student4 = convRelu(network, weightMap, *student3->getOutput(0), 512, 3, 1, 0, 1, \"student.conv4\", true);\n    auto student5 = convRelu(network, weightMap, *student4->getOutput(0), 768, 4, 1, 0, 1, \"student.conv5\", true);\n    auto student6 = 
convRelu(network, weightMap, *student5->getOutput(0), 768, 1, 1, 0, 1, \"student.conv6\", false);\n\n    /* postCalculate */\n    auto normal_teacher_output = NormalizeTeacherMap(network, weightMap, *teacher6->getOutput(0));\n    std::vector<ITensor*> layer_vec{};\n    slice(network, *student6->getOutput(0), layer_vec);\n    ITensor* y_st = layer_vec[0];\n    ITensor* y_stae = layer_vec[1];\n\n    // distance_st\n    IElementWiseLayer* sub_st =\n            network->addElementWise(*normal_teacher_output->getOutput(0), *y_st, ElementWiseOperation::kSUB);\n    assert(sub_st);\n    IElementWiseLayer* distance_st =\n            network->addElementWise(*sub_st->getOutput(0), *sub_st->getOutput(0), ElementWiseOperation::kPROD);\n    assert(distance_st);\n\n    // distance_stae\n    IElementWiseLayer* sub_stae = network->addElementWise(*deconv8->getOutput(0), *y_stae, ElementWiseOperation::kSUB);\n    assert(sub_stae);\n    IElementWiseLayer* distance_stae =\n            network->addElementWise(*sub_stae->getOutput(0), *sub_stae->getOutput(0), ElementWiseOperation::kPROD);\n    assert(distance_stae);\n\n    IReduceLayer* map_st = network->addReduce(*distance_st->getOutput(0), ReduceOperation::kAVG, 1, true);\n    assert(map_st);\n    IReduceLayer* map_stae = network->addReduce(*distance_stae->getOutput(0), ReduceOperation::kAVG, 1, true);\n    assert(map_stae);\n\n    IPaddingLayer* padMap_st = network->addPadding(*map_st->getOutput(0), DimsHW{4, 4}, DimsHW{4, 4});\n    assert(padMap_st);\n    IPaddingLayer* padMap_stae = network->addPadding(*map_stae->getOutput(0), DimsHW{4, 4}, DimsHW{4, 4});\n    assert(padMap_stae);\n\n    IResizeLayer* interpMap_st =\n            interpolate(network, *padMap_st->getOutput(0),\n                        Dims3{padMap_st->getOutput(0)->getDimensions().d[0], 256, 256}, ResizeMode::kLINEAR);\n    assert(interpMap_st);\n    IResizeLayer* interpMap_stae =\n            interpolate(network, *padMap_stae->getOutput(0),\n                        
Dims3{padMap_stae->getOutput(0)->getDimensions().d[0], 256, 256}, ResizeMode::kLINEAR);\n    assert(interpMap_stae);\n\n    ILayer* normalizedMap_st = NormalizeFinalMap(network, weightMap, *interpMap_st->getOutput(0), \"st\");\n    assert(normalizedMap_st);\n    ILayer* normalizedMap_stae = NormalizeFinalMap(network, weightMap, *interpMap_stae->getOutput(0), \"ae\");\n    assert(normalizedMap_stae);\n\n    // Merge the student-teacher map with the student-autoencoder map (0.5 * st + 0.5 * stae)\n    IElementWiseLayer* mergedMapLayer =\n            mergeMap(network, *normalizedMap_st->getOutput(0), *normalizedMap_stae->getOutput(0));\n    printNetworkLayers(network);\n\n    /* output */\n    mergedMapLayer->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*mergedMapLayer->getOutput(0));\n\n    /* Engine config */\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2* calibrator =\n            new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n"
  },
  {
    "path": "efficient_ad/src/model.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n\n#include <string>\n\nnvinfer1::ICudaEngine* build_efficientAD_engine(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                                nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, float& gd,\n                                                float& gw, std::string& wts_name);\n"
  },
  {
    "path": "efficient_ad/src/postprocess.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n\nvoid genHeatMap(cv::Mat originImg, cv::Mat& anomalyGrayMap, cv::Mat& HeatMap) {\n    cv::Mat colorMap;\n    cv::applyColorMap(colorMap, anomalyGrayMap, cv::COLORMAP_JET);\n    cv::addWeighted(originImg, 0.5, colorMap, 0.5, 0, HeatMap);\n}\n"
  },
  {
    "path": "efficient_ad/src/utils.h",
    "content": "#pragma once\n\n#include <dirent.h>\n#include <cstring>\n#include <fstream>\n#include <sstream>\n#include <string>\n#include <unordered_map>\n#include <vector>\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR* p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n"
  },
  {
    "path": "efficientnet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(efficientnet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nadd_executable(efficientnet  ${PROJECT_SOURCE_DIR}/efficientnet.cpp)\ntarget_link_libraries(efficientnet nvinfer)\ntarget_link_libraries(efficientnet cudart)\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "efficientnet/README.md",
    "content": "# EfficientNet\n\nA TensorRT implementation of EfficientNet.\nFor the Pytorch implementation, you can refer to [EfficientNet-PyTorch](https://github.com/lukemelas/EfficientNet-PyTorch)\n\n## How to run\n\n1. install `efficientnet_pytorch`\n```\npip install efficientnet_pytorch\n```\n\n2. gennerate `.wts` file\n```\npython gen_wts.py\n```\n\n3. build\n\n```\nmkdir build\ncd build\ncmake ..\nmake\n```\n4. serialize model to engine\n```\n./efficientnet -s [.wts] [.engine] [b0 b1 b2 b3 ... b7]  // serialize model to engine file\n```\nsuch as\n```\n./efficientnet -s ../efficientnet-b3.wts efficientnet-b3.engine b3\n```\n5. deserialize and do infer\n```\n./efficientnet -d [.engine] [b0 b1 b2 b3 ... b7]   // deserialize engine file and run inference\n```\nsuch as \n```\n./efficientnet -d efficientnet-b3.engine b3\n```\n6. see if the output is same as pytorch side\n\n\nFor more models, please refer to [tensorrtx](https://github.com/wang-xinyu/tensorrtx)\n"
  },
  {
    "path": "efficientnet/efficientnet.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include \"utils.hpp\"\n\n#define USE_FP32 //USE_FP16\n#define INPUT_NAME \"data\"\n#define OUTPUT_NAME \"prob\"\n#define MAX_BATCH_SIZE 8\n\nusing namespace nvinfer1;\nstatic Logger gLogger;\n\nstatic std::vector<BlockArgs>\n\tblock_args_list = {\n\t\tBlockArgs{1, 3, 1, 1, 32, 16, 0.25, true},\n\t\tBlockArgs{2, 3, 2, 6, 16, 24, 0.25, true},\n\t\tBlockArgs{2, 5, 2, 6, 24, 40, 0.25, true},\n\t\tBlockArgs{3, 3, 2, 6, 40, 80, 0.25, true},\n\t\tBlockArgs{3, 5, 1, 6, 80, 112, 0.25, true},\n\t\tBlockArgs{4, 5, 2, 6, 112, 192, 0.25, true},\n\t\tBlockArgs{1, 3, 1, 6, 192, 320, 0.25, true}};\n\nstatic std::map<std::string, GlobalParams>\n\tglobal_params_map = {\n\t\t// input_h,input_w,num_classes,batch_norm_epsilon,\n\t\t// width_coefficient,depth_coefficient,depth_divisor, min_depth\n\t\t{\"b0\", GlobalParams{224, 224, 1000, 0.001, 1.0, 1.0, 8, -1}},\n\t\t{\"b1\", GlobalParams{240, 240, 1000, 0.001, 1.0, 1.1, 8, -1}},\n\t\t{\"b2\", GlobalParams{260, 260, 1000, 0.001, 1.1, 1.2, 8, -1}},\n\t\t{\"b3\", GlobalParams{300, 300, 1000, 0.001, 1.2, 1.4, 8, -1}},\n\t\t{\"b4\", GlobalParams{380, 380, 1000, 0.001, 1.4, 1.8, 8, -1}},\n\t\t{\"b5\", GlobalParams{456, 456, 1000, 0.001, 1.6, 2.2, 8, -1}},\n\t\t{\"b6\", GlobalParams{528, 528, 1000, 0.001, 1.8, 2.6, 8, -1}},\n\t\t{\"b7\", GlobalParams{600, 600, 1000, 0.001, 2.0, 3.1, 8, -1}},\n\t\t{\"b8\", GlobalParams{672, 672, 1000, 0.001, 2.2, 3.6, 8, -1}},\n\t\t{\"l2\", GlobalParams{800, 800, 1000, 0.001, 4.3, 5.3, 8, -1}},\n};\n\nICudaEngine *createEngine(unsigned int maxBatchSize, IBuilder *builder, IBuilderConfig *config, DataType dt, std::string path_wts, std::vector<BlockArgs> block_args_list, GlobalParams global_params)\n{\n\tfloat bn_eps = global_params.batch_norm_epsilon;\n\tDimsHW image_size = 
DimsHW{global_params.input_h, global_params.input_w};\n\n\tstd::map<std::string, Weights> weightMap = loadWeights(path_wts);\n\tWeights emptywts{DataType::kFLOAT, nullptr, 0};\n\tINetworkDefinition *network = builder->createNetworkV2(0U);\n\tITensor *data = network->addInput(INPUT_NAME, dt, Dims3{3, global_params.input_h, global_params.input_w});\n\tassert(data);\n\n\tint out_channels = roundFilters(32, global_params);\n\tauto conv_stem = addSamePaddingConv2d(network, weightMap, *data, out_channels, 3, 2, 1, 1, image_size, \"_conv_stem\");\n\tauto bn0 = addBatchNorm2d(network, weightMap, *conv_stem->getOutput(0), \"_bn0\", bn_eps);\n\tauto swish0 = addSwish(network, *bn0->getOutput(0));\n\tITensor *x = swish0->getOutput(0);\n\timage_size = calculateOutputImageSize(image_size, 2);\n\tint block_id = 0;\n\tfor (int i = 0; i < block_args_list.size(); i++)\n\t{\n\t\tBlockArgs block_args = block_args_list[i];\n\n\t\tblock_args.input_filters = roundFilters(block_args.input_filters, global_params);\n\t\tblock_args.output_filters = roundFilters(block_args.output_filters, global_params);\n\t\tblock_args.num_repeat = roundRepeats(block_args.num_repeat, global_params);\n\t\tx = MBConvBlock(network, weightMap, *x, \"_blocks.\" + std::to_string(block_id), block_args, global_params, image_size);\n\n\t\tassert(x);\n\t\tblock_id++;\n\t\timage_size = calculateOutputImageSize(image_size, block_args.stride);\n\t\tif (block_args.num_repeat > 1)\n\t\t{\n\t\t\tblock_args.input_filters = block_args.output_filters;\n\t\t\tblock_args.stride = 1;\n\t\t}\n\t\tfor (int r = 0; r < block_args.num_repeat - 1; r++)\n\t\t{\n\t\t\tx = MBConvBlock(network, weightMap, *x, \"_blocks.\" + std::to_string(block_id), block_args, global_params, image_size);\n\t\t\tblock_id++;\n\t\t}\n\t}\n\tout_channels = roundFilters(1280, global_params);\n\tauto conv_head = addSamePaddingConv2d(network, weightMap, *x, out_channels, 1, 1, 1, 1, image_size, \"_conv_head\", false);\n\tauto bn1 = addBatchNorm2d(network, 
weightMap, *conv_head->getOutput(0), \"_bn1\", bn_eps);\n\tauto swish1 = addSwish(network, *bn1->getOutput(0));\n\tauto avg_pool = network->addPoolingNd(*swish1->getOutput(0), PoolingType::kAVERAGE, image_size);\n\n\tIFullyConnectedLayer *final = network->addFullyConnected(*avg_pool->getOutput(0), global_params.num_classes, weightMap[\"_fc.weight\"], weightMap[\"_fc.bias\"]);\n\tassert(final);\n\n\tfinal->getOutput(0)->setName(OUTPUT_NAME);\n\tnetwork->markOutput(*final->getOutput(0));\n\n\t// Build engine\n\tbuilder->setMaxBatchSize(maxBatchSize);\n\tconfig->setMaxWorkspaceSize(1 << 20);\n#ifdef USE_FP16\n\tconfig->setFlag(BuilderFlag::kFP16);\n#endif\n\tstd::cout << \"build engine ...\" << std::endl;\n\n\tICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);\n\tassert(engine != nullptr);\n\n\tstd::cout << \"build finished\" << std::endl;\n\t// Don't need the network any more\n\tnetwork->destroy();\n\t// Release host memory\n\tfor (auto &mem : weightMap)\n\t{\n\t\tfree((void *)(mem.second.values));\n\t}\n\n\treturn engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory **modelStream, std::string wtsPath, std::vector<BlockArgs> block_args_list, GlobalParams global_params)\n{\n\t// Create builder\n\tIBuilder *builder = createInferBuilder(gLogger);\n\tIBuilderConfig *config = builder->createBuilderConfig();\n\n\t// Create model to populate the network, then set the outputs and create an engine\n\tICudaEngine *engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT, wtsPath, block_args_list, global_params);\n\tassert(engine != nullptr);\n\n\t// Serialize the engine\n\t(*modelStream) = engine->serialize();\n\n\t// Close everything down\n\tengine->destroy();\n\tbuilder->destroy();\n\tconfig->destroy();\n}\nvoid doInference(IExecutionContext &context, float *input, float *output, int batchSize, GlobalParams global_params)\n{\n\tconst ICudaEngine &engine = context.getEngine();\n\n\t// Pointers to input and output device 
buffers to pass to engine.\n\t// Engine requires exactly IEngine::getNbBindings() number of buffers.\n\tassert(engine.getNbBindings() == 2);\n\tvoid *buffers[2];\n\n\t// In order to bind the buffers, we need to know the names of the input and output tensors.\n\t// Note that indices are guaranteed to be less than IEngine::getNbBindings()\n\tconst int inputIndex = engine.getBindingIndex(INPUT_NAME);\n\tconst int outputIndex = engine.getBindingIndex(OUTPUT_NAME);\n\n\t// Create GPU buffers on device\n\tCHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * global_params.input_h * global_params.input_w * sizeof(float)));\n\tCHECK(cudaMalloc(&buffers[outputIndex], batchSize * global_params.num_classes * sizeof(float)));\n\n\t// Create stream\n\tcudaStream_t stream;\n\tCHECK(cudaStreamCreate(&stream));\n\n\t// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n\tCHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * global_params.input_h * global_params.input_w * sizeof(float), cudaMemcpyHostToDevice, stream));\n\tcontext.enqueue(batchSize, buffers, stream, nullptr);\n\tCHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * global_params.num_classes * sizeof(float), cudaMemcpyDeviceToHost, stream));\n\tcudaStreamSynchronize(stream);\n\n\t// Release stream and buffers\n\tcudaStreamDestroy(stream);\n\tCHECK(cudaFree(buffers[inputIndex]));\n\tCHECK(cudaFree(buffers[outputIndex]));\n}\n\nbool parse_args(int argc, char **argv, std::string &wts, std::string &engine, std::string &backbone)\n{\n\tif (std::string(argv[1]) == \"-s\" && argc == 5)\n\t{\n\t\twts = std::string(argv[2]);\n\t\tengine = std::string(argv[3]);\n\t\tbackbone = std::string(argv[4]);\n\t}\n\telse if (std::string(argv[1]) == \"-d\" && argc == 4)\n\t{\n\t\tengine = std::string(argv[2]);\n\t\tbackbone = std::string(argv[3]);\n\t}\n\telse\n\t{\n\t\treturn false;\n\t}\n\treturn true;\n}\n\nint main(int argc, char **argv)\n{\n\tstd::string wtsPath = 
\"\";\n\tstd::string engine_name = \"\";\n\tstd::string backbone = \"\";\n\tif (!parse_args(argc, argv, wtsPath, engine_name, backbone))\n\t{\n\t\tstd::cerr << \"arguments not right!\" << std::endl;\n\t\tstd::cerr << \"./efficientnet -s [.wts] [.engine] [b0 b1 b2 b3 ... b7]  // serialize model to engine file\" << std::endl;\n\t\tstd::cerr << \"./efficientnet -d [.engine] [b0 b1 b2 b3 ... b7]   // deserialize engine file and run inference\" << std::endl;\n\t\treturn -1;\n\t}\n\tGlobalParams global_params = global_params_map[backbone];\n\t// create a model using the API directly and serialize it to a stream\n\tif (!wtsPath.empty())\n\t{\n\t\tIHostMemory *modelStream{nullptr};\n\t\tAPIToModel(MAX_BATCH_SIZE, &modelStream, wtsPath, block_args_list, global_params);\n\t\tassert(modelStream != nullptr);\n\n\t\tstd::ofstream p(engine_name, std::ios::binary);\n\t\tif (!p)\n\t\t{\n\t\t\tstd::cerr << \"could not open plan output file\" << std::endl;\n\t\t\treturn -1;\n\t\t}\n\t\tp.write(reinterpret_cast<const char *>(modelStream->data()), modelStream->size());\n\t\tmodelStream->destroy();\n\t\treturn 1;\n\t}\n\n\tchar *trtModelStream{nullptr};\n\tsize_t size{0};\n\n\tstd::ifstream file(engine_name, std::ios::binary);\n\tif (file.good())\n\t{\n\t\tfile.seekg(0, file.end);\n\t\tsize = file.tellg();\n\t\tfile.seekg(0, file.beg);\n\t\ttrtModelStream = new char[size];\n\t\tassert(trtModelStream);\n\t\tfile.read(trtModelStream, size);\n\t\tfile.close();\n\t}\n\telse\n\t{\n\t\tstd::cerr << \"could not open plan file\" << std::endl;\n\t\treturn -1;\n\t}\n\n\t// dummy input\n\tfloat *data = new float[3 * global_params.input_h * global_params.input_w];\n\tfor (int i = 0; i < 3 * global_params.input_h * global_params.input_w; i++)\n\t\tdata[i] = 0.1;\n\n\tIRuntime *runtime = createInferRuntime(gLogger);\n\tassert(runtime != nullptr);\n\tICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n\tassert(engine != nullptr);\n\tIExecutionContext *context = 
engine->createExecutionContext();\n\tassert(context != nullptr);\n\tdelete[] trtModelStream;\n\n\t// Run inference\n\tfloat *prob = new float[global_params.num_classes];\n\tfor (int i = 0; i < 100; i++)\n\t{\n\t\tauto start = std::chrono::system_clock::now();\n\t\tdoInference(*context, data, prob, 1, global_params);\n\t\tauto end = std::chrono::system_clock::now();\n\t\tstd::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\t}\n\tfor (unsigned int i = 0; i < 20; i++)\n\t{\n\t\tstd::cout << prob[i] << \", \";\n\t}\n\tstd::cout << std::endl;\n\t// Destroy the engine\n\tcontext->destroy();\n\tengine->destroy();\n\truntime->destroy();\n\t// Both buffers were allocated with new[], so release them with delete[]\n\tdelete[] data;\n\tdelete[] prob;\n\n\treturn 0;\n}\n"
  },
  {
    "path": "efficientnet/gen_wts.py",
    "content": "import torch\nimport struct\nfrom efficientnet_pytorch import EfficientNet\nmodel = EfficientNet.from_pretrained('efficientnet-b3')\n\nmodel.eval()\nf = open('efficientnet-b3.wts', 'w')\nf.write('{}\\n'.format(len(model.state_dict().keys())))\nfor k, v in model.state_dict().items():\n    vr = v.reshape(-1).cpu().numpy()\n    f.write('{} {} '.format(k, len(vr)))\n    for vv in vr:\n        f.write(' ')\n        f.write(struct.pack('>f',float(vv)).hex())\n    f.write('\\n')\nf.close()\n"
  },
  {
    "path": "efficientnet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "efficientnet/utils.hpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <math.h>\n#include <string>\n#include <algorithm>\nusing namespace nvinfer1;\n\n#define CHECK(status)                                          \\\n    do                                                         \\\n    {                                                          \\\n        auto ret = (status);                                   \\\n        if (ret != 0)                                          \\\n        {                                                      \\\n            std::cerr << \"Cuda failure: \" << ret << std::endl; \\\n            abort();                                           \\\n        }                                                      \\\n    } while (0)\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t *val = reinterpret_cast<uint32_t *>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return 
weightMap;\n}\n\nstruct BlockArgs\n{\n    int num_repeat;\n    int kernel_size;\n    int stride;\n    float expand_ratio;\n    int input_filters;\n    int output_filters;\n    float se_ratio;\n    bool id_skip;\n};\n\nstruct GlobalParams\n{\n    int input_h;\n    int input_w;\n    int num_classes;\n    float batch_norm_epsilon;\n    float width_coefficient;\n    float depth_coefficient;\n    int depth_divisor;\n    int min_depth;\n};\n\nint roundFilters(int filters, GlobalParams global_params)\n{\n    float multiplier = global_params.width_coefficient;\n    int divisor = global_params.depth_divisor;\n    int min_depth = global_params.min_depth;\n    filters = int(filters * multiplier);\n    if (min_depth < 0)\n    {\n        min_depth = divisor;\n    }\n    // follow the formula transferred from official TensorFlow implementation\n    int new_filters = std::max(min_depth, int(int(filters + divisor / 2) / divisor) * divisor);\n    if (new_filters < 0.9 * filters) // prevent rounding by more than 10%\n        new_filters += divisor;\n    return int(new_filters);\n}\n\nDimsHW calculateOutputImageSize(DimsHW image_size, int stride)\n{\n    int image_h = int(ceil(float(image_size.h()) / float(stride)));\n    int image_w = int(ceil(float(image_size.w()) / float(stride)));\n    return DimsHW{image_h, image_w};\n}\n\nint roundRepeats(int repeats, GlobalParams global_params)\n{\n    float multiplier = global_params.depth_coefficient;\n    // follow the formula transferred from official TensorFlow implementation\n    int new_repeats = int(ceil(multiplier * repeats));\n    return new_repeats;\n}\n\nIScaleLayer *addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, std::string lname, float eps)\n{\n    float *gamma = (float *)weightMap[lname + \".weight\"].values;\n    float *beta = (float *)weightMap[lname + \".bias\"].values;\n    float *mean = (float *)weightMap[lname + \".running_mean\"].values;\n    float *var = (float 
*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    float *scval = reinterpret_cast<float *>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++)\n    {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n\n    float *shval = reinterpret_cast<float *>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++)\n    {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float *>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++)\n    {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer *scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIConvolutionLayer *addSamePaddingConv2d(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, int outch, int kernel_size, int stride, int dilation, int groups, DimsHW image_size, std::string lname, bool bias = true)\n{\n    int ih = image_size.h();\n    int iw = image_size.w();\n    int kh = kernel_size;\n    int kw = kernel_size;\n    int sh = stride;\n    int sw = stride;\n    int oh = ceil(float(ih) / float(sh));\n    int ow = ceil(float(iw) / float(sw));\n    int pad_h = std::max((oh - 1) * stride + (kh - 1) * dilation + 1 - ih, 0);\n    int pad_w = std::max((ow - 1) * stride + (kw - 1) * dilation + 1 - iw, 0);\n    int pad_left = 0;\n    int pad_right = 0;\n    int pad_top = 0;\n    int pad_bottom = 0;\n    if (pad_h > 0 || pad_w > 0)\n    {\n        pad_left = int(pad_w / 2);\n        pad_right = pad_w - int(pad_w / 2);\n        pad_top = int(pad_h / 2);\n        pad_bottom = pad_h - 
int(pad_h / 2);\n    }\n    Weights bias_wt{DataType::kFLOAT, nullptr, 0};\n    if (bias)\n    {\n        bias_wt = weightMap[lname + \".bias\"];\n    }\n    IConvolutionLayer *conv = network->addConvolutionNd(input, outch, DimsHW{kh, kw}, weightMap[lname + \".weight\"], bias_wt);\n    conv->setPrePadding(DimsHW{pad_top, pad_left});\n    conv->setPostPadding(DimsHW{pad_bottom, pad_right});\n    conv->setStrideNd(DimsHW{stride, stride});\n    conv->setDilationNd(DimsHW{dilation, dilation});\n    conv->setNbGroups(groups);\n    return conv;\n}\n\nILayer *addSwish(INetworkDefinition *network, ITensor &input)\n{\n    //swish\n    auto *sigmoid = network->addActivation(input, ActivationType::kSIGMOID);\n    auto *ew = network->addElementWise(input, *sigmoid->getOutput(0), ElementWiseOperation::kPROD);\n    return ew;\n}\n\nITensor *MBConvBlock(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, std::string lname, BlockArgs block_args, GlobalParams global_params, DimsHW image_size)\n{\n    bool has_se = block_args.se_ratio > 0 && block_args.se_ratio <= 1;\n    bool id_skip = block_args.id_skip;\n    float bn_eps = global_params.batch_norm_epsilon;\n    int input_filters = block_args.input_filters;\n    int output_filters = block_args.output_filters;\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    ITensor *x = &input;\n    int inp = block_args.input_filters;\n    int oup = int(block_args.input_filters * block_args.expand_ratio);\n    // expand_ratio != 1\n    if (fabs(block_args.expand_ratio - 1) > 1e-5)\n    {\n        auto expand_conv = addSamePaddingConv2d(network, weightMap, input, oup, 1, 1, 1, 1, image_size, lname + \"._expand_conv\");\n        auto bn0 = addBatchNorm2d(network, weightMap, *expand_conv->getOutput(0), lname + \"._bn0\", bn_eps);\n        auto swish0 = addSwish(network, *bn0->getOutput(0));\n        x = swish0->getOutput(0);\n    }\n    int k = block_args.kernel_size;\n    int s = block_args.stride;\n    
auto depthwise_conv = addSamePaddingConv2d(network, weightMap, *x, oup, k, s, 1, oup, image_size, lname + \"._depthwise_conv\", false);\n    auto bn1 = addBatchNorm2d(network, weightMap, *depthwise_conv->getOutput(0), lname + \"._bn1\", bn_eps);\n    //swish\n    auto swish1 = addSwish(network, *bn1->getOutput(0));\n    x = swish1->getOutput(0);\n    image_size = calculateOutputImageSize(image_size, s);\n    if (has_se)\n    {\n        auto avg_pool = network->addPoolingNd(*x, PoolingType::kAVERAGE, image_size);\n        int num_squeezed_channels = std::max(1, int(input_filters * block_args.se_ratio));\n        auto se_reduce = addSamePaddingConv2d(network, weightMap, *avg_pool->getOutput(0), num_squeezed_channels, 1, 1, 1, 1, DimsHW{1, 1}, lname + \"._se_reduce\");\n\n        auto swish2 = addSwish(network, *se_reduce->getOutput(0));\n        auto se_expand = addSamePaddingConv2d(network, weightMap, *swish2->getOutput(0), oup, 1, 1, 1, 1, DimsHW{1, 1}, lname + \"._se_expand\");\n\n        auto *sigmoid = network->addActivation(*se_expand->getOutput(0), ActivationType::kSIGMOID);\n        auto *ew = network->addElementWise(*x, *sigmoid->getOutput(0), ElementWiseOperation::kPROD);\n        x = ew->getOutput(0);\n    }\n    int final_oup = block_args.output_filters;\n    auto project_conv = addSamePaddingConv2d(network, weightMap, *x, final_oup, 1, 1, 1, 1, image_size, lname + \"._project_conv\");\n\n    auto bn2 = addBatchNorm2d(network, weightMap, *project_conv->getOutput(0), lname + \"._bn2\", bn_eps);\n    x = bn2->getOutput(0);\n\n    if (id_skip && block_args.stride == 1 && input_filters == output_filters)\n    {\n        auto *ew = network->addElementWise(input, *x, ElementWiseOperation::kSUM);\n        x = ew->getOutput(0);\n    }\n    return x;\n}\n"
  },
  {
    "path": "ghostnet/README.md",
    "content": "# GhostNet\r\n\r\nGhostNetv1 architecture is from the paper \"GhostNet: More Features from Cheap Operations\" [(https://arxiv.org/abs/1911.11907)](https://arxiv.org/abs/1911.11907).\r\nGhostNetv2 architecture is from the paper \"GhostNetV2: Enhance Cheap Operation with Long-Range Attention\" [(https://arxiv.org/abs/2211.12905)](https://arxiv.org/abs/2211.12905).\r\n\r\nFor the PyTorch implementations, you can refer to [huawei-noah/ghostnet](https://github.com/huawei-noah/ghostnet).\r\n\r\nBoth versions use the following techniques in their TensorRT implementations:\r\n\r\n- **BatchNorm** layer is implemented by TensorRT's **Scale** layer.\r\n- **Ghost Modules** are used to generate more features from cheap operations, as described in the paper.\r\n- Replacing `IPoolingLayer` with `IReduceLayer` in TensorRT for Global Average Pooling. The `IReduceLayer` allows you to perform reduction operations (such as sum, average, max) over specified dimensions without being constrained by the kernel size limitations of pooling layers.\r\n\r\n## Project Structure\r\n\r\n```plaintext\r\nghostnet\r\n│\r\n├── ghostnetv1\r\n│   ├── CMakeLists.txt\r\n│   ├── gen_wts.py\r\n│   ├── ghostnetv1.cpp\r\n│   └── logging.h\r\n│\r\n├── ghostnetv2\r\n│   ├── CMakeLists.txt\r\n│   ├── gen_wts.py\r\n│   ├── ghostnetv2.cpp\r\n│   └── logging.h\r\n│\r\n└── README.md\r\n```\r\n\r\n## Steps to use GhostNet in TensorRT\r\n\r\n### 1. Generate `.wts` files for both GhostNetv1 and GhostNetv2\r\n\r\n```bash\r\n# For ghostnetv1\r\npython ghostnetv1/gen_wts.py\r\n\r\n# For ghostnetv2\r\npython ghostnetv2/gen_wts.py\r\n```\r\n\r\n### 2. Build the project\r\n\r\n```bash\r\ncd tensorrtx/ghostnet\r\nmkdir build\r\ncd build\r\ncmake ..\r\nmake\r\n```\r\n\r\n### 3. 
Serialize the models to engine files\r\n\r\nUse the following commands to build each network from the exported weights and serialize it to a TensorRT engine file (`ghostnetv1.engine` and `ghostnetv2.engine`):\r\n\r\n```bash\r\n# For ghostnetv1\r\nsudo ./ghostnetv1 -s\r\n\r\n# For ghostnetv2\r\nsudo ./ghostnetv2 -s\r\n```\r\n\r\n### 4. Run inference using the engine files\r\n\r\nOnce the engine files are generated, you can run inference with the following commands:\r\n\r\n```bash\r\n# For ghostnetv1\r\nsudo ./ghostnetv1 -d\r\n\r\n# For ghostnetv2\r\nsudo ./ghostnetv2 -d\r\n```\r\n\r\n### 5. Verify output\r\n\r\nCompare the output with the PyTorch implementation from [huawei-noah/ghostnet](https://github.com/huawei-noah/ghostnet) to confirm that the TensorRT results match the PyTorch model. For GhostNetv1, `gen_wts.py` prints the PyTorch output for a constant input of 10.0, the same value that `ghostnetv1.cpp` feeds to the engine.\r\n"
  },
  {
    "path": "ghostnet/ghostnetv1/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\r\n\r\nproject(ghostnetv1)\r\n\r\nadd_definitions(-std=c++11)\r\n\r\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\r\nset(CMAKE_CXX_STANDARD 11)\r\nset(CMAKE_BUILD_TYPE Debug)\r\n\r\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\r\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\r\n# cuda\r\ninclude_directories(/usr/local/cuda/include)\r\nlink_directories(/usr/local/cuda/lib64)\r\n# tensorrt\r\ninclude_directories(/usr/include/x86_64-linux-gnu/)\r\nlink_directories(/usr/lib/x86_64-linux-gnu/)\r\n\r\nadd_executable(ghostnetv1 ${PROJECT_SOURCE_DIR}/ghostnetv1.cpp)\r\ntarget_link_libraries(ghostnetv1 nvinfer)\r\ntarget_link_libraries(ghostnetv1 cudart)\r\n\r\nadd_definitions(-O2 -pthread)\r\n"
  },
  {
    "path": "ghostnet/ghostnetv1/gen_wts.py",
    "content": "\"\"\"\r\nCreates a GhostNet Model as defined in:\r\nGhostNet: More Features from Cheap Operations By Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, Chang Xu.\r\nhttps://arxiv.org/abs/1911.11907\r\nModified from https://github.com/d-li14/mobilenetv3.pytorch and https://github.com/rwightman/pytorch-image-models\r\n\"\"\"\r\nimport torch\r\nimport torch.nn as nn\r\nimport torch.onnx\r\nimport struct\r\nimport torch\r\nimport torch.nn.functional as F\r\nimport math\r\n\r\n\r\ndef _make_divisible(v, divisor, min_value=None):\r\n    \"\"\"\r\n    This function is taken from the original tf repo.\r\n    It ensures that all layers have a channel number that is divisible by 8\r\n    It can be seen here:\r\n    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py\r\n    \"\"\"\r\n    if min_value is None:\r\n        min_value = divisor\r\n    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)\r\n    # Make sure that round down does not go down by more than 10%.\r\n    if new_v < 0.9 * v:\r\n        new_v += divisor\r\n    return new_v\r\n\r\n\r\ndef hard_sigmoid(x, inplace: bool = False):\r\n    if inplace:\r\n        return x.add_(3.).clamp_(0., 6.).div_(6.)\r\n    else:\r\n        return F.relu6(x + 3.) 
/ 6.\r\n\r\n\r\nclass SqueezeExcite(nn.Module):\r\n    def __init__(self, in_chs, se_ratio=0.25, reduced_base_chs=None,\r\n                 act_layer=nn.ReLU, gate_fn=hard_sigmoid, divisor=4, **_):\r\n        super(SqueezeExcite, self).__init__()\r\n        self.gate_fn = gate_fn\r\n        reduced_chs = _make_divisible((reduced_base_chs or in_chs) * se_ratio, divisor)\r\n        self.avg_pool = nn.AdaptiveAvgPool2d(1)\r\n        self.conv_reduce = nn.Conv2d(in_chs, reduced_chs, 1, bias=True)\r\n        self.act1 = act_layer(inplace=True)\r\n        self.conv_expand = nn.Conv2d(reduced_chs, in_chs, 1, bias=True)\r\n\r\n    def forward(self, x):\r\n        x_se = self.avg_pool(x)\r\n        x_se = self.conv_reduce(x_se)\r\n        x_se = self.act1(x_se)\r\n        x_se = self.conv_expand(x_se)\r\n        x = x * self.gate_fn(x_se)\r\n        return x\r\n\r\n\r\nclass ConvBnAct(nn.Module):\r\n    def __init__(self, in_chs, out_chs, kernel_size,\r\n                 stride=1, act_layer=nn.ReLU):\r\n        super(ConvBnAct, self).__init__()\r\n        self.conv = nn.Conv2d(in_chs, out_chs, kernel_size, stride, kernel_size//2, bias=False)\r\n        self.bn1 = nn.BatchNorm2d(out_chs)\r\n        self.act1 = act_layer(inplace=True)\r\n\r\n    def forward(self, x):\r\n        x = self.conv(x)\r\n        x = self.bn1(x)\r\n        x = self.act1(x)\r\n        return x\r\n\r\n\r\nclass GhostModule(nn.Module):\r\n    def __init__(self, inp, oup, kernel_size=1, ratio=2, dw_size=3, stride=1, relu=True):\r\n        super(GhostModule, self).__init__()\r\n        self.oup = oup\r\n        init_channels = math.ceil(oup / ratio)\r\n        new_channels = init_channels*(ratio-1)\r\n\r\n        self.primary_conv = nn.Sequential(\r\n            nn.Conv2d(inp, init_channels, kernel_size, stride, kernel_size//2, bias=False),\r\n            nn.BatchNorm2d(init_channels),\r\n            nn.ReLU(inplace=True) if relu else nn.Sequential(),\r\n        )\r\n\r\n        self.cheap_operation = 
nn.Sequential(\r\n            nn.Conv2d(init_channels, new_channels, dw_size, 1, dw_size//2, groups=init_channels, bias=False),\r\n            nn.BatchNorm2d(new_channels),\r\n            nn.ReLU(inplace=True) if relu else nn.Sequential(),\r\n        )\r\n\r\n    def forward(self, x):\r\n        x1 = self.primary_conv(x)\r\n        x2 = self.cheap_operation(x1)\r\n        out = torch.cat([x1, x2], dim=1)\r\n        return out[:, :self.oup, :, :]\r\n\r\n\r\nclass GhostBottleneck(nn.Module):\r\n    \"\"\" Ghost bottleneck w/ optional SE\"\"\"\r\n\r\n    def __init__(self, in_chs, mid_chs, out_chs, dw_kernel_size=3,\r\n                 stride=1, act_layer=nn.ReLU, se_ratio=0.):\r\n        super(GhostBottleneck, self).__init__()\r\n        has_se = se_ratio is not None and se_ratio > 0.\r\n        self.stride = stride\r\n\r\n        # Point-wise expansion\r\n        self.ghost1 = GhostModule(in_chs, mid_chs, relu=True)\r\n\r\n        # Depth-wise convolution\r\n        if self.stride > 1:\r\n            self.conv_dw = nn.Conv2d(mid_chs, mid_chs, dw_kernel_size, stride=stride,\r\n                                     padding=(dw_kernel_size-1)//2, groups=mid_chs, bias=False)\r\n            self.bn_dw = nn.BatchNorm2d(mid_chs)\r\n\r\n        # Squeeze-and-excitation\r\n        if has_se:\r\n            self.se = SqueezeExcite(mid_chs, se_ratio=se_ratio)\r\n        else:\r\n            self.se = None\r\n\r\n        # Point-wise linear projection\r\n        self.ghost2 = GhostModule(mid_chs, out_chs, relu=False)\r\n\r\n        # shortcut\r\n        if (in_chs == out_chs and self.stride == 1):\r\n            self.shortcut = nn.Sequential()\r\n        else:\r\n            self.shortcut = nn.Sequential(\r\n                nn.Conv2d(in_chs, in_chs, dw_kernel_size, stride=stride,\r\n                          padding=(dw_kernel_size-1)//2, groups=in_chs, bias=False),\r\n                nn.BatchNorm2d(in_chs),\r\n                nn.Conv2d(in_chs, out_chs, 1, stride=1, padding=0, 
bias=False),\r\n                nn.BatchNorm2d(out_chs),\r\n            )\r\n\r\n    def forward(self, x):\r\n        residual = x\r\n\r\n        # 1st ghost bottleneck\r\n        x = self.ghost1(x)\r\n\r\n        # Depth-wise convolution\r\n        if self.stride > 1:\r\n            x = self.conv_dw(x)\r\n            x = self.bn_dw(x)\r\n\r\n        # Squeeze-and-excitation\r\n        if self.se is not None:\r\n            x = self.se(x)\r\n\r\n        # 2nd ghost bottleneck\r\n        x = self.ghost2(x)\r\n\r\n        x += self.shortcut(residual)\r\n        return x\r\n\r\n\r\nclass GhostNet(nn.Module):\r\n    def __init__(self, cfgs, num_classes=1000, width=1.0, dropout=0.2):\r\n        super(GhostNet, self).__init__()\r\n        # setting of inverted residual blocks\r\n        self.cfgs = cfgs\r\n        self.dropout = dropout\r\n\r\n        # building first layer\r\n        output_channel = _make_divisible(16 * width, 4)\r\n        self.conv_stem = nn.Conv2d(3, output_channel, 3, 2, 1, bias=False)\r\n        self.bn1 = nn.BatchNorm2d(output_channel)\r\n        self.act1 = nn.ReLU(inplace=True)\r\n        input_channel = output_channel\r\n\r\n        # building inverted residual blocks\r\n        stages = []\r\n        block = GhostBottleneck\r\n        for cfg in self.cfgs:\r\n            layers = []\r\n            for k, exp_size, c, se_ratio, s in cfg:\r\n                output_channel = _make_divisible(c * width, 4)\r\n                hidden_channel = _make_divisible(exp_size * width, 4)\r\n                layers.append(block(input_channel, hidden_channel, output_channel, k, s,\r\n                              se_ratio=se_ratio))\r\n                input_channel = output_channel\r\n            stages.append(nn.Sequential(*layers))\r\n\r\n        output_channel = _make_divisible(exp_size * width, 4)\r\n        stages.append(nn.Sequential(ConvBnAct(input_channel, output_channel, 1)))\r\n        input_channel = output_channel\r\n\r\n        self.blocks = 
nn.Sequential(*stages)\r\n\r\n        # building last several layers\r\n        output_channel = 1280\r\n        self.global_pool = nn.AdaptiveAvgPool2d((1, 1))\r\n        self.conv_head = nn.Conv2d(input_channel, output_channel, 1, 1, 0, bias=True)\r\n        self.act2 = nn.ReLU(inplace=True)\r\n        self.classifier = nn.Linear(output_channel, num_classes)\r\n\r\n    def forward(self, x):\r\n        x = self.conv_stem(x)\r\n        x = self.bn1(x)\r\n        x = self.act1(x)\r\n        x = self.blocks(x)\r\n        x = self.global_pool(x)\r\n        x = self.conv_head(x)\r\n        x = self.act2(x)\r\n        x = x.view(x.size(0), -1)\r\n        if self.dropout > 0.:\r\n            x = F.dropout(x, p=self.dropout, training=self.training)\r\n        x = self.classifier(x)\r\n        return x\r\n\r\n\r\ndef ghostnet(**kwargs):\r\n    \"\"\"\r\n    Constructs a GhostNet model\r\n    \"\"\"\r\n    cfgs = [\r\n        # k, t, c, SE, s\r\n        # stage1\r\n        [[3,  16,  16, 0, 1]],\r\n        # stage2\r\n        [[3,  48,  24, 0, 2]],\r\n        [[3,  72,  24, 0, 1]],\r\n        # stage3\r\n        [[5,  72,  40, 0.25, 2]],\r\n        [[5, 120,  40, 0.25, 1]],\r\n        # stage4\r\n        [[3, 240,  80, 0, 2]],\r\n        [[3, 200,  80, 0, 1],\r\n         [3, 184,  80, 0, 1],\r\n         [3, 184,  80, 0, 1],\r\n         [3, 480, 112, 0.25, 1],\r\n         [3, 672, 112, 0.25, 1]],\r\n        # stage5\r\n        [[5, 672, 160, 0.25, 2]],\r\n        [[5, 960, 160, 0, 1],\r\n         [5, 960, 160, 0.25, 1],\r\n         [5, 960, 160, 0, 1],\r\n         [5, 960, 160, 0.25, 1]]\r\n    ]\r\n    return GhostNet(cfgs, **kwargs)\r\n\r\n\r\ndef setup_seed(seed):\r\n    torch.manual_seed(seed)\r\n    torch.cuda.manual_seed_all(seed)\r\n    torch.backends.cudnn.deterministic = True\r\n\r\n\r\n# Function to export weights in the specified format\r\ndef export_weight(model):\r\n    f = open(\"ghostnetv1.weights\", 'w')\r\n    
f.write(\"{}\\n\".format(len(model.state_dict().keys())))\r\n\r\n    # Convert weights to hexadecimal format\r\n    for k, v in model.state_dict().items():\r\n        print('exporting ... {}: {}'.format(k, v.shape))\r\n\r\n        # Reshape the weights to 1D\r\n        vr = v.reshape(-1).cpu().numpy()\r\n        f.write(\"{} {}\".format(k, len(vr)))\r\n        for vv in vr:\r\n            f.write(\" \")\r\n            f.write(struct.pack(\">f\", float(vv)).hex())\r\n        f.write(\"\\n\")\r\n\r\n    f.close()\r\n\r\n\r\n# Function to evaluate the model (optional)\r\ndef eval_model(input, model):\r\n    output = model(input)\r\n    print(\"------from inference------\")\r\n    print(input)\r\n    print(output)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    setup_seed(1)\r\n\r\n    model = ghostnet(num_classes=1000, width=1.0, dropout=0.2)\r\n\r\n    model.eval()\r\n\r\n    # NCHW layout: height 256 and width 320, matching INPUT_H and INPUT_W in ghostnetv1.cpp\r\n    input = torch.full((32, 3, 256, 320), 10.0)\r\n\r\n    export_weight(model)\r\n\r\n    eval_model(input, model)\r\n"
  },
  {
    "path": "ghostnet/ghostnetv1/ghostnetv1.cpp",
    "content": "#include <chrono>\r\n#include <cmath>\r\n#include <fstream>\r\n#include <iostream>\r\n#include <map>\r\n#include <sstream>\r\n#include <vector>\r\n#include \"NvInfer.h\"\r\n#include \"cuda_runtime_api.h\"\r\n#include \"logging.h\"\r\n\r\nusing namespace std;\r\n\r\n#define CHECK(status)                                          \\\r\n    do {                                                       \\\r\n        auto ret = (status);                                   \\\r\n        if (ret != 0) {                                        \\\r\n            std::cerr << \"Cuda failure: \" << ret << std::endl; \\\r\n            abort();                                           \\\r\n        }                                                      \\\r\n    } while (0)\r\n\r\n// stuff we know about the network and the input/output blobs\r\nstatic const int INPUT_H = 256;\r\nstatic const int INPUT_W = 320;\r\nstatic const int OUTPUT_SIZE = 1000;\r\nstatic const int batchSize = 32;\r\n\r\nconst char* INPUT_BLOB_NAME = \"data\";\r\nconst char* OUTPUT_BLOB_NAME = \"prob\";\r\nusing namespace nvinfer1;\r\n\r\nstatic Logger gLogger;\r\n\r\n// Load weights from files shared with TensorRT samples.\r\n// TensorRT weight files have a simple space delimited format:\r\n// [type] [size] <data x size in hex>\r\nstd::map<std::string, Weights> loadWeights(const std::string file) {\r\n    std::cout << \"Loading weights: \" << file << std::endl;\r\n    std::map<std::string, Weights> weightMap;\r\n\r\n    // Open weights file\r\n    std::ifstream input(file);\r\n    if (!input.is_open()) {\r\n        std::cerr << \"Unable to load weight file.\" << std::endl;\r\n        exit(EXIT_FAILURE);\r\n    }\r\n\r\n    // Read number of weight blobs\r\n    int32_t count;\r\n    input >> count;\r\n    if (count <= 0) {\r\n        std::cerr << \"Invalid weight map file.\" << std::endl;\r\n        exit(EXIT_FAILURE);\r\n    }\r\n\r\n    while (count--) {\r\n        Weights wt{DataType::kFLOAT, 
nullptr, 0};\r\n        uint32_t size;\r\n\r\n        // Read name and size of blob\r\n        std::string name;\r\n        input >> name >> std::dec >> size;\r\n        wt.type = DataType::kFLOAT;\r\n\r\n        // Load blob\r\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\r\n        for (uint32_t x = 0, y = size; x < y; ++x) {\r\n            input >> std::hex >> val[x];\r\n        }\r\n        wt.values = val;\r\n\r\n        wt.count = size;\r\n        weightMap[name] = wt;\r\n    }\r\n\r\n    return weightMap;\r\n}\r\n\r\nint _make_divisible(int v, int divisor, int min_value = -1) {\r\n    if (min_value == -1) {\r\n        min_value = divisor;\r\n    }\r\n\r\n    int new_v = std::max(min_value, (v + divisor / 2) / divisor * divisor);\r\n\r\n    if (new_v < static_cast<int>(0.9 * v)) {\r\n        new_v += divisor;\r\n    }\r\n\r\n    return new_v;\r\n}\r\n\r\nILayer* hardSigmoid(INetworkDefinition* network, ITensor& input) {\r\n\r\n    IActivationLayer* scale_layer = network->addActivation(input, ActivationType::kHARD_SIGMOID);\r\n    assert(scale_layer);\r\n    // kHARD_SIGMOID computes max(0, min(1, alpha * x + beta)); these values reproduce\r\n    // the PyTorch gate relu6(x + 3) / 6 used in gen_wts.py\r\n    scale_layer->setAlpha(1.0f / 6.0f);\r\n    scale_layer->setBeta(0.5f);\r\n\r\n    return scale_layer;\r\n}\r\n\r\nIScaleLayer* addBatchNorm2d(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\r\n                            std::string lname, float eps) {\r\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\r\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\r\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\r\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\r\n    int len = weightMap[lname + \".running_var\"].count;\r\n\r\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights scale{DataType::kFLOAT, scval, len};\r\n\r\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; 
i++) {\r\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights shift{DataType::kFLOAT, shval, len};\r\n\r\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        pval[i] = 1.0;\r\n    }\r\n    Weights power{DataType::kFLOAT, pval, len};\r\n\r\n    weightMap[lname + \".scale\"] = scale;\r\n    weightMap[lname + \".shift\"] = shift;\r\n    weightMap[lname + \".power\"] = power;\r\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\r\n    assert(scale_1);\r\n    return scale_1;\r\n}\r\n\r\nIActivationLayer* convBnReluStem(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\r\n                                 int outch, std::string lname) {\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n\r\n    IConvolutionLayer* conv1 =\r\n            network->addConvolutionNd(input, outch, DimsHW{3, 3}, weightMap[lname + \".weight\"], emptywts);\r\n    assert(conv1);\r\n    conv1->setStrideNd(DimsHW{2, 2});   // Stride = 2\r\n    conv1->setPaddingNd(DimsHW{1, 1});  // Padding = 1\r\n\r\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\r\n\r\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\r\n    assert(relu1);\r\n\r\n    return relu1;\r\n}\r\n\r\nILayer* convBnAct(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\r\n                  int out_channels, std::string lname, ActivationType actType = ActivationType::kRELU) {\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n\r\n    IConvolutionLayer* conv =\r\n            network->addConvolutionNd(input, out_channels, DimsHW{1, 1}, weightMap[lname + \".conv.weight\"], emptywts);\r\n    assert(conv);\r\n    conv->setStrideNd(DimsHW{1, 1});\r\n\r\n    IScaleLayer* bn = addBatchNorm2d(network, weightMap, 
*conv->getOutput(0), lname + \".bn1\", 1e-5);\r\n\r\n    IActivationLayer* act = network->addActivation(*bn->getOutput(0), actType);\r\n    assert(act);\r\n\r\n    return act;\r\n}\r\n\r\n// Format a tensor's dimensions as \"(d0, d1, ...)\" for the debug prints below.\r\n// This helper is not defined elsewhere in this file or in logging.h.\r\nstd::string printTensorShape(ITensor* tensor) {\r\n    std::ostringstream oss;\r\n    Dims dims = tensor->getDimensions();\r\n    oss << \"(\";\r\n    for (int i = 0; i < dims.nbDims; ++i) {\r\n        oss << dims.d[i] << (i + 1 < dims.nbDims ? \", \" : \"\");\r\n    }\r\n    oss << \")\";\r\n    return oss.str();\r\n}\r\n\r\nILayer* squeezeExcite(INetworkDefinition* network, ITensor& input, std::map<std::string, Weights>& weightMap,\r\n                      int in_chs, float se_ratio = 0.25, std::string lname = \"\", float eps = 1e-5) {\r\n\r\n    // Global average pooling over H and W (axes 2 and 3), keeping the reduced dimensions\r\n    IReduceLayer* avg_pool = network->addReduce(input, ReduceOperation::kAVG, 1 << 2 | 1 << 3, true);\r\n    assert(avg_pool);\r\n\r\n    // Reduce channels with 1x1 convolution\r\n    int reduced_chs = _make_divisible(static_cast<int>(in_chs * se_ratio), 4);\r\n    IConvolutionLayer* conv_reduce =\r\n            network->addConvolutionNd(*avg_pool->getOutput(0), reduced_chs, DimsHW{1, 1},\r\n                                      weightMap[lname + \".conv_reduce.weight\"], weightMap[lname + \".conv_reduce.bias\"]);\r\n    assert(conv_reduce);\r\n\r\n    IActivationLayer* relu1 = network->addActivation(*conv_reduce->getOutput(0), ActivationType::kRELU);\r\n    assert(relu1);\r\n\r\n    // Expand channels back with another 1x1 convolution\r\n    IConvolutionLayer* conv_expand =\r\n            network->addConvolutionNd(*relu1->getOutput(0), in_chs, DimsHW{1, 1},\r\n                                      weightMap[lname + \".conv_expand.weight\"], weightMap[lname + \".conv_expand.bias\"]);\r\n    assert(conv_expand);\r\n    cout << \"SE conv_expand -> \" << printTensorShape(conv_expand->getOutput(0)) << endl;\r\n\r\n    // Apply hardSigmoid function\r\n    ILayer* hard_sigmoid = hardSigmoid(network, *conv_expand->getOutput(0));\r\n    cout << \"hard_sigmoid conv_expand -> \" << printTensorShape(hard_sigmoid->getOutput(0)) << endl;\r\n\r\n    // Elementwise multiplication of input and gated SE output\r\n    IElementWiseLayer* scale = network->addElementWise(input, *hard_sigmoid->getOutput(0), ElementWiseOperation::kPROD);\r\n    assert(scale);\r\n\r\n    return 
scale;\r\n}\r\n\r\nILayer* ghostModule(INetworkDefinition* network, ITensor& input, std::map<std::string, Weights>& weightMap, int inp,\r\n                    int oup, int kernel_size = 1, int ratio = 2, int dw_size = 3, int stride = 1, bool relu = true,\r\n                    std::string lname = \"\") {\r\n    // Integer ceiling division, matching math.ceil(oup / ratio) in gen_wts.py;\r\n    // std::ceil(oup / ratio) would truncate before rounding up\r\n    int init_channels = (oup + ratio - 1) / ratio;\r\n    int new_channels = init_channels * (ratio - 1);\r\n\r\n    // Primary convolution\r\n    IConvolutionLayer* primary_conv = network->addConvolutionNd(input, init_channels, DimsHW{kernel_size, kernel_size},\r\n                                                                weightMap[lname + \".primary_conv.0.weight\"], Weights{});\r\n    primary_conv->setStrideNd(DimsHW{stride, stride});\r\n    primary_conv->setPaddingNd(DimsHW{kernel_size / 2, kernel_size / 2});\r\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *primary_conv->getOutput(0), lname + \".primary_conv.1\", 1e-5);\r\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\r\n\r\n    // As in the PyTorch GhostModule, the cheap operation consumes the primary branch after its optional ReLU\r\n    ITensor* primary_out = relu ? relu1->getOutput(0) : bn1->getOutput(0);\r\n\r\n    // Cheap operation (Depthwise Convolution)\r\n    IConvolutionLayer* cheap_conv =\r\n            network->addConvolutionNd(*primary_out, new_channels, DimsHW{dw_size, dw_size},\r\n                                      weightMap[lname + \".cheap_operation.0.weight\"], Weights{});\r\n    cheap_conv->setStrideNd(DimsHW{1, 1});\r\n    cheap_conv->setPaddingNd(DimsHW{dw_size / 2, dw_size / 2});\r\n    cheap_conv->setNbGroups(init_channels);\r\n    IScaleLayer* bn2 =\r\n            addBatchNorm2d(network, weightMap, *cheap_conv->getOutput(0), lname + \".cheap_operation.1\", 1e-5);\r\n\r\n    // Optional ReLU on the cheap branch\r\n    IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\r\n\r\n    // Initialize inputs array based on the `relu` flag\r\n    std::vector<ITensor*> inputs_vec;\r\n    if (relu) {\r\n        inputs_vec = {relu1->getOutput(0), relu2->getOutput(0)};\r\n    } else 
{\r\n        inputs_vec = {bn1->getOutput(0), bn2->getOutput(0)};\r\n    }\r\n\r\n    ITensor* inputs[] = {inputs_vec[0], inputs_vec[1]};\r\n    IConcatenationLayer* concat = network->addConcatenation(inputs, 2);\r\n    std::cout << printTensorShape(concat->getOutput(0)) << std::endl;\r\n\r\n    // Slice the output to keep only the first `oup` channels\r\n    Dims start{4, {0, 0, 0, 0}};  // Starting from batch=0, channel=0, height=0, width=0\r\n    Dims size{4,\r\n              {concat->getOutput(0)->getDimensions().d[0], oup, concat->getOutput(0)->getDimensions().d[2],\r\n               concat->getOutput(0)\r\n                       ->getDimensions()\r\n                       .d[3]}};     // Keep all batches, first `oup` channels, all heights and widths\r\n    Dims stride_{4, {1, 1, 1, 1}};  // Stride is 1 for all dimensions\r\n\r\n    ISliceLayer* slice = network->addSlice(*concat->getOutput(0), start, size, stride_);\r\n    cout << \"slice\" << printTensorShape(slice->getOutput(0)) << endl;\r\n\r\n    return slice;\r\n}\r\n\r\nILayer* ghostBottleneck(INetworkDefinition* network, ITensor& input, std::map<std::string, Weights>& weightMap,\r\n                        int in_chs, int mid_chs, int out_chs, int dw_kernel_size = 3, int stride = 1,\r\n                        float se_ratio = 0.0f, std::string lname = \"\") {\r\n    ILayer* ghost1 = ghostModule(network, input, weightMap, in_chs, mid_chs, 1, 2, 3, 1, true, lname + \".ghost1\");\r\n\r\n    ILayer* depthwise_conv = ghost1;\r\n    if (stride > 1) {\r\n        IConvolutionLayer* conv_dw =\r\n                network->addConvolutionNd(*ghost1->getOutput(0), mid_chs, DimsHW{dw_kernel_size, dw_kernel_size},\r\n                                          weightMap[lname + \".conv_dw.weight\"], Weights{});\r\n        conv_dw->setStrideNd(DimsHW{stride, stride});\r\n        conv_dw->setPaddingNd(DimsHW{(dw_kernel_size - 1) / 2, (dw_kernel_size - 1) / 2});\r\n        conv_dw->setNbGroups(mid_chs);  // Depth-wise 
convolution\r\n        IScaleLayer* bn_dw = addBatchNorm2d(network, weightMap, *conv_dw->getOutput(0), lname + \".bn_dw\", 1e-5);\r\n        depthwise_conv = bn_dw;\r\n    }\r\n\r\n    ILayer* se_layer = depthwise_conv;\r\n    if (se_ratio > 0.0f) {\r\n        se_layer = squeezeExcite(network, *depthwise_conv->getOutput(0), weightMap, mid_chs, se_ratio, lname + \".se\");\r\n    }\r\n\r\n    ILayer* ghost2 = ghostModule(network, *se_layer->getOutput(0), weightMap, mid_chs, out_chs, 1, 2, 3, 1, false,\r\n                                 lname + \".ghost2\");\r\n\r\n    ILayer* shortcut_layer = nullptr;\r\n    if (in_chs == out_chs && stride == 1) {\r\n        shortcut_layer = network->addIdentity(input);\r\n    } else {\r\n        IConvolutionLayer* conv_shortcut_dw =\r\n                network->addConvolutionNd(input, in_chs, DimsHW{dw_kernel_size, dw_kernel_size},\r\n                                          weightMap[lname + \".shortcut.0.weight\"], Weights{});\r\n\r\n        conv_shortcut_dw->setStrideNd(DimsHW{stride, stride});\r\n        conv_shortcut_dw->setPaddingNd(DimsHW{(dw_kernel_size - 1) / 2, (dw_kernel_size - 1) / 2});\r\n        conv_shortcut_dw->setNbGroups(in_chs);  // Depth-wise convolution\r\n        IScaleLayer* bn_shortcut_dw =\r\n                addBatchNorm2d(network, weightMap, *conv_shortcut_dw->getOutput(0), lname + \".shortcut.1\", 1e-5);\r\n\r\n        IConvolutionLayer* conv_shortcut_pw =\r\n                network->addConvolutionNd(*bn_shortcut_dw->getOutput(0), out_chs, DimsHW{1, 1},\r\n                                          weightMap[lname + \".shortcut.2.weight\"], Weights{});\r\n        IScaleLayer* bn_shortcut_pw =\r\n                addBatchNorm2d(network, weightMap, *conv_shortcut_pw->getOutput(0), lname + \".shortcut.3\", 1e-5);\r\n        shortcut_layer = bn_shortcut_pw;\r\n    }\r\n\r\n    IElementWiseLayer* ew_sum =\r\n            network->addElementWise(*ghost2->getOutput(0), *shortcut_layer->getOutput(0), 
ElementWiseOperation::kSUM);\r\n\r\n    return ew_sum;\r\n}\r\n\r\nICudaEngine* createEngine(IBuilder* builder, IBuilderConfig* config, DataType dt) {\r\n\r\n    INetworkDefinition* network =\r\n            builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\r\n\r\n    // Create input tensor of shape {batchSize, 3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\r\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims4{batchSize, 3, INPUT_H, INPUT_W});\r\n    assert(data);\r\n\r\n    std::map<std::string, Weights> weightMap = loadWeights(\"../ghostnetv1.weights\");\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n\r\n    // Conv Stem\r\n    IActivationLayer* conv_stem = convBnReluStem(network, weightMap, *data, 16, \"conv_stem\");\r\n\r\n    ILayer* current_layer = conv_stem;\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 16, 16, 16, 3, 1, 0, \"blocks.0.0\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 16, 48, 24, 3, 2, 0, \"blocks.1.0\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 24, 72, 24, 3, 1, 0, \"blocks.2.0\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 24, 72, 40, 5, 2, 0.25, \"blocks.3.0\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 40, 120, 40, 5, 1, 0.25, \"blocks.4.0\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 40, 240, 80, 3, 2, 0, \"blocks.5.0\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 80, 200, 80, 3, 1, 0, \"blocks.6.0\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 80, 184, 80, 3, 1, 0, 
\"blocks.6.1\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 80, 184, 80, 3, 1, 0, \"blocks.6.2\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 80, 480, 112, 3, 1, 0.25, \"blocks.6.3\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 112, 672, 112, 3, 1, 0.25, \"blocks.6.4\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 112, 672, 160, 5, 2, 0.25, \"blocks.7.0\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 160, 960, 160, 5, 1, 0, \"blocks.8.0\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 160, 960, 160, 5, 1, 0.25, \"blocks.8.1\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 160, 960, 160, 5, 1, 0, \"blocks.8.2\");\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 160, 960, 160, 5, 1, 0.25, \"blocks.8.3\");\r\n\r\n    // Apply ConvBnAct\r\n    current_layer = convBnAct(network, weightMap, *current_layer->getOutput(0), 960, \"blocks.9.0\");\r\n    // Global Average Pooling\r\n    IReduceLayer* global_pool =\r\n            network->addReduce(*current_layer->getOutput(0), ReduceOperation::kAVG, 1 << 2 | 1 << 3, true);\r\n    assert(global_pool);\r\n\r\n    // Conv Head\r\n    IConvolutionLayer* conv_head = network->addConvolutionNd(\r\n            *global_pool->getOutput(0), 1280, DimsHW{1, 1}, weightMap[\"conv_head.weight\"], weightMap[\"conv_head.bias\"]);\r\n    IActivationLayer* act2 = network->addActivation(*conv_head->getOutput(0), ActivationType::kRELU);\r\n\r\n    // Fully Connected Layer (Classifier)\r\n    IFullyConnectedLayer* classifier = network->addFullyConnected(\r\n            
*act2->getOutput(0), 1000, weightMap[\"classifier.weight\"], weightMap[\"classifier.bias\"]);\r\n    classifier->getOutput(0)->setName(OUTPUT_BLOB_NAME);\r\n    network->markOutput(*classifier->getOutput(0));\r\n\r\n    // Build engine\r\n    config->setMaxWorkspaceSize(1 << 24);\r\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\r\n\r\n    // Don't need the network any more\r\n    network->destroy();\r\n\r\n    // Release host memory\r\n    for (auto& mem : weightMap) {\r\n        free((void*)(mem.second.values));\r\n    }\r\n\r\n    return engine;\r\n}\r\n\r\nvoid APIToModel(IHostMemory** modelStream) {\r\n    // Create builder\r\n    IBuilder* builder = createInferBuilder(gLogger);\r\n    IBuilderConfig* config = builder->createBuilderConfig();\r\n\r\n    // Create model to populate the network, then set the outputs and create an engine\r\n    ICudaEngine* engine = createEngine(builder, config, DataType::kFLOAT);\r\n    assert(engine != nullptr);\r\n\r\n    // Serialize the engine\r\n    (*modelStream) = engine->serialize();\r\n\r\n    // Close everything down\r\n    engine->destroy();\r\n    config->destroy();\r\n    builder->destroy();\r\n}\r\n\r\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\r\n    const ICudaEngine& engine = context.getEngine();\r\n\r\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\r\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\r\n\r\n    // Pointers to input and output device buffers to pass to engine.\r\n    void* buffers[2];\r\n\r\n    // Create GPU buffers on device\r\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\r\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\r\n\r\n    // Create stream\r\n    cudaStream_t stream;\r\n    CHECK(cudaStreamCreate(&stream));\r\n\r\n    // DMA input batch data to device, infer on the batch 
asynchronously, and DMA output back to host\r\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float),\r\n                          cudaMemcpyHostToDevice, stream));\r\n    context.enqueueV2(buffers, stream, nullptr);\r\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost,\r\n                          stream));\r\n    cudaStreamSynchronize(stream);\r\n\r\n    // Release stream and buffers\r\n    cudaStreamDestroy(stream);\r\n    CHECK(cudaFree(buffers[inputIndex]));\r\n    CHECK(cudaFree(buffers[outputIndex]));\r\n}\r\n\r\nint main(int argc, char** argv) {\r\n    if (argc != 2) {\r\n        std::cerr << \"arguments not right!\" << std::endl;\r\n        std::cerr << \"./ghostnetv1 -s   // serialize model to plan file\" << std::endl;\r\n        std::cerr << \"./ghostnetv1 -d   // deserialize plan file and run inference\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // create a model using the API directly and serialize it to a stream\r\n    char* trtModelStream{nullptr};\r\n    size_t size{0};\r\n\r\n    if (std::string(argv[1]) == \"-s\") {\r\n        IHostMemory* modelStream{nullptr};\r\n        APIToModel(&modelStream);\r\n        assert(modelStream != nullptr);\r\n\r\n        std::ofstream p(\"ghostnetv1.engine\", std::ios::binary);\r\n        if (!p) {\r\n            std::cerr << \"could not open plan output file\" << std::endl;\r\n            return -1;\r\n        }\r\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\r\n        modelStream->destroy();\r\n        return 0;\r\n    } else if (std::string(argv[1]) == \"-d\") {\r\n        std::ifstream file(\"ghostnetv1.engine\", std::ios::binary);\r\n        if (file.good()) {\r\n            file.seekg(0, file.end);\r\n            size = file.tellg();\r\n            file.seekg(0, file.beg);\r\n            trtModelStream = new char[size];\r\n         
   assert(trtModelStream);\r\n            file.read(trtModelStream, size);\r\n            file.close();\r\n        } else {\r\n            // Without a readable plan file there is nothing to deserialize\r\n            std::cerr << \"could not open ghostnetv1.engine, run with -s first\" << std::endl;\r\n            return -1;\r\n        }\r\n    } else {\r\n        return -1;\r\n    }\r\n\r\n    float* data = new float[batchSize * 3 * INPUT_H * INPUT_W];\r\n    for (int i = 0; i < batchSize * 3 * INPUT_H * INPUT_W; i++)\r\n        data[i] = 10.0f;\r\n\r\n    float* prob = new float[batchSize * OUTPUT_SIZE];\r\n\r\n    IRuntime* runtime = createInferRuntime(gLogger);\r\n    assert(runtime != nullptr);\r\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\r\n    assert(engine != nullptr);\r\n    IExecutionContext* context = engine->createExecutionContext();\r\n    assert(context != nullptr);\r\n    delete[] trtModelStream;\r\n\r\n    doInference(*context, data, prob, batchSize);\r\n\r\n    std::cout << \"\\nOutput:\\n\\n\";\r\n    for (int i = 0; i < batchSize; i++) {\r\n        std::cout << \"Batch \" << i << \":\\n\";\r\n        for (unsigned int j = 0; j < OUTPUT_SIZE; j++) {\r\n            std::cout << prob[i * OUTPUT_SIZE + j] << \", \";\r\n            if (j % 10 == 0)\r\n                std::cout << j / 10 << std::endl;\r\n        }\r\n        std::cout << \"\\n\";\r\n    }\r\n\r\n    context->destroy();\r\n    engine->destroy();\r\n    runtime->destroy();\r\n    delete[] data;\r\n    delete[] prob;\r\n\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "ghostnet/ghostnetv1/logging.h",
    "content": "/*\r\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\r\n *\r\n * Licensed under the Apache License, Version 2.0 (the \"License\");\r\n * you may not use this file except in compliance with the License.\r\n * You may obtain a copy of the License at\r\n *\r\n *     http://www.apache.org/licenses/LICENSE-2.0\r\n *\r\n * Unless required by applicable law or agreed to in writing, software\r\n * distributed under the License is distributed on an \"AS IS\" BASIS,\r\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\r\n * See the License for the specific language governing permissions and\r\n * limitations under the License.\r\n */\r\n\r\n#ifndef TENSORRT_LOGGING_H\r\n#define TENSORRT_LOGGING_H\r\n\r\n#include <cassert>\r\n#include <ctime>\r\n#include <iomanip>\r\n#include <iostream>\r\n#include <ostream>\r\n#include <sstream>\r\n#include <string>\r\n#include \"NvInferRuntimeCommon.h\"\r\n\r\nusing Severity = nvinfer1::ILogger::Severity;\r\n\r\nclass LogStreamConsumerBuffer : public std::stringbuf {\r\n   public:\r\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\r\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\r\n\r\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\r\n\r\n    ~LogStreamConsumerBuffer() {\r\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\r\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\r\n        // if the pointer to the beginning is not equal to the pointer to the current position,\r\n        // call putOutput() to log the output to the stream\r\n        if (pbase() != pptr()) {\r\n            putOutput();\r\n        }\r\n    }\r\n\r\n    // synchronizes the stream buffer and returns 0 on success\r\n    // synchronizing the stream buffer consists of inserting the buffer 
contents into the stream,\r\n    // resetting the buffer and flushing the stream\r\n    virtual int sync() {\r\n        putOutput();\r\n        return 0;\r\n    }\r\n\r\n    void putOutput() {\r\n        if (mShouldLog) {\r\n            // prepend timestamp\r\n            std::time_t timestamp = std::time(nullptr);\r\n            tm* tm_local = std::localtime(&timestamp);\r\n            std::cout << \"[\";\r\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\r\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\r\n            // std::stringbuf::str() gets the string contents of the buffer\r\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\r\n            mOutput << mPrefix << str();\r\n            // set the buffer to empty\r\n            str(\"\");\r\n            // flush the stream\r\n            mOutput.flush();\r\n        }\r\n    }\r\n\r\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\r\n\r\n   private:\r\n    std::ostream& mOutput;\r\n    std::string mPrefix;\r\n    bool mShouldLog;\r\n};\r\n\r\n//!\r\n//! \\class LogStreamConsumerBase\r\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\r\n//!\r\nclass LogStreamConsumerBase {\r\n   public:\r\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\r\n        : mBuffer(stream, prefix, shouldLog) {}\r\n\r\n   protected:\r\n    LogStreamConsumerBuffer mBuffer;\r\n};\r\n\r\n//!\r\n//! 
\\class LogStreamConsumer\r\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\r\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\r\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\r\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\r\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\r\n//!  Please do not change the order of the parent classes.\r\n//!\r\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\r\n   public:\r\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\r\n    //!  Reportable severity determines if the messages are severe enough to be logged.\r\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\r\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\r\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\r\n          ,\r\n          mShouldLog(severity <= reportableSeverity),\r\n          mSeverity(severity) {}\r\n\r\n    LogStreamConsumer(LogStreamConsumer&& other)\r\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\r\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\r\n          ,\r\n          mShouldLog(other.mShouldLog),\r\n          mSeverity(other.mSeverity) {}\r\n\r\n    void setReportableSeverity(Severity reportableSeverity) {\r\n        mShouldLog = mSeverity <= reportableSeverity;\r\n        mBuffer.setShouldLog(mShouldLog);\r\n    }\r\n\r\n   private:\r\n    static std::ostream& severityOstream(Severity severity) {\r\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\r\n    }\r\n\r\n    static std::string severityPrefix(Severity severity) {\r\n        switch (severity) {\r\n            case Severity::kINTERNAL_ERROR:\r\n                return \"[F] \";\r\n            case Severity::kERROR:\r\n                return \"[E] \";\r\n            case Severity::kWARNING:\r\n                return \"[W] \";\r\n            case Severity::kINFO:\r\n                return \"[I] \";\r\n            case Severity::kVERBOSE:\r\n                return \"[V] \";\r\n            default:\r\n                assert(0);\r\n                return \"\";\r\n        }\r\n    }\r\n\r\n    bool mShouldLog;\r\n    Severity mSeverity;\r\n};\r\n\r\n//! \\class Logger\r\n//!\r\n//! \\brief Class which manages logging of TensorRT tools and samples\r\n//!\r\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\r\n//! and supports logging two types of messages:\r\n//!\r\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\r\n//! - Test pass/fail messages\r\n//!\r\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\r\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\r\n//!\r\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\r\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\r\n//!\r\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\r\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\r\n//! library and messages coming from the sample.\r\n//!\r\n//! 
In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\r\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\r\n//! object.\r\n\r\nclass Logger : public nvinfer1::ILogger {\r\n   public:\r\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\r\n\r\n    //!\r\n    //! \\enum TestResult\r\n    //! \\brief Represents the state of a given test\r\n    //!\r\n    enum class TestResult {\r\n        kRUNNING,  //!< The test is running\r\n        kPASSED,   //!< The test passed\r\n        kFAILED,   //!< The test failed\r\n        kWAIVED    //!< The test was waived\r\n    };\r\n\r\n    //!\r\n    //! \\brief Forward-compatible method for retrieving the nvinfer1::ILogger associated with this Logger\r\n    //! \\return The nvinfer1::ILogger associated with this Logger\r\n    //!\r\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\r\n    //! we can eliminate the inheritance of Logger from ILogger\r\n    //!\r\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\r\n\r\n    //!\r\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\r\n    //!\r\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\r\n    //! inheritance from nvinfer1::ILogger\r\n    //!\r\n    void log(Severity severity, const char* msg) noexcept override {\r\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\r\n    }\r\n\r\n    //!\r\n    //! \\brief Method for controlling the verbosity of logging output\r\n    //!\r\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\r\n    //!\r\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\r\n\r\n    //!\r\n    //! 
\\brief Opaque handle that holds logging information for a particular test\r\n    //!\r\n    //! This object is an opaque handle to information used by the Logger to print test results.\r\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\r\n    //! with Logger::reportTest{Start,End}().\r\n    //!\r\n    class TestAtom {\r\n       public:\r\n        TestAtom(TestAtom&&) = default;\r\n\r\n       private:\r\n        friend class Logger;\r\n\r\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\r\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\r\n\r\n        bool mStarted;\r\n        std::string mName;\r\n        std::string mCmdline;\r\n    };\r\n\r\n    //!\r\n    //! \\brief Define a test for logging\r\n    //!\r\n    //! \\param[in] name The name of the test.  This should be a string starting with\r\n    //!                  \"TensorRT\" and containing dot-separated strings containing\r\n    //!                  the characters [A-Za-z0-9_].\r\n    //!                  For example, \"TensorRT.sample_googlenet\"\r\n    //! \\param[in] cmdline The command line used to reproduce the test\r\n    //!\r\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\r\n    //!\r\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\r\n        return TestAtom(false, name, cmdline);\r\n    }\r\n\r\n    //!\r\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\r\n    //!        as input\r\n    //!\r\n    //! \\param[in] name The name of the test\r\n    //! \\param[in] argc The number of command-line arguments\r\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\r\n    //!\r\n    //! 
\\return a TestAtom that can be used in Logger::reportTest{Start,End}().\r\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\r\n        auto cmdline = genCmdlineString(argc, argv);\r\n        return defineTest(name, cmdline);\r\n    }\r\n\r\n    //!\r\n    //! \\brief Report that a test has started.\r\n    //!\r\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\r\n    //!\r\n    //! \\param[in] testAtom The handle to the test that has started\r\n    //!\r\n    static void reportTestStart(TestAtom& testAtom) {\r\n        reportTestResult(testAtom, TestResult::kRUNNING);\r\n        assert(!testAtom.mStarted);\r\n        testAtom.mStarted = true;\r\n    }\r\n\r\n    //!\r\n    //! \\brief Report that a test has ended.\r\n    //!\r\n    //! \\pre reportTestStart() has been called for the given testAtom\r\n    //!\r\n    //! \\param[in] testAtom The handle to the test that has ended\r\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\r\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\r\n    //!\r\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\r\n        assert(result != TestResult::kRUNNING);\r\n        assert(testAtom.mStarted);\r\n        reportTestResult(testAtom, result);\r\n    }\r\n\r\n    static int reportPass(const TestAtom& testAtom) {\r\n        reportTestEnd(testAtom, TestResult::kPASSED);\r\n        return EXIT_SUCCESS;\r\n    }\r\n\r\n    static int reportFail(const TestAtom& testAtom) {\r\n        reportTestEnd(testAtom, TestResult::kFAILED);\r\n        return EXIT_FAILURE;\r\n    }\r\n\r\n    static int reportWaive(const TestAtom& testAtom) {\r\n        reportTestEnd(testAtom, TestResult::kWAIVED);\r\n        return EXIT_SUCCESS;\r\n    }\r\n\r\n    static int reportTest(const TestAtom& testAtom, bool pass) {\r\n        return pass ? 
reportPass(testAtom) : reportFail(testAtom);\r\n    }\r\n\r\n    Severity getReportableSeverity() const { return mReportableSeverity; }\r\n\r\n   private:\r\n    //!\r\n    //! \\brief returns an appropriate string for prefixing a log message with the given severity\r\n    //!\r\n    static const char* severityPrefix(Severity severity) {\r\n        switch (severity) {\r\n            case Severity::kINTERNAL_ERROR:\r\n                return \"[F] \";\r\n            case Severity::kERROR:\r\n                return \"[E] \";\r\n            case Severity::kWARNING:\r\n                return \"[W] \";\r\n            case Severity::kINFO:\r\n                return \"[I] \";\r\n            case Severity::kVERBOSE:\r\n                return \"[V] \";\r\n            default:\r\n                assert(0);\r\n                return \"\";\r\n        }\r\n    }\r\n\r\n    //!\r\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\r\n    //!\r\n    static const char* testResultString(TestResult result) {\r\n        switch (result) {\r\n            case TestResult::kRUNNING:\r\n                return \"RUNNING\";\r\n            case TestResult::kPASSED:\r\n                return \"PASSED\";\r\n            case TestResult::kFAILED:\r\n                return \"FAILED\";\r\n            case TestResult::kWAIVED:\r\n                return \"WAIVED\";\r\n            default:\r\n                assert(0);\r\n                return \"\";\r\n        }\r\n    }\r\n\r\n    //!\r\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\r\n    //!\r\n    static std::ostream& severityOstream(Severity severity) {\r\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\r\n    }\r\n\r\n    //!\r\n    //! 
\\brief method that implements logging test results\r\n    //!\r\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\r\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\r\n                                         << testAtom.mCmdline << std::endl;\r\n    }\r\n\r\n    //!\r\n    //! \\brief generate a command line string from the given (argc, argv) values\r\n    //!\r\n    static std::string genCmdlineString(int argc, char const* const* argv) {\r\n        std::stringstream ss;\r\n        for (int i = 0; i < argc; i++) {\r\n            if (i > 0)\r\n                ss << \" \";\r\n            ss << argv[i];\r\n        }\r\n        return ss.str();\r\n    }\r\n\r\n    Severity mReportableSeverity;\r\n};\r\n\r\nnamespace {\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\r\n}\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\r\n}\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\r\n}\r\n\r\n//!\r\n//! 
\\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\r\n}\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\r\n//!        (\"fatal\" severity)\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\r\n}\r\n\r\n}  // anonymous namespace\r\n\r\n#endif  // TENSORRT_LOGGING_H\r\n"
  },
  {
    "path": "ghostnet/ghostnetv2/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\r\n\r\nproject(ghostnetv2)\r\n\r\nadd_definitions(-std=c++11)\r\n\r\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use the static CUDA runtime\" OFF)\r\nset(CMAKE_CXX_STANDARD 11)\r\nset(CMAKE_BUILD_TYPE Debug)\r\n\r\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\r\n# include and link dirs of cuda and tensorrt, adapt them if yours are different\r\n# cuda\r\ninclude_directories(/usr/local/cuda/include)\r\nlink_directories(/usr/local/cuda/lib64)\r\n# tensorrt\r\ninclude_directories(/usr/include/x86_64-linux-gnu/)\r\nlink_directories(/usr/lib/x86_64-linux-gnu/)\r\n\r\nadd_executable(ghostnetv2 ${PROJECT_SOURCE_DIR}/ghostnetv2.cpp)\r\ntarget_link_libraries(ghostnetv2 nvinfer)\r\ntarget_link_libraries(ghostnetv2 cudart)\r\n\r\nadd_definitions(-O2 -pthread)\r\n"
  },
  {
    "path": "ghostnet/ghostnetv2/gen_wts.py",
    "content": "import torch\r\nimport torch.nn as nn\r\nimport torch.onnx\r\nimport struct\r\n\r\nimport torch\r\nimport torch.nn.functional as F\r\nimport math\r\n\r\nfrom timm.models.registry import register_model\r\n\r\n\r\ndef _make_divisible(v, divisor, min_value=None):\r\n    \"\"\"\r\n    This function is taken from the original tf repo.\r\n    It ensures that all layers have a channel number that is divisible by 8\r\n    It can be seen here:\r\n    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py\r\n    \"\"\"\r\n    if min_value is None:\r\n        min_value = divisor\r\n    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)\r\n    # Make sure that round down does not go down by more than 10%.\r\n    if new_v < 0.9 * v:\r\n        new_v += divisor\r\n    return new_v\r\n\r\n\r\ndef hard_sigmoid(x, inplace: bool = False):\r\n    if inplace:\r\n        return x.add_(3.).clamp_(0., 6.).div_(6.)\r\n    else:\r\n        return F.relu6(x + 3.) 
/ 6.\r\n\r\n\r\nclass SqueezeExcite(nn.Module):\r\n    def __init__(self, in_chs, se_ratio=0.25, reduced_base_chs=None,\r\n                 act_layer=nn.ReLU, gate_fn=hard_sigmoid, divisor=4, **_):\r\n        super(SqueezeExcite, self).__init__()\r\n        self.gate_fn = gate_fn\r\n        reduced_chs = _make_divisible((reduced_base_chs or in_chs) * se_ratio, divisor)\r\n        self.avg_pool = nn.AdaptiveAvgPool2d(1)\r\n        self.conv_reduce = nn.Conv2d(in_chs, reduced_chs, 1, bias=True)\r\n        self.act1 = act_layer(inplace=True)\r\n        self.conv_expand = nn.Conv2d(reduced_chs, in_chs, 1, bias=True)\r\n\r\n    def forward(self, x):\r\n        x_se = self.avg_pool(x)\r\n        x_se = self.conv_reduce(x_se)\r\n        x_se = self.act1(x_se)\r\n        x_se = self.conv_expand(x_se)\r\n        x = x * self.gate_fn(x_se)\r\n        return x\r\n\r\n\r\nclass ConvBnAct(nn.Module):\r\n    def __init__(self, in_chs, out_chs, kernel_size,\r\n                 stride=1, act_layer=nn.ReLU):\r\n        super(ConvBnAct, self).__init__()\r\n        self.conv = nn.Conv2d(in_chs, out_chs, kernel_size, stride, kernel_size//2, bias=False)\r\n        self.bn1 = nn.BatchNorm2d(out_chs)\r\n        self.act1 = act_layer(inplace=True)\r\n\r\n    def forward(self, x):\r\n        x = self.conv(x)\r\n        x = self.bn1(x)\r\n        x = self.act1(x)\r\n        return x\r\n\r\n\r\nclass GhostModuleV2(nn.Module):\r\n    def __init__(self, inp, oup, kernel_size=1, ratio=2, dw_size=3, stride=1, relu=True, mode=None, args=None):\r\n        super(GhostModuleV2, self).__init__()\r\n        self.mode = mode\r\n        self.gate_fn = nn.Sigmoid()\r\n\r\n        if self.mode in ['original']:\r\n            self.oup = oup\r\n            init_channels = math.ceil(oup / ratio)\r\n            new_channels = init_channels*(ratio-1)\r\n            self.primary_conv = nn.Sequential(\r\n                nn.Conv2d(inp, init_channels, kernel_size, stride, kernel_size//2, bias=False),\r\n           
     nn.BatchNorm2d(init_channels),\r\n                nn.ReLU(inplace=True) if relu else nn.Sequential(),\r\n            )\r\n            self.cheap_operation = nn.Sequential(\r\n                nn.Conv2d(init_channels, new_channels, dw_size, 1, dw_size//2, groups=init_channels, bias=False),\r\n                nn.BatchNorm2d(new_channels),\r\n                nn.ReLU(inplace=True) if relu else nn.Sequential(),\r\n            )\r\n        elif self.mode in ['attn']:\r\n            self.oup = oup\r\n            init_channels = math.ceil(oup / ratio)\r\n            new_channels = init_channels*(ratio-1)\r\n            self.primary_conv = nn.Sequential(\r\n                nn.Conv2d(inp, init_channels, kernel_size, stride, kernel_size//2, bias=False),\r\n                nn.BatchNorm2d(init_channels),\r\n                nn.ReLU(inplace=True) if relu else nn.Sequential(),\r\n            )\r\n            self.cheap_operation = nn.Sequential(\r\n                nn.Conv2d(init_channels, new_channels, dw_size, 1, dw_size//2, groups=init_channels, bias=False),\r\n                nn.BatchNorm2d(new_channels),\r\n                nn.ReLU(inplace=True) if relu else nn.Sequential(),\r\n            )\r\n            self.short_conv = nn.Sequential(\r\n                nn.Conv2d(inp, oup, kernel_size, stride, kernel_size//2, bias=False),\r\n                nn.BatchNorm2d(oup),\r\n                nn.Conv2d(oup, oup, kernel_size=(1, 5), stride=1, padding=(0, 2), groups=oup, bias=False),\r\n                nn.BatchNorm2d(oup),\r\n                nn.Conv2d(oup, oup, kernel_size=(5, 1), stride=1, padding=(2, 0), groups=oup, bias=False),\r\n                nn.BatchNorm2d(oup),\r\n            )\r\n\r\n    def forward(self, x):\r\n        if self.mode in ['original']:\r\n            x1 = self.primary_conv(x)\r\n            x2 = self.cheap_operation(x1)\r\n            out = torch.cat([x1, x2], dim=1)\r\n            return out[:, :self.oup, :, :]\r\n        elif self.mode in ['attn']:\r\n        
    res = self.short_conv(F.avg_pool2d(x, kernel_size=2, stride=2))\r\n            x1 = self.primary_conv(x)\r\n            x2 = self.cheap_operation(x1)\r\n            out = torch.cat([x1, x2], dim=1)\r\n            return out[:, :self.oup, :, :]*F.interpolate(self.gate_fn(res),\r\n                                                         size=(out.shape[-2], out.shape[-1]), mode='nearest')\r\n\r\n\r\nclass GhostBottleneckV2(nn.Module):\r\n\r\n    def __init__(self, in_chs, mid_chs, out_chs, dw_kernel_size=3,\r\n                 stride=1, act_layer=nn.ReLU, se_ratio=0., layer_id=None, args=None):\r\n        super(GhostBottleneckV2, self).__init__()\r\n        has_se = se_ratio is not None and se_ratio > 0.\r\n        self.stride = stride\r\n\r\n        # Point-wise expansion\r\n        if layer_id <= 1:\r\n            self.ghost1 = GhostModuleV2(in_chs, mid_chs, relu=True, mode='original', args=args)\r\n        else:\r\n            self.ghost1 = GhostModuleV2(in_chs, mid_chs, relu=True, mode='attn', args=args)\r\n\r\n        # Depth-wise convolution\r\n        if self.stride > 1:\r\n            self.conv_dw = nn.Conv2d(mid_chs, mid_chs, dw_kernel_size, stride=stride,\r\n                                     padding=(dw_kernel_size-1)//2, groups=mid_chs, bias=False)\r\n            self.bn_dw = nn.BatchNorm2d(mid_chs)\r\n\r\n        # Squeeze-and-excitation\r\n        if has_se:\r\n            self.se = SqueezeExcite(mid_chs, se_ratio=se_ratio)\r\n        else:\r\n            self.se = None\r\n\r\n        self.ghost2 = GhostModuleV2(mid_chs, out_chs, relu=False, mode='original', args=args)\r\n\r\n        # shortcut\r\n        if (in_chs == out_chs and self.stride == 1):\r\n            self.shortcut = nn.Sequential()\r\n        else:\r\n            self.shortcut = nn.Sequential(\r\n                nn.Conv2d(in_chs, in_chs, dw_kernel_size, stride=stride,\r\n                          padding=(dw_kernel_size-1)//2, groups=in_chs, bias=False),\r\n                
nn.BatchNorm2d(in_chs),\r\n                nn.Conv2d(in_chs, out_chs, 1, stride=1, padding=0, bias=False),\r\n                nn.BatchNorm2d(out_chs),\r\n            )\r\n\r\n    def forward(self, x):\r\n        residual = x\r\n        x = self.ghost1(x)\r\n        if self.stride > 1:\r\n            x = self.conv_dw(x)\r\n            x = self.bn_dw(x)\r\n        if self.se is not None:\r\n            x = self.se(x)\r\n        x = self.ghost2(x)\r\n        x += self.shortcut(residual)\r\n        return x\r\n\r\n\r\nclass GhostNetV2(nn.Module):\r\n    def __init__(self, cfgs, num_classes=1000, width=1.0, dropout=0.2, block=GhostBottleneckV2, args=None):\r\n        super(GhostNetV2, self).__init__()\r\n        self.cfgs = cfgs\r\n        self.dropout = dropout\r\n\r\n        # building first layer\r\n        output_channel = _make_divisible(16 * width, 4)\r\n        self.conv_stem = nn.Conv2d(3, output_channel, 3, 2, 1, bias=False)\r\n        self.bn1 = nn.BatchNorm2d(output_channel)\r\n        self.act1 = nn.ReLU(inplace=True)\r\n        input_channel = output_channel\r\n\r\n        # building inverted residual blocks\r\n        stages = []\r\n        layer_id = 0\r\n        for cfg in self.cfgs:\r\n            layers = []\r\n            for k, exp_size, c, se_ratio, s in cfg:\r\n                output_channel = _make_divisible(c * width, 4)\r\n                hidden_channel = _make_divisible(exp_size * width, 4)\r\n                layers.append(block(input_channel, hidden_channel, output_channel, k, s,\r\n                                    se_ratio=se_ratio, layer_id=layer_id, args=args))\r\n                input_channel = output_channel\r\n                layer_id += 1\r\n            stages.append(nn.Sequential(*layers))\r\n\r\n        output_channel = _make_divisible(exp_size * width, 4)\r\n        stages.append(nn.Sequential(ConvBnAct(input_channel, output_channel, 1)))\r\n        input_channel = output_channel\r\n\r\n        self.blocks = 
nn.Sequential(*stages)\r\n\r\n        # building last several layers\r\n        output_channel = 1280\r\n        self.global_pool = nn.AdaptiveAvgPool2d((1, 1))\r\n        self.conv_head = nn.Conv2d(input_channel, output_channel, 1, 1, 0, bias=True)\r\n        self.act2 = nn.ReLU(inplace=True)\r\n        self.classifier = nn.Linear(output_channel, num_classes)\r\n\r\n    def forward(self, x):\r\n        x = self.conv_stem(x)\r\n        x = self.bn1(x)\r\n        x = self.act1(x)\r\n        x = self.blocks(x)\r\n        x = self.global_pool(x)\r\n        x = self.conv_head(x)\r\n        x = self.act2(x)\r\n        x = x.view(x.size(0), -1)\r\n        if self.dropout > 0.:\r\n            x = F.dropout(x, p=self.dropout, training=self.training)\r\n        x = self.classifier(x)\r\n        return x\r\n\r\n\r\n@register_model\r\ndef ghostnetv2(**kwargs):\r\n    cfgs = [\r\n        # k, t, c, SE, s\r\n        [[3,  16,  16, 0, 1]],\r\n        [[3,  48,  24, 0, 2]],\r\n        [[3,  72,  24, 0, 1]],\r\n        [[5,  72,  40, 0.25, 2]],\r\n        [[5, 120,  40, 0.25, 1]],\r\n        [[3, 240,  80, 0, 2]],\r\n        [[3, 200,  80, 0, 1],\r\n         [3, 184,  80, 0, 1],\r\n         [3, 184,  80, 0, 1],\r\n         [3, 480, 112, 0.25, 1],\r\n         [3, 672, 112, 0.25, 1]],\r\n        [[5, 672, 160, 0.25, 2]],\r\n        [[5, 960, 160, 0, 1],\r\n         [5, 960, 160, 0.25, 1],\r\n         [5, 960, 160, 0, 1],\r\n         [5, 960, 160, 0.25, 1]]\r\n    ]\r\n    return GhostNetV2(cfgs, num_classes=kwargs['num_classes'],\r\n                      width=kwargs['width'],\r\n                      dropout=kwargs['dropout'],\r\n                      args=kwargs['args'])\r\n\r\n\r\ndef setup_seed(seed):\r\n    torch.manual_seed(seed)\r\n    torch.cuda.manual_seed_all(seed)\r\n    torch.backends.cudnn.deterministic = True\r\n\r\n\r\n# Function to export weights in the specified format\r\ndef export_weight(model):\r\n    f = open(\"ghostnetv2.weights\", 'w')\r\n    
f.write(\"{}\\n\".format(len(model.state_dict().keys())))\r\n\r\n    # Write one line per tensor: name, element count, then the values\r\n    for k, v in model.state_dict().items():\r\n        print('exporting ... {}: {}'.format(k, v.shape))\r\n\r\n        # Reshape the weights to 1D\r\n        vr = v.reshape(-1).cpu().numpy()\r\n        f.write(\"{} {}\".format(k, len(vr)))\r\n        for vv in vr:\r\n            f.write(\" \")\r\n            # Each value is written as the hex of its big-endian float32 bytes\r\n            f.write(struct.pack(\">f\", float(vv)).hex())\r\n        f.write(\"\\n\")\r\n\r\n    f.close()\r\n\r\n\r\n# Function to evaluate the model (optional)\r\ndef eval_model(input, model):\r\n    # Gradients are not needed for a plain forward pass\r\n    with torch.no_grad():\r\n        output = model(input)\r\n    print(\"------from inference------\")\r\n    print(input)\r\n    print(output)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    setup_seed(1)\r\n\r\n    # Create an instance of GhostNetV2\r\n    model = ghostnetv2(width=1.0, num_classes=1000, dropout=0.2, args=None)\r\n    model.eval()\r\n\r\n    # Dummy input tensor (adjust the shape as per your requirement)\r\n    input = torch.full((32, 3, 320, 256), 10.0)\r\n\r\n    # Export the model weights\r\n    export_weight(model)\r\n\r\n    # Evaluate the model\r\n    eval_model(input, model)\r\n"
  },
  {
    "path": "ghostnet/ghostnetv2/ghostnetv2.cpp",
    "content": "#include <chrono>\r\n#include <cmath>\r\n#include <fstream>\r\n#include <iostream>\r\n#include <map>\r\n#include <sstream>\r\n#include <vector>\r\n#include \"NvInfer.h\"\r\n#include \"cuda_runtime_api.h\"\r\n#include \"logging.h\"\r\n\r\nusing namespace std;\r\n\r\n#define CHECK(status)                                          \\\r\n    do {                                                       \\\r\n        auto ret = (status);                                   \\\r\n        if (ret != 0) {                                        \\\r\n            std::cerr << \"Cuda failure: \" << ret << std::endl; \\\r\n            abort();                                           \\\r\n        }                                                      \\\r\n    } while (0)\r\n\r\n// Define input/output parameters\r\nstatic const int INPUT_H = 256;\r\nstatic const int INPUT_W = 320;\r\nstatic const int OUTPUT_SIZE = 1000;\r\nstatic const int batchSize = 32;\r\n\r\nconst char* INPUT_BLOB_NAME = \"data\";\r\nconst char* OUTPUT_BLOB_NAME = \"prob\";\r\nusing namespace nvinfer1;\r\n\r\nstatic Logger gLogger;\r\n\r\n// Load weight file\r\nstd::map<std::string, Weights> loadWeights(const std::string file) {\r\n    std::cout << \"Loading weights: \" << file << std::endl;\r\n    std::map<std::string, Weights> weightMap;\r\n\r\n    // Open the weight file\r\n    std::ifstream input(file);\r\n    if (!input.is_open()) {\r\n        std::cerr << \"Unable to load weight file.\" << std::endl;\r\n        exit(EXIT_FAILURE);\r\n    }\r\n\r\n    // Read the number of weights\r\n    int32_t count;\r\n    input >> count;\r\n    if (count <= 0) {\r\n        std::cerr << \"Invalid weight map file.\" << std::endl;\r\n        exit(EXIT_FAILURE);\r\n    }\r\n\r\n    while (count--) {\r\n        Weights wt{DataType::kFLOAT, nullptr, 0};\r\n        uint32_t size;\r\n\r\n        // Read the name and size\r\n        std::string name;\r\n        input >> name >> std::dec >> size;\r\n        
wt.type = DataType::kFLOAT;\r\n\r\n        // Load weight data\r\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\r\n        for (uint32_t x = 0, y = size; x < y; ++x) {\r\n            input >> std::hex >> val[x];\r\n        }\r\n        wt.values = val;\r\n\r\n        wt.count = size;\r\n        weightMap[name] = wt;\r\n    }\r\n\r\n    return weightMap;\r\n}\r\n\r\nint _make_divisible(int v, int divisor, int min_value = -1) {\r\n    // If min_value is not specified, set it to divisor\r\n    if (min_value == -1) {\r\n        min_value = divisor;\r\n    }\r\n\r\n    // Calculate new channel size to be divisible by divisor\r\n    int new_v = std::max(min_value, (v + divisor / 2) / divisor * divisor);\r\n\r\n    // Ensure rounding down does not reduce by more than 10%.\r\n    // Compare in floating point: truncating 0.9 * v to int first would miss cases such as v = 9, new_v = 8.\r\n    if (new_v < 0.9 * v) {\r\n        new_v += divisor;\r\n    }\r\n\r\n    return new_v;\r\n}\r\n\r\nILayer* hardSigmoid(INetworkDefinition* network, ITensor& input) {\r\n    // Apply Hard Sigmoid activation function\r\n    IActivationLayer* scale_layer = network->addActivation(input, ActivationType::kHARD_SIGMOID);\r\n    assert(scale_layer);\r\n    // TensorRT computes clip(alpha * x + beta, 0, 1). Set both explicitly so the layer equals\r\n    // relu6(x + 3) / 6, assuming that is the hard_sigmoid used by the PyTorch model.\r\n    scale_layer->setAlpha(1.0f / 6.0f);\r\n    scale_layer->setBeta(0.5f);\r\n\r\n    // Return the output after activation\r\n    return scale_layer;\r\n}\r\n\r\nIScaleLayer* addBatchNorm2d(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\r\n                            std::string lname, float eps) {\r\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\r\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\r\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\r\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\r\n    int len = weightMap[lname + \".running_var\"].count;\r\n\r\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights scale{DataType::kFLOAT, scval, 
len};\r\n\r\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights shift{DataType::kFLOAT, shval, len};\r\n\r\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        pval[i] = 1.0;\r\n    }\r\n    Weights power{DataType::kFLOAT, pval, len};\r\n\r\n    weightMap[lname + \".scale\"] = scale;\r\n    weightMap[lname + \".shift\"] = shift;\r\n    weightMap[lname + \".power\"] = power;\r\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\r\n    assert(scale_1);\r\n    return scale_1;\r\n}\r\n\r\nIActivationLayer* convBnReluStem(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\r\n                                 int outch, std::string lname) {\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n\r\n    // Step 1: Convolution layer\r\n    IConvolutionLayer* conv1 =\r\n            network->addConvolutionNd(input, outch, DimsHW{3, 3}, weightMap[lname + \".weight\"], emptywts);\r\n    assert(conv1);\r\n    conv1->setStrideNd(DimsHW{2, 2});   // Stride of 2\r\n    conv1->setPaddingNd(DimsHW{1, 1});  // Padding of 1\r\n\r\n    // Step 2: Batch normalization layer\r\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\r\n\r\n    // Step 3: ReLU activation\r\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\r\n    assert(relu1);\r\n\r\n    return relu1;  // Return the result after activation\r\n}\r\n\r\nILayer* convBnAct(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\r\n                  int out_channels, std::string lname, ActivationType actType = ActivationType::kRELU) {\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n\r\n    // 
Add convolution layer\r\n    IConvolutionLayer* conv =\r\n            network->addConvolutionNd(input, out_channels, DimsHW{1, 1}, weightMap[lname + \".conv.weight\"], emptywts);\r\n    assert(conv);\r\n    conv->setStrideNd(DimsHW{1, 1});\r\n\r\n    // Add batch normalization layer\r\n    IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn1\", 1e-5);\r\n\r\n    // Add activation layer (default is ReLU)\r\n    IActivationLayer* act = network->addActivation(*bn->getOutput(0), actType);\r\n    assert(act);\r\n\r\n    return act;\r\n}\r\n\r\nILayer* squeezeExcite(INetworkDefinition* network, ITensor& input, std::map<std::string, Weights>& weightMap,\r\n                      int in_chs, float se_ratio = 0.25, std::string lname = \"\", float eps = 1e-5) {\r\n    // Step 1: Global average pooling\r\n    IReduceLayer* avg_pool = network->addReduce(input, ReduceOperation::kAVG, 1 << 2 | 1 << 3, true);\r\n    assert(avg_pool);\r\n\r\n    // Step 2: 1x1 convolution for dimension reduction\r\n    int reduced_chs = _make_divisible(static_cast<int>(in_chs * se_ratio), 4);\r\n    IConvolutionLayer* conv_reduce =\r\n            network->addConvolutionNd(*avg_pool->getOutput(0), reduced_chs, DimsHW{1, 1},\r\n                                      weightMap[lname + \".conv_reduce.weight\"], weightMap[lname + \".conv_reduce.bias\"]);\r\n    assert(conv_reduce);\r\n\r\n    // Step 3: ReLU activation\r\n    IActivationLayer* relu1 = network->addActivation(*conv_reduce->getOutput(0), ActivationType::kRELU);\r\n    assert(relu1);\r\n\r\n    // Step 4: 1x1 convolution for dimension expansion\r\n    IConvolutionLayer* conv_expand =\r\n            network->addConvolutionNd(*relu1->getOutput(0), in_chs, DimsHW{1, 1},\r\n                                      weightMap[lname + \".conv_expand.weight\"], weightMap[lname + \".conv_expand.bias\"]);\r\n    assert(conv_expand);\r\n\r\n    // Step 5: Hard Sigmoid activation\r\n    ILayer* hard_sigmoid = 
hardSigmoid(network, *conv_expand->getOutput(0));\r\n\r\n    // Step 6: Multiply input by the output of SE module\r\n    IElementWiseLayer* scale = network->addElementWise(input, *hard_sigmoid->getOutput(0), ElementWiseOperation::kPROD);\r\n    assert(scale);\r\n\r\n    return scale;\r\n}\r\n\r\nILayer* ghostModuleV2(INetworkDefinition* network, ITensor& input, std::map<std::string, Weights>& weightMap, int inp,\r\n                      int oup, int kernel_size = 1, int ratio = 2, int dw_size = 3, int stride = 1, bool relu = true,\r\n                      std::string lname = \"\", std::string mode = \"original\") {\r\n    // Integer ceil(oup / ratio); a plain oup / ratio would floor before std::ceil is applied\r\n    int init_channels = (oup + ratio - 1) / ratio;\r\n    int new_channels = init_channels * (ratio - 1);\r\n\r\n    // Primary convolution\r\n    IConvolutionLayer* primary_conv = network->addConvolutionNd(input, init_channels, DimsHW{kernel_size, kernel_size},\r\n                                                                weightMap[lname + \".primary_conv.0.weight\"], Weights{});\r\n    primary_conv->setStrideNd(DimsHW{stride, stride});\r\n    primary_conv->setPaddingNd(DimsHW{kernel_size / 2, kernel_size / 2});\r\n\r\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *primary_conv->getOutput(0), lname + \".primary_conv.1\", 1e-5);\r\n\r\n    ITensor* act1_output = bn1->getOutput(0);\r\n    if (relu) {\r\n        IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\r\n        act1_output = relu1->getOutput(0);\r\n    }\r\n\r\n    // Cheap operation\r\n    IConvolutionLayer* cheap_conv =\r\n            network->addConvolutionNd(*act1_output, new_channels, DimsHW{dw_size, dw_size},\r\n                                      weightMap[lname + \".cheap_operation.0.weight\"], Weights{});\r\n    cheap_conv->setStrideNd(DimsHW{1, 1});\r\n    cheap_conv->setPaddingNd(DimsHW{dw_size / 2, dw_size / 2});\r\n    cheap_conv->setNbGroups(init_channels);\r\n\r\n    IScaleLayer* bn2 =\r\n            
addBatchNorm2d(network, weightMap, *cheap_conv->getOutput(0), lname + \".cheap_operation.1\", 1e-5);\r\n\r\n    ITensor* act2_output = bn2->getOutput(0);\r\n    if (relu) {\r\n        IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\r\n        act2_output = relu2->getOutput(0);\r\n    }\r\n\r\n    // Concatenate\r\n    ITensor* concat_inputs[] = {act1_output, act2_output};\r\n    IConcatenationLayer* concat = network->addConcatenation(concat_inputs, 2);\r\n\r\n    // Slice to oup channels\r\n    Dims start{4, {0, 0, 0, 0}};\r\n    Dims size = concat->getOutput(0)->getDimensions();\r\n    size.d[1] = oup;\r\n    Dims stride_{4, {1, 1, 1, 1}};\r\n\r\n    ISliceLayer* slice = network->addSlice(*concat->getOutput(0), start, size, stride_);\r\n\r\n    ITensor* out = slice->getOutput(0);\r\n\r\n    if (mode == \"original\") {\r\n        return slice;\r\n    } else if (mode == \"attn\") {\r\n        // Attention mechanism\r\n        // Average pooling\r\n        IPoolingLayer* avg_pool = network->addPoolingNd(input, PoolingType::kAVERAGE, DimsHW{2, 2});\r\n        avg_pool->setStrideNd(DimsHW{2, 2});\r\n\r\n        ITensor* avg_pooled = avg_pool->getOutput(0);\r\n\r\n        // Short convolution branch\r\n        IConvolutionLayer* short_conv1 =\r\n                network->addConvolutionNd(*avg_pooled, oup, DimsHW{kernel_size, kernel_size},\r\n                                          weightMap[lname + \".short_conv.0.weight\"], Weights{});\r\n        short_conv1->setStrideNd(DimsHW{1, 1});\r\n        short_conv1->setPaddingNd(DimsHW{kernel_size / 2, kernel_size / 2});\r\n        IScaleLayer* short_bn1 =\r\n                addBatchNorm2d(network, weightMap, *short_conv1->getOutput(0), lname + \".short_conv.1\", 1e-5);\r\n\r\n        // Conv with kernel size (1,5)\r\n        IConvolutionLayer* short_conv2 = network->addConvolutionNd(\r\n                *short_bn1->getOutput(0), oup, DimsHW{1, 5}, weightMap[lname + 
\".short_conv.2.weight\"], Weights{});\r\n        short_conv2->setStrideNd(DimsHW{1, 1});\r\n        short_conv2->setPaddingNd(DimsHW{0, 2});\r\n        short_conv2->setNbGroups(oup);\r\n        IScaleLayer* short_bn2 =\r\n                addBatchNorm2d(network, weightMap, *short_conv2->getOutput(0), lname + \".short_conv.3\", 1e-5);\r\n\r\n        // Conv with kernel size (5,1)\r\n        IConvolutionLayer* short_conv3 = network->addConvolutionNd(\r\n                *short_bn2->getOutput(0), oup, DimsHW{5, 1}, weightMap[lname + \".short_conv.4.weight\"], Weights{});\r\n        short_conv3->setStrideNd(DimsHW{1, 1});\r\n        short_conv3->setPaddingNd(DimsHW{2, 0});\r\n        short_conv3->setNbGroups(oup);\r\n        IScaleLayer* short_bn3 =\r\n                addBatchNorm2d(network, weightMap, *short_conv3->getOutput(0), lname + \".short_conv.5\", 1e-5);\r\n\r\n        ITensor* res = short_bn3->getOutput(0);\r\n\r\n        // Sigmoid activation\r\n        IActivationLayer* gate = network->addActivation(*res, ActivationType::kSIGMOID);\r\n\r\n        // Upsample to the same size as out\r\n        IResizeLayer* gate_upsampled = network->addResize(*gate->getOutput(0));\r\n        gate_upsampled->setResizeMode(ResizeMode::kNEAREST);\r\n        Dims out_dims = out->getDimensions();\r\n        gate_upsampled->setOutputDimensions(out_dims);\r\n\r\n        // Element-wise multiplication\r\n        IElementWiseLayer* scaled_out =\r\n                network->addElementWise(*out, *gate_upsampled->getOutput(0), ElementWiseOperation::kPROD);\r\n\r\n        return scaled_out;\r\n    } else {\r\n        std::cerr << \"Invalid mode: \" << mode << \" in ghostModuleV2\" << std::endl;\r\n        return nullptr;\r\n    }\r\n}\r\n\r\nILayer* ghostBottleneck(INetworkDefinition* network, ITensor& input, std::map<std::string, Weights>& weightMap,\r\n                        int in_chs, int mid_chs, int out_chs, int dw_kernel_size = 3, int stride = 1,\r\n                        float 
se_ratio = 0.0f, std::string lname = \"\", int layer_id = 0) {\r\n    // Determine mode based on layer_id\r\n    std::string mode = (layer_id <= 1) ? \"original\" : \"attn\";\r\n\r\n    // ghost1\r\n    ILayer* ghost1 =\r\n            ghostModuleV2(network, input, weightMap, in_chs, mid_chs, 1, 2, 3, 1, true, lname + \".ghost1\", mode);\r\n\r\n    ILayer* depthwise_conv = ghost1;\r\n    if (stride > 1) {\r\n        IConvolutionLayer* conv_dw =\r\n                network->addConvolutionNd(*ghost1->getOutput(0), mid_chs, DimsHW{dw_kernel_size, dw_kernel_size},\r\n                                          weightMap[lname + \".conv_dw.weight\"], Weights{});\r\n        conv_dw->setStrideNd(DimsHW{stride, stride});\r\n        conv_dw->setPaddingNd(DimsHW{(dw_kernel_size - 1) / 2, (dw_kernel_size - 1) / 2});\r\n        conv_dw->setNbGroups(mid_chs);\r\n        IScaleLayer* bn_dw = addBatchNorm2d(network, weightMap, *conv_dw->getOutput(0), lname + \".bn_dw\", 1e-5);\r\n        depthwise_conv = bn_dw;\r\n    }\r\n\r\n    ILayer* se_layer = depthwise_conv;\r\n    if (se_ratio > 0.0f) {\r\n        se_layer = squeezeExcite(network, *depthwise_conv->getOutput(0), weightMap, mid_chs, se_ratio, lname + \".se\");\r\n    }\r\n\r\n    // ghost2 uses original mode\r\n    ILayer* ghost2 = ghostModuleV2(network, *se_layer->getOutput(0), weightMap, mid_chs, out_chs, 1, 2, 3, 1, false,\r\n                                   lname + \".ghost2\", \"original\");\r\n\r\n    ILayer* shortcut_layer = nullptr;\r\n    if (in_chs == out_chs && stride == 1) {\r\n        shortcut_layer = network->addIdentity(input);\r\n    } else {\r\n        IConvolutionLayer* conv_shortcut_dw =\r\n                network->addConvolutionNd(input, in_chs, DimsHW{dw_kernel_size, dw_kernel_size},\r\n                                          weightMap[lname + \".shortcut.0.weight\"], Weights{});\r\n        conv_shortcut_dw->setStrideNd(DimsHW{stride, stride});\r\n        
conv_shortcut_dw->setPaddingNd(DimsHW{(dw_kernel_size - 1) / 2, (dw_kernel_size - 1) / 2});\r\n        conv_shortcut_dw->setNbGroups(in_chs);\r\n        IScaleLayer* bn_shortcut_dw =\r\n                addBatchNorm2d(network, weightMap, *conv_shortcut_dw->getOutput(0), lname + \".shortcut.1\", 1e-5);\r\n\r\n        IConvolutionLayer* conv_shortcut_pw =\r\n                network->addConvolutionNd(*bn_shortcut_dw->getOutput(0), out_chs, DimsHW{1, 1},\r\n                                          weightMap[lname + \".shortcut.2.weight\"], Weights{});\r\n        IScaleLayer* bn_shortcut_pw =\r\n                addBatchNorm2d(network, weightMap, *conv_shortcut_pw->getOutput(0), lname + \".shortcut.3\", 1e-5);\r\n        shortcut_layer = bn_shortcut_pw;\r\n    }\r\n\r\n    IElementWiseLayer* ew_sum =\r\n            network->addElementWise(*ghost2->getOutput(0), *shortcut_layer->getOutput(0), ElementWiseOperation::kSUM);\r\n\r\n    return ew_sum;\r\n}\r\n\r\nICudaEngine* createEngine(IBuilder* builder, IBuilderConfig* config, DataType dt) {\r\n    // Use explicit batch mode\r\n    INetworkDefinition* network =\r\n            builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\r\n\r\n    // Create input tensor\r\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims4{batchSize, 3, INPUT_H, INPUT_W});\r\n    assert(data);\r\n\r\n    // Load weights\r\n    std::map<std::string, Weights> weightMap = loadWeights(\"../ghostnetv2.weights\");\r\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\r\n\r\n    // Step 1: Conv Stem\r\n    IActivationLayer* conv_stem = convBnReluStem(network, weightMap, *data, 16, \"conv_stem\");\r\n\r\n    ILayer* current_layer = conv_stem;\r\n\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 16, 16, 16, 3, 1, 0.0f, \"blocks.0.0\", 0);\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), 
weightMap, 16, 48, 24, 3, 2, 0.0f, \"blocks.1.0\", 1);\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 24, 72, 24, 3, 1, 0.0f, \"blocks.2.0\", 2);\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 24, 72, 40, 5, 2, 0.25f, \"blocks.3.0\", 3);\r\n    current_layer = ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 40, 120, 40, 5, 1, 0.25f,\r\n                                    \"blocks.4.0\", 4);\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 40, 240, 80, 3, 2, 0.0f, \"blocks.5.0\", 5);\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 80, 200, 80, 3, 1, 0.0f, \"blocks.6.0\", 6);\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 80, 184, 80, 3, 1, 0.0f, \"blocks.6.1\", 7);\r\n    current_layer =\r\n            ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 80, 184, 80, 3, 1, 0.0f, \"blocks.6.2\", 8);\r\n    current_layer = ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 80, 480, 112, 3, 1, 0.25f,\r\n                                    \"blocks.6.3\", 9);\r\n    current_layer = ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 112, 672, 112, 3, 1, 0.25f,\r\n                                    \"blocks.6.4\", 10);\r\n    current_layer = ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 112, 672, 160, 5, 2, 0.25f,\r\n                                    \"blocks.7.0\", 11);\r\n    current_layer = ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 160, 960, 160, 5, 1, 0.0f,\r\n                                    \"blocks.8.0\", 12);\r\n    current_layer = ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 160, 960, 160, 5, 1, 0.25f,\r\n                                    \"blocks.8.1\", 
13);\r\n    current_layer = ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 160, 960, 160, 5, 1, 0.0f,\r\n                                    \"blocks.8.2\", 14);\r\n    current_layer = ghostBottleneck(network, *current_layer->getOutput(0), weightMap, 160, 960, 160, 5, 1, 0.25f,\r\n                                    \"blocks.8.3\", 15);\r\n\r\n    // Apply ConvBnAct\r\n    current_layer = convBnAct(network, weightMap, *current_layer->getOutput(0), 960, \"blocks.9.0\");\r\n\r\n    // Global average pooling\r\n    IReduceLayer* global_pool =\r\n            network->addReduce(*current_layer->getOutput(0), ReduceOperation::kAVG, 1 << 2 | 1 << 3, true);\r\n    assert(global_pool);\r\n\r\n    // Conv Head\r\n    IConvolutionLayer* conv_head = network->addConvolutionNd(\r\n            *global_pool->getOutput(0), 1280, DimsHW{1, 1}, weightMap[\"conv_head.weight\"], weightMap[\"conv_head.bias\"]);\r\n    IActivationLayer* act2 = network->addActivation(*conv_head->getOutput(0), ActivationType::kRELU);\r\n\r\n    // Fully connected layer (classifier)\r\n    IFullyConnectedLayer* classifier = network->addFullyConnected(\r\n            *act2->getOutput(0), 1000, weightMap[\"classifier.weight\"], weightMap[\"classifier.bias\"]);\r\n    classifier->getOutput(0)->setName(OUTPUT_BLOB_NAME);\r\n    network->markOutput(*classifier->getOutput(0));\r\n\r\n    // Build the engine\r\n    config->setMaxWorkspaceSize(1 << 24);\r\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\r\n\r\n    // Destroy the network\r\n    network->destroy();\r\n\r\n    // Free memory\r\n    for (auto& mem : weightMap) {\r\n        free((void*)(mem.second.values));\r\n    }\r\n\r\n    return engine;\r\n}\r\n\r\nvoid APIToModel(IHostMemory** modelStream) {\r\n    // Create builder\r\n    IBuilder* builder = createInferBuilder(gLogger);\r\n    IBuilderConfig* config = builder->createBuilderConfig();\r\n\r\n    // Create model and serialize\r\n    ICudaEngine* engine 
= createEngine(builder, config, DataType::kFLOAT);\r\n    assert(engine != nullptr);\r\n\r\n    // Serialize the engine\r\n    (*modelStream) = engine->serialize();\r\n\r\n    // Release resources\r\n    engine->destroy();\r\n    config->destroy();\r\n    builder->destroy();\r\n}\r\n\r\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\r\n    const ICudaEngine& engine = context.getEngine();\r\n\r\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\r\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\r\n\r\n    // Input and output buffers\r\n    void* buffers[2];\r\n\r\n    // Create buffers on device\r\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\r\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\r\n\r\n    // Create stream\r\n    cudaStream_t stream;\r\n    CHECK(cudaStreamCreate(&stream));\r\n\r\n    // Copy input data to device, execute inference, and copy output back to host\r\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float),\r\n                          cudaMemcpyHostToDevice, stream));\r\n    context.enqueueV2(buffers, stream, nullptr);\r\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost,\r\n                          stream));\r\n    cudaStreamSynchronize(stream);\r\n\r\n    // Release stream and buffers\r\n    cudaStreamDestroy(stream);\r\n    CHECK(cudaFree(buffers[inputIndex]));\r\n    CHECK(cudaFree(buffers[outputIndex]));\r\n}\r\n\r\nint main(int argc, char** argv) {\r\n    if (argc != 2) {\r\n        std::cerr << \"arguments not right!\" << std::endl;\r\n        std::cerr << \"./ghostnetv2 -s   // serialize model to plan file\" << std::endl;\r\n        std::cerr << \"./ghostnetv2 -d   // deserialize plan file and run inference\" << std::endl;\r\n        return 
-1;\r\n    }\r\n\r\n    // Buffer for the serialized engine (filled when deserializing with -d)\r\n    char* trtModelStream{nullptr};\r\n    size_t size{0};\r\n\r\n    if (std::string(argv[1]) == \"-s\") {\r\n        IHostMemory* modelStream{nullptr};\r\n        APIToModel(&modelStream);\r\n        assert(modelStream != nullptr);\r\n\r\n        std::ofstream p(\"ghostnetv2.engine\", std::ios::binary);\r\n        if (!p) {\r\n            std::cerr << \"could not open plan output file\" << std::endl;\r\n            return -1;\r\n        }\r\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\r\n        modelStream->destroy();\r\n        return 0;\r\n    } else if (std::string(argv[1]) == \"-d\") {\r\n        std::ifstream file(\"ghostnetv2.engine\", std::ios::binary);\r\n        if (file.good()) {\r\n            file.seekg(0, file.end);\r\n            size = file.tellg();\r\n            file.seekg(0, file.beg);\r\n            trtModelStream = new char[size];\r\n            assert(trtModelStream);\r\n            file.read(trtModelStream, size);\r\n            file.close();\r\n        } else {\r\n            // Without the plan file there is nothing to deserialize; fail early with a clear message\r\n            std::cerr << \"could not read ghostnetv2.engine, run with -s first\" << std::endl;\r\n            return -1;\r\n        }\r\n    } else {\r\n        std::cerr << \"unknown argument: \" << argv[1] << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // Allocate input and output data\r\n    float* data = new float[batchSize * 3 * INPUT_H * INPUT_W];\r\n    for (int i = 0; i < batchSize * 3 * INPUT_H * INPUT_W; i++)\r\n        data[i] = 10.0;\r\n\r\n    float* prob = new float[batchSize * OUTPUT_SIZE];\r\n\r\n    IRuntime* runtime = createInferRuntime(gLogger);\r\n    assert(runtime != nullptr);\r\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\r\n    assert(engine != nullptr);\r\n    IExecutionContext* context = engine->createExecutionContext();\r\n    assert(context != nullptr);\r\n    delete[] trtModelStream;\r\n\r\n    // Execute inference\r\n    doInference(*context, data, prob, batchSize);\r\n\r\n    // Print output results\r\n    std::cout << \"\\nOutput:\\n\\n\";\r\n    for (int i = 0; i < batchSize; i++) {\r\n        std::cout << 
\"Batch \" << i << \":\\n\";\r\n        for (unsigned int j = 0; j < OUTPUT_SIZE; j++) {\r\n            std::cout << prob[i * OUTPUT_SIZE + j] << \", \";\r\n            if (j % 10 == 0)\r\n                std::cout << j / 10 << std::endl;\r\n        }\r\n        std::cout << \"\\n\";\r\n    }\r\n\r\n    // Release resources\r\n    context->destroy();\r\n    engine->destroy();\r\n    runtime->destroy();\r\n    delete[] data;\r\n    delete[] prob;\r\n\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "ghostnet/ghostnetv2/logging.h",
    "content": "/*\r\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\r\n *\r\n * Licensed under the Apache License, Version 2.0 (the \"License\");\r\n * you may not use this file except in compliance with the License.\r\n * You may obtain a copy of the License at\r\n *\r\n *     http://www.apache.org/licenses/LICENSE-2.0\r\n *\r\n * Unless required by applicable law or agreed to in writing, software\r\n * distributed under the License is distributed on an \"AS IS\" BASIS,\r\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\r\n * See the License for the specific language governing permissions and\r\n * limitations under the License.\r\n */\r\n\r\n#ifndef TENSORRT_LOGGING_H\r\n#define TENSORRT_LOGGING_H\r\n\r\n#include <cassert>\r\n#include <ctime>\r\n#include <iomanip>\r\n#include <iostream>\r\n#include <ostream>\r\n#include <sstream>\r\n#include <string>\r\n#include \"NvInferRuntimeCommon.h\"\r\n\r\nusing Severity = nvinfer1::ILogger::Severity;\r\n\r\nclass LogStreamConsumerBuffer : public std::stringbuf {\r\n   public:\r\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\r\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\r\n\r\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\r\n\r\n    ~LogStreamConsumerBuffer() {\r\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\r\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\r\n        // if the pointer to the beginning is not equal to the pointer to the current position,\r\n        // call putOutput() to log the output to the stream\r\n        if (pbase() != pptr()) {\r\n            putOutput();\r\n        }\r\n    }\r\n\r\n    // synchronizes the stream buffer and returns 0 on success\r\n    // synchronizing the stream buffer consists of inserting the buffer 
contents into the stream,\r\n    // resetting the buffer and flushing the stream\r\n    virtual int sync() {\r\n        putOutput();\r\n        return 0;\r\n    }\r\n\r\n    void putOutput() {\r\n        if (mShouldLog) {\r\n            // prepend timestamp\r\n            std::time_t timestamp = std::time(nullptr);\r\n            tm* tm_local = std::localtime(&timestamp);\r\n            std::cout << \"[\";\r\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\r\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\r\n            // std::stringbuf::str() gets the string contents of the buffer\r\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\r\n            mOutput << mPrefix << str();\r\n            // set the buffer to empty\r\n            str(\"\");\r\n            // flush the stream\r\n            mOutput.flush();\r\n        }\r\n    }\r\n\r\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\r\n\r\n   private:\r\n    std::ostream& mOutput;\r\n    std::string mPrefix;\r\n    bool mShouldLog;\r\n};\r\n\r\n//!\r\n//! \\class LogStreamConsumerBase\r\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\r\n//!\r\nclass LogStreamConsumerBase {\r\n   public:\r\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\r\n        : mBuffer(stream, prefix, shouldLog) {}\r\n\r\n   protected:\r\n    LogStreamConsumerBuffer mBuffer;\r\n};\r\n\r\n//!\r\n//! 
\\class LogStreamConsumer\r\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\r\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\r\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\r\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\r\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\r\n//!  Please do not change the order of the parent classes.\r\n//!\r\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\r\n   public:\r\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\r\n    //!  Reportable severity determines if the messages are severe enough to be logged.\r\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\r\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\r\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\r\n          ,\r\n          mShouldLog(severity <= reportableSeverity),\r\n          mSeverity(severity) {}\r\n\r\n    LogStreamConsumer(LogStreamConsumer&& other)\r\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\r\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\r\n          ,\r\n          mShouldLog(other.mShouldLog),\r\n          mSeverity(other.mSeverity) {}\r\n\r\n    void setReportableSeverity(Severity reportableSeverity) {\r\n        mShouldLog = mSeverity <= reportableSeverity;\r\n        mBuffer.setShouldLog(mShouldLog);\r\n    }\r\n\r\n   private:\r\n    static std::ostream& severityOstream(Severity severity) {\r\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\r\n    }\r\n\r\n    static std::string severityPrefix(Severity severity) {\r\n        switch (severity) {\r\n            case Severity::kINTERNAL_ERROR:\r\n                return \"[F] \";\r\n            case Severity::kERROR:\r\n                return \"[E] \";\r\n            case Severity::kWARNING:\r\n                return \"[W] \";\r\n            case Severity::kINFO:\r\n                return \"[I] \";\r\n            case Severity::kVERBOSE:\r\n                return \"[V] \";\r\n            default:\r\n                assert(0);\r\n                return \"\";\r\n        }\r\n    }\r\n\r\n    bool mShouldLog;\r\n    Severity mSeverity;\r\n};\r\n\r\n//! \\class Logger\r\n//!\r\n//! \\brief Class which manages logging of TensorRT tools and samples\r\n//!\r\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\r\n//! and supports logging two types of messages:\r\n//!\r\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\r\n//! - Test pass/fail messages\r\n//!\r\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\r\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\r\n//!\r\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\r\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\r\n//!\r\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\r\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\r\n//! library and messages coming from the sample.\r\n//!\r\n//! 
In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\r\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\r\n//! object.\r\n\r\nclass Logger : public nvinfer1::ILogger {\r\n   public:\r\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\r\n\r\n    //!\r\n    //! \\enum TestResult\r\n    //! \\brief Represents the state of a given test\r\n    //!\r\n    enum class TestResult {\r\n        kRUNNING,  //!< The test is running\r\n        kPASSED,   //!< The test passed\r\n        kFAILED,   //!< The test failed\r\n        kWAIVED    //!< The test was waived\r\n    };\r\n\r\n    //!\r\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\r\n    //! \\return The nvinfer1::ILogger associated with this Logger\r\n    //!\r\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\r\n    //! we can eliminate the inheritance of Logger from ILogger\r\n    //!\r\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\r\n\r\n    //!\r\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\r\n    //!\r\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\r\n    //! inheritance from nvinfer1::ILogger\r\n    //!\r\n    void log(Severity severity, const char* msg) noexcept override {\r\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\r\n    }\r\n\r\n    //!\r\n    //! \\brief Method for controlling the verbosity of logging output\r\n    //!\r\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\r\n    //!\r\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\r\n\r\n    //!\r\n    //! 
\\brief Opaque handle that holds logging information for a particular test\r\n    //!\r\n    //! This object is an opaque handle to information used by the Logger to print test results.\r\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\r\n    //! with Logger::reportTest{Start,End}().\r\n    //!\r\n    class TestAtom {\r\n       public:\r\n        TestAtom(TestAtom&&) = default;\r\n\r\n       private:\r\n        friend class Logger;\r\n\r\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\r\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\r\n\r\n        bool mStarted;\r\n        std::string mName;\r\n        std::string mCmdline;\r\n    };\r\n\r\n    //!\r\n    //! \\brief Define a test for logging\r\n    //!\r\n    //! \\param[in] name The name of the test.  This should be a string starting with\r\n    //!                  \"TensorRT\" and containing dot-separated strings containing\r\n    //!                  the characters [A-Za-z0-9_].\r\n    //!                  For example, \"TensorRT.sample_googlenet\"\r\n    //! \\param[in] cmdline The command line used to reproduce the test\r\n    //\r\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\r\n    //!\r\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\r\n        return TestAtom(false, name, cmdline);\r\n    }\r\n\r\n    //!\r\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\r\n    //!        as input\r\n    //!\r\n    //! \\param[in] name The name of the test\r\n    //! \\param[in] argc The number of command-line arguments\r\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\r\n    //!\r\n    //! 
\\return a TestAtom that can be used in Logger::reportTest{Start,End}().\r\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\r\n        auto cmdline = genCmdlineString(argc, argv);\r\n        return defineTest(name, cmdline);\r\n    }\r\n\r\n    //!\r\n    //! \\brief Report that a test has started.\r\n    //!\r\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\r\n    //!\r\n    //! \\param[in] testAtom The handle to the test that has started\r\n    //!\r\n    static void reportTestStart(TestAtom& testAtom) {\r\n        reportTestResult(testAtom, TestResult::kRUNNING);\r\n        assert(!testAtom.mStarted);\r\n        testAtom.mStarted = true;\r\n    }\r\n\r\n    //!\r\n    //! \\brief Report that a test has ended.\r\n    //!\r\n    //! \\pre reportTestStart() has been called for the given testAtom\r\n    //!\r\n    //! \\param[in] testAtom The handle to the test that has ended\r\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\r\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\r\n    //!\r\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\r\n        assert(result != TestResult::kRUNNING);\r\n        assert(testAtom.mStarted);\r\n        reportTestResult(testAtom, result);\r\n    }\r\n\r\n    static int reportPass(const TestAtom& testAtom) {\r\n        reportTestEnd(testAtom, TestResult::kPASSED);\r\n        return EXIT_SUCCESS;\r\n    }\r\n\r\n    static int reportFail(const TestAtom& testAtom) {\r\n        reportTestEnd(testAtom, TestResult::kFAILED);\r\n        return EXIT_FAILURE;\r\n    }\r\n\r\n    static int reportWaive(const TestAtom& testAtom) {\r\n        reportTestEnd(testAtom, TestResult::kWAIVED);\r\n        return EXIT_SUCCESS;\r\n    }\r\n\r\n    static int reportTest(const TestAtom& testAtom, bool pass) {\r\n        return pass ? 
reportPass(testAtom) : reportFail(testAtom);\r\n    }\r\n\r\n    Severity getReportableSeverity() const { return mReportableSeverity; }\r\n\r\n   private:\r\n    //!\r\n    //! \\brief returns an appropriate string for prefixing a log message with the given severity\r\n    //!\r\n    static const char* severityPrefix(Severity severity) {\r\n        switch (severity) {\r\n            case Severity::kINTERNAL_ERROR:\r\n                return \"[F] \";\r\n            case Severity::kERROR:\r\n                return \"[E] \";\r\n            case Severity::kWARNING:\r\n                return \"[W] \";\r\n            case Severity::kINFO:\r\n                return \"[I] \";\r\n            case Severity::kVERBOSE:\r\n                return \"[V] \";\r\n            default:\r\n                assert(0);\r\n                return \"\";\r\n        }\r\n    }\r\n\r\n    //!\r\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\r\n    //!\r\n    static const char* testResultString(TestResult result) {\r\n        switch (result) {\r\n            case TestResult::kRUNNING:\r\n                return \"RUNNING\";\r\n            case TestResult::kPASSED:\r\n                return \"PASSED\";\r\n            case TestResult::kFAILED:\r\n                return \"FAILED\";\r\n            case TestResult::kWAIVED:\r\n                return \"WAIVED\";\r\n            default:\r\n                assert(0);\r\n                return \"\";\r\n        }\r\n    }\r\n\r\n    //!\r\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\r\n    //!\r\n    static std::ostream& severityOstream(Severity severity) {\r\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\r\n    }\r\n\r\n    //!\r\n    //! 
\\brief method that implements logging test results\r\n    //!\r\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\r\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\r\n                                         << testAtom.mCmdline << std::endl;\r\n    }\r\n\r\n    //!\r\n    //! \\brief generate a command line string from the given (argc, argv) values\r\n    //!\r\n    static std::string genCmdlineString(int argc, char const* const* argv) {\r\n        std::stringstream ss;\r\n        for (int i = 0; i < argc; i++) {\r\n            if (i > 0)\r\n                ss << \" \";\r\n            ss << argv[i];\r\n        }\r\n        return ss.str();\r\n    }\r\n\r\n    Severity mReportableSeverity;\r\n};\r\n\r\nnamespace {\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\r\n}\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\r\n}\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\r\n}\r\n\r\n//!\r\n//! 
\\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\r\n}\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\r\n//!        (\"fatal\" severity)\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\r\n}\r\n\r\n}  // anonymous namespace\r\n\r\n#endif  // TENSORRT_LOGGING_H\r\n"
  },
  {
    "path": "googlenet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.14)\n\nproject(\n  googlenet\n  VERSION 0.1\n  LANGUAGES C CXX CUDA)\n\nif(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)\n  set(CMAKE_CUDA_ARCHITECTURES\n      60\n      70\n      72\n      75\n      80\n      86\n      89)\nendif()\n\nset(CMAKE_CXX_STANDARD 17)\nset(CMAKE_CXX_STANDARD_REQUIRED ON)\nset(CMAKE_CUDA_STANDARD 17)\nset(CMAKE_CUDA_STANDARD_REQUIRED ON)\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\nset(CMAKE_INCLUDE_CURRENT_DIR TRUE)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use static cudaruntime library\" OFF)\n\nfind_package(Threads REQUIRED)\nfind_package(CUDAToolkit REQUIRED)\nfind_package(OpenCV REQUIRED)\n\nif(NOT TARGET TensorRT::TensorRT)\n  include(FindTensorRT.cmake)\nendif()\n\nadd_executable(${PROJECT_NAME} googlenet.cpp)\n\ntarget_include_directories(${PROJECT_NAME} PUBLIC ${CMAKE_CURRENT_LIST_DIR}\n                                                  ${OpenCV_INCLUDE_DIRS})\n\ntarget_link_libraries(${PROJECT_NAME} PUBLIC Threads::Threads CUDA::cudart\n                                             TensorRT::TensorRT ${OpenCV_LIBS})\n"
  },
  {
    "path": "googlenet/FindTensorRT.cmake",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nfunction(_guess_path var_name required_files)\n  set(_result \"\")\n\n  foreach(path_entry IN LISTS ARGN)\n    if(NOT EXISTS \"${path_entry}\")\n      message(DEBUG \"skip non-existing path '${path_entry}'\")\n      continue()\n    endif()\n\n    set(_ok TRUE)\n    foreach(required_file IN LISTS required_files)\n      if(NOT EXISTS \"${path_entry}/${required_file}\")\n        set(_ok FALSE)\n        message(DEBUG \"'${path_entry}' missing '${required_file}'\")\n        break()\n      endif()\n    endforeach()\n\n    if(_ok)\n      list(APPEND _result \"${path_entry}\")\n      message(DEBUG \"accept '${path_entry}'\")\n    else()\n      message(DEBUG \"reject '${path_entry}'\")\n    endif()\n  endforeach()\n\n  if(_result STREQUAL \"\")\n    message(\n      FATAL_ERROR\n        \"_guess_path(${var_name}) failed: no valid path found. required_files='${required_files}' candidates='${ARGN}'\"\n    )\n  endif()\n\n  set(${var_name}\n      \"${_result}\"\n      PARENT_SCOPE)\nendfunction()\n\n# add library\nadd_library(TensorRT IMPORTED INTERFACE)\nadd_library(TensorRT::TensorRT ALIAS TensorRT)\n\nset(TRT_VERSION\n    CACHE\n      STRING\n      \"TensorRT version, e.g. 
\\\"8.6.1.6\\\" or \\\"8.6.1.6+cuda12.0.1.011\\\", \\\"8.6.1.6.Windows10.x86_64.cuda-12.0\\\" etc\"\n)\n\nif(NOT TRT_VERSION STREQUAL \"\" AND NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  message(\n    WARNING\n      \"TRT_VERSION defined by cmake and environment variable both, using the later one\"\n  )\nendif()\n\nif(NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  set(TRT_VERSION $ENV{TRT_VERSION})\nendif()\n\nstring(REGEX MATCH \"([0-9]+)\" _match ${TRT_VERSION})\nset(TRT_MAJOR_VERSION \"${_match}\")\nunset(_match)\n\nif(WIN32)\n  set(TensorRT_DIR \"C:/Program Files/TensorRT-${TRT_VERSION}\")\n  if(NOT EXISTS \"${TensorRT_DIR}\")\n    message(\n      FATAL_ERROR\n        \"TensorRT_DIR=${TensorRT_DIR} does not exist!\"\n    )\n  endif()\n\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 10)\n    set(_modules nvinfer_10 nvinfer_plugin_10 nvinfer_vc_plugin_10\n                 nvinfer_dispatch_10 nvinfer_lean_10)\n    message(DEBUG \"Using ${_modules}\")\n  else()\n    set(_modules nvinfer nvinfer_plugin nvinfer_vc_plugin nvinfer_dispatch\n                 nvinfer_lean)\n  endif()\n\n  set(TensorRT_LIBRARY_DIR \"${TensorRT_DIR}/lib\")\n  set(TensorRT_INCLUDE_DIR \"${TensorRT_DIR}/include\")\nelseif(UNIX)\n  string(TOLOWER \"${CMAKE_SYSTEM_PROCESSOR}\" _trt_arch)\n  set(_trt_include_candidates)\n  if(_trt_arch MATCHES \"^(aarch64|arm64|arch64)$\")\n    set(_trt_include_candidates \"/usr/include/aarch64-linux-gnu\" \"/usr/include\"\n                                \"/usr/local/cuda/targets/aarch64-linux/include\")\n    set(_trt_library_candidates\n        \"/usr/local/tensorrt/targets/aarch64-linux-gnu/lib\"\n        \"/usr/lib/aarch64-linux-gnu\" \"/usr/lib/aarch64-linux-gnu/tegra\"\n        \"/usr/lib\")\n  elseif(_trt_arch MATCHES \"^(x86_64|amd64)$\")\n    set(_trt_include_candidates\n        \"/usr/local/tensorrt/targets/x86_64-linux-gnu/include\"\n        \"/usr/include/x86_64-linux-gnu\" \"/usr/include\")\n    set(_trt_library_candidates\n        
\"/usr/local/tensorrt/targets/x86_64-linux-gnu/lib\"\n        \"/usr/lib/x86_64-linux-gnu\" \"/usr/lib\")\n  else()\n    message(FATAL_ERROR \"Unknown architecture\")\n  endif()\n\n  set(_modules nvinfer nvinfer_plugin)\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 8)\n    list(APPEND _modules nvinfer_vc_plugin nvinfer_dispatch nvinfer_lean)\n  endif()\n\n  _guess_path(TensorRT_LIBRARY_DIR \"libnvinfer.so;libnvinfer_plugin.so\"\n              ${_trt_library_candidates})\n  message(STATUS \"TensorRT libraries: ${TensorRT_LIBRARY_DIR}\")\n  _guess_path(TensorRT_INCLUDE_DIR \"NvInfer.h\" ${_trt_include_candidates})\n  message(STATUS \"TensorRT includes: ${TensorRT_INCLUDE_DIR}\")\nendif()\n\nforeach(lib IN LISTS _modules)\n  find_library(\n    TensorRT_${lib}_LIBRARY\n    NAMES ${lib}\n    HINTS ${TensorRT_LIBRARY_DIR})\n  list(APPEND TensorRT_LIBRARIES ${TensorRT_${lib}_LIBRARY})\nendforeach()\n\ntarget_link_libraries(TensorRT INTERFACE ${TensorRT_LIBRARIES})\n\nmessage(STATUS \"Found TensorRT libs: ${TensorRT_LIBRARIES}\")\n\nset_target_properties(\n  TensorRT\n  PROPERTIES C_STANDARD 17\n             CXX_STANDARD 17\n             POSITION_INDEPENDENT_CODE ON\n             SKIP_BUILD_RPATH TRUE\n             BUILD_WITH_INSTALL_RPATH TRUE\n             INSTALL_RPATH \"$ORIGIN\"\n             INTERFACE_INCLUDE_DIRECTORIES \"${TensorRT_INCLUDE_DIR}\")\n\nunset(TRT_MAJOR_VERSION)\nunset(_modules)\nunset(_trt_include_candidates)\nunset(_trt_library_candidates)\nunset(_trt_arch)\n"
  },
  {
    "path": "googlenet/README.md",
    "content": "# Googlenet\n\n## Introduction\n\nGoogLeNet (Inception v1) model architecture from [Going Deeper with Convolutions](http://arxiv.org/abs/1409.4842). For model details, refer to code from [torchvision](https://github.com/pytorch/vision/blob/main/torchvision/models/googlenet.py#L29), for generating `.wts` file, refer to [pytorchx/googlenet](https://github.com/wang-xinyu/pytorchx/tree/master/googlenet)\n\n## Usage\n\n1. use `gen_wts.py` to generate wts file.\n\n```bash\npython3 gen_wts.py\n```\n\n2. build C++ code\n\n```bash\npushd tensorrtx/googlenet\ncmake -S . -B build -G Ninja --fresh\ncmake --build build\n```\n\n3. serialize wts model to engine file.\n\n```bash\n./build/googlenet -s\n```\n\n4. run inference\n\n```bash\n./build/googlenet -i\n```\n\noutput looks like:\n\n```bash\n...\n====\nExecution time: 637us\n-1.823, -0.9841, 0.6483, 0.7607, -0.4659, -1.407, -2.807, -1.175, -0.4034, -1.881, -1.267, -1.654, 0.7542, -1.777, -0.7118, -2.134, -1.542, 0.1852, -3.036, -0.5396, -0.1669,\n====\nprediction result:\nTop: 0 idx: 285, logits: 9.9, label: Egyptian cat\nTop: 1 idx: 281, logits: 8.304, label: tabby, tabby cat\nTop: 2 idx: 282, logits: 6.859, label: tiger cat\n```\n"
  },
  {
    "path": "googlenet/gen_wts.py",
    "content": "import struct\n\nimport cv2\nimport numpy as np\nimport torch\nfrom torchvision.models.googlenet import googlenet\n\n\ndef read_imagenet_labels() -> dict[int, str]:\n    \"\"\"\n    read ImageNet 1000 labels\n\n    Returns:\n        dict[int, str]: labels dict\n    \"\"\"\n    clsid2label = {}\n    with open(\"../assets/imagenet1000_clsidx_to_labels.txt\", \"r\") as f:\n        for i in f.readlines():\n            k, v = i.split(\": \")\n            clsid2label.setdefault(int(k), v[1:-3])\n    return clsid2label\n\n\ndef preprocess(img: np.array) -> torch.Tensor:\n    \"\"\"\n    a preprocess method align with ImageNet dataset\n\n    Args:\n        img (np.array): input image\n\n    Returns:\n        torch.Tensor: preprocessed image in `NCHW` layout\n    \"\"\"\n    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0\n    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)\n    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)\n    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)\n    img = (img - mean) / std\n    img = img.transpose(2, 0, 1)[None, ...]\n    return torch.from_numpy(img)\n\n\ndef main():\n    labels = read_imagenet_labels()\n\n    img = cv2.imread(\"../assets/cats.jpg\", cv2.IMREAD_COLOR)\n    img = preprocess(img)\n\n    model = googlenet(pretrained=True)\n    with torch.inference_mode():\n        model = model.eval()\n        output = model(img)\n        for i, batch in enumerate(torch.topk(output, k=3).indices):\n            for j, idx in enumerate(batch):\n                print(f\"\\tBatch: {i}, Top: {j}, logits: {output[i][idx]:.4f}, label: {labels[int(idx)]}\")\n        print(f\"{'=' * 32}\")\n\n    with open(\"../models/googlenet.wts\", \"w\") as f:\n        f.write(\"{}\\n\".format(len(model.state_dict().keys())))\n        for k, v in model.state_dict().items():\n            vr = v.reshape(-1).cpu().numpy()\n            f.write(\"{} {} \".format(k, len(vr)))\n            print(k, 
v.shape)\n            for vv in vr:\n                f.write(\" \")\n                f.write(struct.pack(\">f\", float(vv)).hex())\n            f.write(\"\\n\")\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "googlenet/googlenet.cpp",
    "content": "#include <NvInfer.h>\n#include <cassert>\n#include <chrono>\n#include <cmath>\n#include <opencv2/opencv.hpp>\n#include <vector>\n#include \"logging.h\"\n#include \"utils.h\"\n\nusing WeightMap = std::map<std::string, Weights>;\nusing M = nvinfer1::MatrixOperation;\nusing E = nvinfer1::ElementWiseOperation;\nusing NDCF = nvinfer1::NetworkDefinitionCreationFlag;\n\nstatic Logger gLogger;\n\n// stuff we know about googlenet\nstatic constexpr const std::size_t N = 1;\nstatic constexpr const int32_t INPUT_H = 224;\nstatic constexpr const int32_t INPUT_W = 224;\nstatic constexpr const std::array<int32_t, 2> SIZES = {3 * INPUT_H * INPUT_W, 1000};\nstatic constexpr const std::array<const char*, 2> NAMES = {\"data\", \"prob\"};\nstatic constexpr const bool TRT_PREPROCESS = TRT_VERSION >= 8510 ? true : false;\nstatic constexpr const char* WTS_PATH = \"../models/googlenet.wts\";\nstatic constexpr const char* ENGINE_PATH = \"../models/googlenet.engine\";\nstatic constexpr const char* LABELS_PATH = \"../assets/imagenet1000_clsidx_to_labels.txt\";\nstatic constexpr const std::array<const float, 3> mean = {0.485f, 0.456f, 0.406f};\nstatic constexpr const std::array<const float, 3> stdv = {0.229f, 0.224f, 0.225f};\n\nauto addBatchNorm2d(INetworkDefinition* network, WeightMap& m, ITensor& input, const std::string& lname,\n                    float eps = 1e-3) -> ILayer* {\n    static Weights none{DataType::kFLOAT, nullptr, 0ll};\n    float* gamma = (float*)m[lname + \".weight\"].values;\n    float* beta = (float*)m[lname + \".bias\"].values;\n    float* mean = (float*)m[lname + \".running_mean\"].values;\n    float* var = (float*)m[lname + \".running_var\"].values;\n    int64_t len = m[lname + \".running_var\"].count;\n\n    auto* scval = static_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n\n    auto* shift_val = 
static_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shift_val[i] = beta[i] - (mean[i] * scval[i]);\n    }\n    Weights shift{DataType::kFLOAT, shift_val, len};\n\n    m[lname + \".scale\"] = scale;\n    m[lname + \".shift\"] = shift;\n    m[lname + \".power\"] = none;\n    auto* bn = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, none);\n    assert(bn);\n    bn->setName(lname.c_str());\n    return bn;\n}\n\n/**\n * @brief A basic conv2d+bn+relu layer from googlenet\n *\n * @param network network definition from TensorRT API\n * @param weightMap weight map\n * @param input input tensor\n * @param outch output channels\n * @param k kernel size for convolution\n * @param s stride size for convolution\n * @param p padding size for convolution\n * @param lname layer name from weight map\n * @return ILayer*\n */\nILayer* basicConv2d(INetworkDefinition* network, WeightMap& weightMap, ITensor& input, const std::string& lname,\n                    int32_t outch, int k, int s = 1, int p = 0) {\n    static const Weights none{DataType::kFLOAT, nullptr, 0ll};\n    auto* conv = network->addConvolutionNd(input, outch, DimsHW{k, k}, weightMap[lname + \".conv.weight\"], none);\n    auto* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\");\n    auto* relu = network->addActivation(*bn->getOutput(0), ActivationType::kRELU);\n    assert(conv && bn && relu);\n    conv->setName(lname.c_str());\n    bn->setName((lname + \".bn\").c_str());\n    relu->setName((lname + \".relu\").c_str());\n    conv->setStrideNd(DimsHW{s, s});\n    conv->setPaddingNd(DimsHW{p, p});\n    return relu;\n}\n\n/**\n * @brief Inception module from googlenet implementation in torchvision, see:\n * https://github.com/pytorch/vision/blob/v0.24.1/torchvision/models/googlenet.py#L184\n *\n * @param network network definition from TensorRT API\n * @param weightMap weight map\n * @param input input tensor\n * @param lname layer name from 
weight map\n * @param ch1x1 output channels of the 1x1 branch\n * @param ch3x3red output channels of the 1x1 reduction before the 3x3 conv\n * @param ch3x3 output channels of the 3x3 branch\n * @param ch5x5red output channels of the 1x1 reduction before the \"5x5\" conv\n * @param ch5x5 output channels of the \"5x5\" branch (uses a 3x3 kernel here)\n * @param pool_proj output channels of the 1x1 conv after max pooling\n * @return IConcatenationLayer*\n */\nIConcatenationLayer* inception(INetworkDefinition* network, WeightMap& weightMap, ITensor& input,\n                               const std::string& lname, int ch1x1, int ch3x3red, int ch3x3, int ch5x5red, int ch5x5,\n                               int pool_proj) {\n    // \"cbr\" means \"Conv-Batchnorm-Relu\"\n    auto* cbr1 = basicConv2d(network, weightMap, input, lname + \"branch1\", ch1x1, 1);\n    auto* cbr2 = basicConv2d(network, weightMap, input, lname + \"branch2.0\", ch3x3red, 1);\n    auto* cbr3 = basicConv2d(network, weightMap, *cbr2->getOutput(0), lname + \"branch2.1\", ch3x3, 3, 1, 1);\n    auto* cbr4 = basicConv2d(network, weightMap, input, lname + \"branch3.0\", ch5x5red, 1);\n    auto* cbr5 = basicConv2d(network, weightMap, *cbr4->getOutput(0), lname + \"branch3.1\", ch5x5, 3, 1, 1);\n    auto* pool1 = network->addPoolingNd(input, PoolingType::kMAX, DimsHW{3, 3});\n    auto* cbr6 = basicConv2d(network, weightMap, *pool1->getOutput(0), lname + \"branch4.1\", pool_proj, 1);\n    assert(cbr1 && cbr2 && cbr3 && cbr4 && cbr5 && pool1 && cbr6);\n    pool1->setStrideNd(DimsHW{1, 1});\n    pool1->setPaddingNd(DimsHW{1, 1});\n    pool1->setPaddingMode(PaddingMode::kEXPLICIT_ROUND_UP);\n\n    std::array<ITensor*, 4> inputTensors = {cbr1->getOutput(0), cbr3->getOutput(0), cbr5->getOutput(0),\n                                            cbr6->getOutput(0)};\n    IConcatenationLayer* cat1 = network->addConcatenation(inputTensors.data(), 4);\n    assert(cat1);\n    return cat1;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(int32_t N, IRuntime* runtime, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    WeightMap weightMap = loadWeights(WTS_PATH);\n\n#if TRT_VERSION >= 11200\n    auto flag = 1U << static_cast<int>(NDCF::kSTRONGLY_TYPED);\n#elif 
TRT_VERSION >= 10000\n    auto flag = 0U;\n#else\n    auto flag = 1U << static_cast<int>(NDCF::kEXPLICIT_BATCH);\n#endif\n    auto* network = builder->createNetworkV2(flag);\n\n    ITensor* input{nullptr};\n    if constexpr (TRT_PREPROCESS) {\n        dt = DataType::kUINT8;\n        input = network->addInput(NAMES[0], dt, Dims4{N, INPUT_H, INPUT_W, 3});\n        auto* trans = addTransformLayer(network, *input, true, mean, stdv);\n        input = trans->getOutput(0);\n    } else {\n        input = network->addInput(NAMES[0], dt, Dims4{N, 3, INPUT_H, INPUT_W});\n    }\n    assert(input);\n\n    auto* relu1 = basicConv2d(network, weightMap, *input, \"conv1\", 64, 7, 2, 3);\n    auto* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    pool1->setPaddingMode(PaddingMode::kEXPLICIT_ROUND_UP);\n    pool1->setName(\"pool1\");\n\n    auto* relu2 = basicConv2d(network, weightMap, *pool1->getOutput(0), \"conv2\", 64, 1);\n    auto* relu3 = basicConv2d(network, weightMap, *relu2->getOutput(0), \"conv3\", 192, 3, 1, 1);\n    auto* pool2 = network->addPoolingNd(*relu3->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{2, 2});\n    pool2->setPaddingMode(PaddingMode::kEXPLICIT_ROUND_UP);\n    pool2->setName(\"pool2\");\n\n    auto* cat1 = inception(network, weightMap, *pool2->getOutput(0), \"inception3a.\", 64, 96, 128, 16, 32, 32);\n    auto* cat2 = inception(network, weightMap, *cat1->getOutput(0), \"inception3b.\", 128, 128, 192, 32, 96, 64);\n    auto* pool3 = network->addPoolingNd(*cat2->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool3);\n    pool3->setStrideNd(DimsHW{2, 2});\n    pool3->setPaddingMode(PaddingMode::kEXPLICIT_ROUND_UP);\n    pool3->setName(\"pool3\");\n\n    auto* cat3 = inception(network, weightMap, *pool3->getOutput(0), \"inception4a.\", 192, 96, 208, 16, 48, 64);\n    cat3 = inception(network, 
weightMap, *cat3->getOutput(0), \"inception4b.\", 160, 112, 224, 24, 64, 64);\n    cat3 = inception(network, weightMap, *cat3->getOutput(0), \"inception4c.\", 128, 128, 256, 24, 64, 64);\n    cat3 = inception(network, weightMap, *cat3->getOutput(0), \"inception4d.\", 112, 144, 288, 32, 64, 64);\n    cat3 = inception(network, weightMap, *cat3->getOutput(0), \"inception4e.\", 256, 160, 320, 32, 128, 128);\n\n    IPoolingLayer* pool4 = network->addPoolingNd(*cat3->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool4);\n    pool4->setStrideNd(DimsHW{2, 2});\n    pool4->setPaddingMode(PaddingMode::kEXPLICIT_ROUND_UP);\n    pool4->setName(\"pool4\");\n\n    cat3 = inception(network, weightMap, *pool4->getOutput(0), \"inception5a.\", 256, 160, 320, 32, 128, 128);\n    cat3 = inception(network, weightMap, *cat3->getOutput(0), \"inception5b.\", 384, 192, 384, 48, 128, 128);\n\n    // this is a AdaptiveAvgPool2d in pytorch implementation\n    IPoolingLayer* pool5 = network->addPoolingNd(*cat3->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n    auto* shuffle = network->addShuffle(*pool5->getOutput(0));\n    assert(pool5 && shuffle);\n    shuffle->setName(\"shuffle\");\n    shuffle->setReshapeDimensions(Dims2{1, -1});  // \"-1\" means \"1024\"\n\n    auto* fcw = network->addConstant(Dims2{1000, 1024}, weightMap[\"fc.weight\"])->getOutput(0);\n    auto* fcb = network->addConstant(Dims2{1, 1000}, weightMap[\"fc.bias\"])->getOutput(0);\n    auto* fc0 = network->addMatrixMultiply(*shuffle->getOutput(0), M::kNONE, *fcw, M::kTRANSPOSE);\n    auto* fc1 = network->addElementWise(*fc0->getOutput(0), *fcb, E::kSUM);\n\n    fc1->getOutput(0)->setName(NAMES[1]);\n    network->markOutput(*fc1->getOutput(0));\n    // Build engine\n#if TRT_VERSION >= 8000\n    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, WORKSPACE_SIZE);\n    IHostMemory* mem = builder->buildSerializedNetwork(*network, *config);\n    ICudaEngine* engine = 
runtime->deserializeCudaEngine(mem->data(), mem->size());\n    delete network;\n#else\n    builder->setMaxBatchSize(N);\n    config->setMaxWorkspaceSize(WORKSPACE_SIZE);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    network->destroy();\n#endif\n\n    std::cout << \"build finished\\n\";\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)mem.second.values);\n    }\n\n    return engine;\n}\n\nvoid APIToModel(int32_t N, IRuntime* runtime, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(N, runtime, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n#if TRT_VERSION >= 8000\n    delete engine;\n    delete config;\n    delete builder;\n#else\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n#endif\n}\n\nstd::vector<std::vector<float>> doInference(IExecutionContext& context, void* input, int64_t batchSize) {\n    const auto& engine = context.getEngine();\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n    std::vector<void*> buffers;\n\n#if TRT_VERSION >= 8000\n    const int32_t nIO = engine.getNbIOTensors();\n#else\n    const int32_t nIO = engine.getNbBindings();\n#endif\n\n    buffers.resize(nIO);\n    for (auto i = 0; i < nIO; ++i) {\n        std::size_t size = 0;\n#if TRT_VERSION >= 8000\n        auto* tensor_name = engine.getIOTensorName(i);\n        auto s = getSize(engine.getTensorDataType(tensor_name));\n        size = s * batchSize * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n   
     context.setTensorAddress(tensor_name, buffers[i]);\n#else\n        const int32_t idx = engine.getBindingIndex(NAMES[i]);\n        auto s = getSize(engine.getBindingDataType(idx));\n        assert(idx == i);\n        size = s * batchSize * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n#endif\n    }\n\n    // Call enqueue outside assert() so inference still runs when NDEBUG is defined\n#if TRT_VERSION >= 8000\n    const bool enqueued = context.enqueueV3(stream);\n#else\n    const bool enqueued = context.enqueueV2(buffers.data(), stream, nullptr);\n#endif\n    if (!enqueued) {\n        std::cerr << \"failed to enqueue inference\\n\";\n        std::abort();\n    }\n\n    std::vector<std::vector<float>> prob;\n    for (int i = 1; i < nIO; ++i) {\n        std::vector<float> tmp(batchSize * SIZES[i], std::nanf(\"\"));\n        std::size_t size = batchSize * SIZES[i] * sizeof(float);\n        CHECK(cudaMemcpyAsync(tmp.data(), buffers[i], size, cudaMemcpyDeviceToHost, stream));\n        // Move rather than copy: the async copy targets this buffer until the stream is synchronized\n        prob.emplace_back(std::move(tmp));\n    }\n    CHECK(cudaStreamSynchronize(stream));\n\n    for (auto& buffer : buffers) {\n        CHECK(cudaFree(buffer));\n    }\n    CHECK(cudaStreamDestroy(stream));\n    return prob;\n}\n\nint main(int argc, char** argv) {\n    checkTrtEnv();\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\\n\";\n        std::cerr << \"./googlenet -s   // serialize model to plan file\\n\";\n        std::cerr << \"./googlenet -d   // deserialize plan file and run inference\\n\";\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    char* trtModelStream{nullptr};\n    std::streamsize size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, runtime, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(ENGINE_PATH, std::ios::binary | std::ios::trunc);\n        if (!p) {\n            std::cerr << \"could not open 
plan output file\\n\";\n            return -1;\n        }\n        if (modelStream->size() > static_cast<std::size_t>(std::numeric_limits<std::streamsize>::max())) {\n            std::cerr << \"this model is too large to serialize\\n\";\n            return -1;\n        }\n        const auto* data_ptr = reinterpret_cast<const char*>(modelStream->data());\n        auto data_size = static_cast<std::streamsize>(modelStream->size());\n        p.write(data_ptr, data_size);\n#if TRT_VERSION >= 8000\n        delete modelStream;\n#else\n        modelStream->destroy();\n#endif\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(ENGINE_PATH, std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        return 1;\n    }\n\n#if TRT_VERSION >= 8000\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n#else\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n#endif\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n\n    const std::string img_path = \"../assets/cats.jpg\";\n    void* input = nullptr;\n    std::vector<float> flat_img;\n    cv::Mat img = cv::imread(img_path, cv::IMREAD_COLOR);\n\n    if constexpr (TRT_PREPROCESS) {\n        // for simplicity, resize image on cpu side\n        cv::resize(img, img, cv::Size(INPUT_W, INPUT_H), 0, 0, cv::INTER_LINEAR);\n        input = static_cast<void*>(img.data);\n    } else {\n        flat_img = preprocess_img(img, true, mean, stdv, N, INPUT_H, INPUT_W);\n        input = flat_img.data();\n    }\n    assert(input);\n\n    for (int32_t i = 0; i < 100; ++i) {\n    
    auto _start = std::chrono::system_clock::now();\n        auto prob = doInference(*context, input, 1);\n        auto _end = std::chrono::system_clock::now();\n        auto _time = std::chrono::duration_cast<std::chrono::microseconds>(_end - _start).count();\n        std::cout << \"Execution time: \" << _time << \"us\\n\";\n\n        for (const auto& vector : prob) {\n            int idx = 0;\n            for (auto v : vector) {\n                std::cout << std::setprecision(4) << v << \", \" << std::flush;\n                if (++idx > 20) {\n                    std::cout << \"\\n====\\n\";\n                    break;\n                }\n            }\n        }\n\n        if (i == 99) {\n            std::cout << \"prediction result:\\n\";\n            auto labels = loadImagenetLabelMap(LABELS_PATH);\n            int _top = 0;\n            for (auto& [idx, logits] : topk(prob[0], 3)) {\n                std::cout << \"Top: \" << _top++ << \" idx: \" << idx << \", logits: \" << logits\n                          << \", label: \" << labels[idx] << \"\\n\";\n            }\n        }\n    }\n    delete[] trtModelStream;\n\n#if TRT_VERSION >= 8000\n    delete context;\n    delete engine;\n    delete runtime;\n#else\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n#endif\n\n    return 0;\n}\n"
  },
  {
    "path": "googlenet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <cstdint>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include <utility>\n#include \"NvInferRuntime.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(std::move(prefix)), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) noexcept\n        : mOutput(other.mOutput), mPrefix(std::move(other.mPrefix)), mShouldLog(other.mShouldLog) {}\n\n    ~LogStreamConsumerBuffer() override {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing 
the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    int sync() override {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mBuffer(stream, std::move(prefix), shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other) noexcept\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   private:\n    struct TestInfo;\n\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult : std::uint8_t {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << '\\n';\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! 
This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, TestInfo info)\n            : mStarted(started), mName(std::move(info.name)), mCmdline(std::move(info.cmdline)) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom{false, TestInfo{name, cmdline}};\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! 
\\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    [[nodiscard]] Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    struct TestInfo {\n        std::string name;\n        std::string cmdline;\n    };\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << '\\n';\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kVERBOSE};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINFO};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kWARNING};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kERROR};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINTERNAL_ERROR};\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "googlenet/macros.h",
    "content": "#pragma once\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#define TRT_VERSION \\\n    ((NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + (NV_TENSORRT_PATCH * 10) + NV_TENSORRT_BUILD)\n\n#if TRT_VERSION < 7220\n#error \"TensorRT >= 7.2.2 is required for this demo.\"\n#endif\n\n#if TRT_VERSION >= 8000\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n"
  },
  {
    "path": "googlenet/utils.h",
    "content": "#pragma once\n#include <cuda_runtime_api.h>\n#include <algorithm>\n#include <cassert>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\nconstexpr const std::size_t WORKSPACE_SIZE = 16 << 20;\n\n#define CHECK(status)                                     \\\n    do {                                                  \\\n        auto ret = (status);                              \\\n        if (ret != cudaSuccess) {                         \\\n            std::cerr << \"Cuda failure: \" << ret << \"\\n\"; \\\n            std::abort();                                 \\\n        }                                                 \\\n    } while (0)\n\nstatic void checkTrtEnv(int device = 0) {\n#if TRT_VERSION < 8000\n    CHECK(cudaGetDevice(&device));\n    cudaDeviceProp prop{};\n    CHECK(cudaGetDeviceProperties(&prop, device));\n    const int sm = prop.major * 10 + prop.minor;\n    if (sm > 86) {\n        std::cerr << \"TensorRT < 8 does not support SM > 86 on this GPU.\";\n        std::abort();\n    }\n#endif\n}\n\n/**\n * @brief TensorRT weight files have a simple space delimited format:\n * [type] [size] <data x size in hex>\n * \n * @param file input weight file path\n * @return std::map<std::string, nvinfer1::Weights> \n */\nstatic auto loadWeights(const std::string& file) {\n    std::cout << \"Loading weights: \" << file << \"\\n\";\n    std::map<std::string, nvinfer1::Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n\n        // Read name and type of blob\n        std::string 
name;\n        input >> name >> std::dec >> wt.count;\n\n        // Load blob\n        auto* val = new uint32_t[wt.count];\n        input >> std::hex;\n        for (auto x = 0ll; x < wt.count; ++x) {\n            input >> val[x];\n        }\n        wt.values = val;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\n/**\n * @brief a preprocess function aligning with ImageNet preprocess in torchvision, only support 3-channel image\n * \n * @param img opencv image with BGR layout\n * @param bgr2rgb whether to convert BGR to RGB\n * @param mean subtract mean\n * @param std divide std\n * @param n batch size\n * @param h resize height\n * @param w resize width\n * @return std::vector<float> contiguous flatten image data in float32 type\n */\nstatic std::vector<float> preprocess_img(cv::Mat& img, bool bgr2rgb, const std::array<const float, 3>& mean,\n                                         const std::array<const float, 3>& std, int n, int h, int w) {\n    const auto c = img.channels();\n    const auto size = c * h * w;\n    if (c != 3) {\n        std::cerr << \"this demo only supports 3 channel input image.\\n\";\n        std::abort();\n    }\n    if (bgr2rgb) {\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    }\n    cv::resize(img, img, cv::Size(w, h), 0, 0, cv::INTER_LINEAR);\n    img.convertTo(img, CV_32FC3, 1.f / 255);\n    img = (img - cv::Scalar(mean[0], mean[1], mean[2])) / cv::Scalar(std[0], std[1], std[2]);\n    std::vector<float> chw(static_cast<std::size_t>(n) * c * h * w, 0.f);\n\n    // fill all batch with the same input image\n    for (int i = 0; i < n; ++i) {\n        for (int y = 0; y < h; ++y) {\n            for (int x = 0; x < w; ++x) {\n                const cv::Vec3f v = img.at<cv::Vec3f>(y, x);\n                chw[i * size + 0 * h * w + y * w + x] = v[0];\n                chw[i * size + 1 * h * w + y * w + x] = v[1];\n                chw[i * size + 2 * h * w + y * w + x] = v[2];\n            }\n        }\n    }\n    
return chw;\n}\n\nstatic auto topk(const std::vector<float>& v, int k) -> std::vector<std::pair<int, float>> {\n    if (k <= 0)\n        return {};\n    // Clamp k to the number of elements so the partial sort never runs past idx.end()\n    auto stride = std::min<std::ptrdiff_t>(k, static_cast<int64_t>(v.size()));\n\n    std::vector<int> idx(v.size());\n    std::iota(idx.begin(), idx.end(), 0);\n\n    std::partial_sort(idx.begin(), idx.begin() + stride, idx.end(), [&](int a, int b) { return v[a] > v[b]; });\n\n    std::vector<std::pair<int, float>> out;\n    out.reserve(stride);\n    for (auto i = 0; i < stride; ++i)\n        out.emplace_back(idx[i], v[idx[i]]);\n    return out;\n}\n\nstatic std::map<int, std::string> loadImagenetLabelMap(const std::string& path) {\n    std::map<int, std::string> labels;\n    std::ifstream in(path);\n    if (!in.is_open()) {\n        return labels;\n    }\n    std::string line;\n    while (std::getline(in, line)) {\n        auto colon = line.find(':');\n        if (colon == std::string::npos) {\n            continue;\n        }\n        auto first_quote = line.find('\\'', colon);\n        if (first_quote == std::string::npos) {\n            continue;\n        }\n        auto second_quote = line.find('\\'', first_quote + 1);\n        if (second_quote == std::string::npos) {\n            continue;\n        }\n        int idx = std::stoi(line.substr(0, colon));\n        labels[idx] = line.substr(first_quote + 1, second_quote - first_quote - 1);\n    }\n    return labels;\n}\n\nstatic ILayer* addTransformLayer(INetworkDefinition* network, ITensor& input, bool bgr2rgb,\n                                 const std::array<const float, 3>& mean, const std::array<const float, 3>& std) {\n    struct ScaleParams {\n        std::array<float, 3> shift;\n        std::array<float, 3> scale;\n    };\n    static std::vector<std::unique_ptr<ScaleParams>> gScaleParams;\n    auto params = std::make_unique<ScaleParams>();\n    params->shift = {-mean[0] / std[0], -mean[1] / std[1], -mean[2] / std[2]};\n    params->scale = {1.f / (std[0] * 
255.f), 1.f / (std[1] * 255.f), 1.f / (std[2] * 255.f)};\n\n    static const Weights empty{DataType::kFLOAT, nullptr, 0ll};\n    const Weights shift{DataType::kFLOAT, params->shift.data(), 3ll};\n    const Weights scale{DataType::kFLOAT, params->scale.data(), 3ll};\n\n    gScaleParams.emplace_back(std::move(params));\n\n    ITensor* in = &input;\n    if (input.getType() != DataType::kFLOAT) {\n#if TRT_VERSION >= 8000\n        auto* cast = network->addCast(input, DataType::kFLOAT);\n        assert(cast);\n        cast->setName(\"Cast to FP32\");\n        in = cast->getOutput(0);\n#else\n        auto* identity = network->addIdentity(input);\n        assert(identity);\n        identity->setName(\"Convert to FP32\");\n        identity->setOutputType(0, DataType::kFLOAT);\n        in = identity->getOutput(0);\n#endif\n    }\n    // Convert from NHWC to NCHW\n    auto* perm = network->addShuffle(*in);\n    assert(perm);\n    perm->setName(\"NHWC -> NCHW\");\n    perm->setFirstTranspose(Permutation{0, 3, 1, 2});\n\n    // Convert from BGR to RGB (optional)\n    ITensor* data{nullptr};\n    if (bgr2rgb) {\n        auto add_slice = [&](int c, const char* name) -> ITensor* {\n            auto dims = perm->getOutput(0)->getDimensions();\n            Dims4 start = {0, c, 0, 0}, stride = {1, 1, 1, 1};\n            Dims4 size = {dims.d[0], 1, dims.d[2], dims.d[3]};\n            auto* _slice = network->addSlice(*perm->getOutput(0), start, size, stride);\n            _slice->setName(name);\n            assert(_slice && _slice->getNbOutputs() == 1);\n            return _slice->getOutput(0);\n        };\n        std::array<ITensor*, 3> channels = {add_slice(2, \"R\"), add_slice(1, \"G\"), add_slice(0, \"B\")};\n        auto* cat = network->addConcatenation(channels.data(), 3);\n        assert(cat);\n        cat->setName(\"RGB\");\n        cat->setAxis(1);\n        data = cat->getOutput(0);\n    } else {\n        data = perm->getOutput(0);\n    }\n\n    // Normalize\n    auto* trans 
= network->addScale(*data, ScaleMode::kCHANNEL, shift, scale, empty);\n    assert(trans);\n    trans->setName(\"mean & std\");\n#if TRT_VERSION >= 8000\n    trans->setChannelAxis(1);\n#endif\n    return trans;\n}\n\nstatic size_t getSize(DataType dt) {\n    switch (dt) {\n#if TRT_VERSION >= 8510\n        case DataType::kUINT8:\n#endif\n        case DataType::kINT8:\n            return sizeof(int8_t);\n        case DataType::kFLOAT:\n            return sizeof(float);\n        case DataType::kHALF:\n            return sizeof(int16_t);\n        case DataType::kINT32:\n            return sizeof(int32_t);\n        default: {\n            std::cerr << \"Unsupported data type\\n\";\n            std::abort();\n        }\n    }\n}\n"
  },
  {
    "path": "hrnet/hrnet-image-classification/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(hrnet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(hrnet ${PROJECT_SOURCE_DIR}/hrnet.cpp)\ntarget_link_libraries(hrnet nvinfer)\ntarget_link_libraries(hrnet cudart)\ntarget_link_libraries(hrnet ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "hrnet/hrnet-image-classification/README.md",
    "content": "# HRNet\n\nThe Pytorch implementation is [HRNet-Image-Classification](https://github.com/HRNet/HRNet-Image-Classification).  The implemented model is **HRNet-W18-C-Small-v2** \n\n\n## How to Run\n\n* 1. generate .wts\n\n  Download code and model from [HRNet-Image-Classification](https://github.com/HRNet/HRNet-Image-Classification) and config your environments.\n\n  Put `demo.py`  in the `YOUR_ROOT_DIR\\HRNet-Image-Classification\\tools `  folder, set `savewts in  main()` as `True`, and run, the .wts will be generated.\n\n* 2. cmake and make\n\n  ```\n  mkdir build\n  cd build\n  cmake ..\n  make\n  sudo ./hrnet -s             // serialize model to plan file i.e. 'hrnet.engine'\n  sudo ./hrnet -d  ../samples // deserialize plan file and run inference, the images in samples will be processed.\n  ```\n\n## Result\n\nThe test img:\n\n![](https://user-images.githubusercontent.com/20653176/93732833-ac103200-fc05-11ea-88ff-6f59f316a377.JPEG)\n\nPytorch Result:\n\n![image-20200921115119593](https://user-images.githubusercontent.com/20653176/93731787-225e6580-fc01-11ea-9578-393079cd1873.png)\n\nTRT Result:\n\n![image-20200921114959069](https://user-images.githubusercontent.com/20653176/93731788-238f9280-fc01-11ea-954f-2debc20e102a.png)\n"
  },
  {
    "path": "hrnet/hrnet-image-classification/common.hpp",
    "content": "#pragma once\n\n#include <fstream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <opencv2/opencv.hpp>\n#include <dirent.h>\n#include \"NvInfer.h\"\n#include \"NvInferPlugin.h\"\n#include \"cuda_runtime_api.h\"\n\nusing namespace nvinfer1;\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{ DataType::kFLOAT, nullptr, 0 };\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = 
DataType::kFLOAT;\n\n        // Load blob: each value is stored as a 32-bit hex word\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    //std::cout << \"len \" << len << std::endl;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{ DataType::kFLOAT, scval, len };\n\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{ DataType::kFLOAT, shval, len };\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{ DataType::kFLOAT, pval, len };\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* convBnLeaky(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int p, std::string convname, 
std::string bnname, bool bias = false) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n    IConvolutionLayer* conv1;\n    //Dims dim;\n    if (!bias)\n    {\n        conv1 = network->addConvolutionNd(input, outch, DimsHW{ ksize, ksize }, weightMap[convname + \".weight\"], emptywts);\n    }\n    else\n    {\n        conv1 = network->addConvolutionNd(input, outch, DimsHW{ ksize, ksize }, weightMap[convname + \".weight\"], weightMap[convname + \".bias\"]);\n    }\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ s, s });\n    conv1->setPaddingNd(DimsHW{ p, p });\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), bnname, 1e-4);\n    auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    return lr;\n}\n\nIActivationLayer* ResBlock2Conv(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, inch, DimsHW{ 1, 1 }, weightMap[lname + \".conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ stride, stride });\n    conv1->setPaddingNd(DimsHW{ 0, 0 });\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn1\", 1e-5);\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    ///\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), inch, DimsHW{ 3, 3 }, weightMap[lname + \".conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{ stride, stride });\n    conv2->setPaddingNd(DimsHW{ 1, 1 });\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".bn2\", 1e-5);\n\n    IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n    //////\n    
IConvolutionLayer* conv3 = network->addConvolutionNd(*relu2->getOutput(0), outch, DimsHW{ 1, 1 }, weightMap[lname + \".conv3.weight\"], emptywts);\n    assert(conv3);\n    conv3->setStrideNd(DimsHW{ stride, stride });\n    conv3->setPaddingNd(DimsHW{ 0, 0 });\n\n    IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \".bn3\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    if (inch != outch) {\n        IConvolutionLayer* conv4 = network->addConvolutionNd(input, outch, DimsHW{ 1, 1 }, weightMap[lname + \".downsample.0.weight\"], emptywts);\n        assert(conv4);\n        conv4->setStrideNd(DimsHW{ stride, stride });\n        conv4->setPaddingNd(DimsHW{ 0, 0 });\n        IScaleLayer* bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \".downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn4->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    else {\n        ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer* relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\nIActivationLayer* ResBlock(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n    // in 256 out 64\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ 1, 1 }, weightMap[lname + \".conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ stride, stride });\n    conv1->setPaddingNd(DimsHW{ 0, 0 });\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    ///\n    IConvolutionLayer* conv2 = 
network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{ 3, 3 }, weightMap[lname + \".conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{ stride, stride });\n    conv2->setPaddingNd(DimsHW{ 1, 1 });\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".bn2\", 1e-5);\n\n    IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n    //////\n    IConvolutionLayer* conv3 = network->addConvolutionNd(*relu2->getOutput(0), inch, DimsHW{ 1, 1 }, weightMap[lname + \".conv3.weight\"], emptywts);\n    assert(conv3);\n    conv3->setStrideNd(DimsHW{ stride, stride });\n    conv3->setPaddingNd(DimsHW{ 0, 0 });\n\n    IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \".bn3\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\nIActivationLayer* liteResBlock(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, std::string lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n    // in 256 out 64\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ 3, 3 }, weightMap[lname + \".conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ 1, 1 });\n    conv1->setPaddingNd(DimsHW{ 1, 1 });\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    ///\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{ 3, 3 }, weightMap[lname + \".conv2.weight\"], emptywts);\n    assert(conv2);\n    
conv2->setStrideNd(DimsHW{ 1, 1 });\n    conv2->setPaddingNd(DimsHW{ 1, 1 });\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".bn2\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    ew1 = network->addElementWise(input, *bn2->getOutput(0), ElementWiseOperation::kSUM);\n\n    IActivationLayer* relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\nILayer* netAddUpsample(INetworkDefinition* network, ITensor* input, int inputChannels, int stride){\n    nvinfer1::Dims inpDims = input->getDimensions();\n    assert(inpDims.nbDims == 3); // chw\n    assert(inpDims.d[1] == inpDims.d[2]);\n    int h = inpDims.d[1];\n    int w = inpDims.d[2];\n    // add pre multiply matrix as a constant\n    /*\n    kSPATIA Elements correspond to different spatial data.\n\n    kCHANNEL Elements correspond to different channels.\n    */\n    nvinfer1::Dims preDims{ 3,\n                           {1, stride * h, w},\n                           {nvinfer1::DimensionType::kCHANNEL,\n                            nvinfer1::DimensionType::kSPATIAL,\n                            nvinfer1::DimensionType::kSPATIAL} };\n    int size = stride * h * w;\n    nvinfer1::Weights preMul{ nvinfer1::DataType::kFLOAT, nullptr, size };\n    float* preWt = new float[size];\n    /* (2*h * w)\n    [ [1, 0, ..., 0],\n      [1, 0, ..., 0],\n      [0, 1, ..., 0],\n      [0, 1, ..., 0],\n      ...,\n      ...,\n      [0, 0, ..., 1],\n      [0, 0, ..., 1] ]\n    */\n    for (int i = 0, idx = 0; i < h; ++i)\n    {\n        for (int s = 0; s < stride; ++s)\n        {\n            for (int j = 0; j < w; ++j, ++idx)\n            {\n                preWt[idx] = (i == j) ? 
1.0 : 0.0;\n            }\n        }\n    }\n    preMul.values = preWt;\n    nvinfer1::IConstantLayer* preM = network->addConstant(preDims, preMul);\n    assert(preM != nullptr);\n    //std::string preLayerName = \"preMul_\" + std::to_string(layerIdx);\n    //preM->setName(preLayerName.c_str());\n    // add post multiply matrix as a constant\n    nvinfer1::Dims postDims{ 3,\n                            {1, h, stride * w},\n                            {nvinfer1::DimensionType::kCHANNEL,\n                             nvinfer1::DimensionType::kSPATIAL,\n                             nvinfer1::DimensionType::kSPATIAL} };\n    size = stride * h * w;\n    nvinfer1::Weights postMul{ nvinfer1::DataType::kFLOAT, nullptr, size };\n    float* postWt = new float[size];\n    /* (h * 2*w)\n    [ [1, 1, 0, 0, ..., 0, 0],\n      [0, 0, 1, 1, ..., 0, 0],\n      ...,\n      ...,\n      [0, 0, 0, 0, ..., 1, 1] ]\n    */\n    for (int i = 0, idx = 0; i < h; ++i)\n    {\n        for (int j = 0; j < stride * w; ++j, ++idx)\n        {\n            postWt[idx] = (j / stride == i) ? 1.0 : 0.0;\n        }\n    }\n    postMul.values = postWt;\n    nvinfer1::IConstantLayer* post_m = network->addConstant(postDims, postMul);\n    assert(post_m != nullptr);\n    // add matrix multiply layers for upsampling\n    nvinfer1::IMatrixMultiplyLayer* mm1\n        = network->addMatrixMultiply(*preM->getOutput(0),\n            nvinfer1::MatrixOperation::kNONE, *input,\n            nvinfer1::MatrixOperation::kNONE);\n    assert(mm1 != nullptr);\n    nvinfer1::IMatrixMultiplyLayer* mm2\n        = network->addMatrixMultiply(*mm1->getOutput(0),\n            nvinfer1::MatrixOperation::kNONE,\n            *post_m->getOutput(0),\n            nvinfer1::MatrixOperation::kNONE);\n    assert(mm2 != nullptr);\n    return mm2;\n}\n\n"
  },
  {
    "path": "hrnet/hrnet-image-classification/demo.py",
    "content": "# ------------------------------------------------------------------------------\n# ------------------------------------------------------------------------------\n# Copyright (c) Microsoft\n# Licensed under the MIT License.\n# Written by Bin Xiao (Bin.Xiao@microsoft.com)\n# Modified by Ke Sun (sunk@mail.ustc.edu.cn)\n# ------------------------------------------------------------------------------\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport argparse\nimport os\nimport sys\nimport shutil\nimport pprint\n\nimport torch\nimport torch.nn.parallel\nimport torch.backends.cudnn as cudnn\nimport torch.optim\nimport torch.utils.data\nimport torch.utils.data.distributed\nimport torchvision.datasets as datasets\nimport torchvision.transforms as transforms\n\nimport _init_paths\nimport models\nfrom config import config\nfrom config import update_config\nfrom core.function import validate\nfrom utils.modelsummary import get_model_summary\nfrom utils.utils import create_logger\nfrom core.evaluate import accuracy\n\nimport cv2\nimport numpy as np\nfrom PIL import Image\nimport struct\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Train keypoints network')\n    \n    parser.add_argument('--cfg',\n                        help='experiment configure file name',\n                        default=r\"E:\\LearningCodes\\GithubRepo\\HRNet-Image-Classification\\experiments\\cls_hrnet_w18_small_v2_sgd_lr5e-2_wd1e-4_bs32_x100.yaml\",\n                        type=str)\n    parser.add_argument('--modelDir',\n                        help='model directory',\n                        type=str,\n                        default='')\n    parser.add_argument('--logDir',\n                        help='log directory',\n                        type=str,\n                        default='')\n    parser.add_argument('--dataDir',\n                        help='data directory',\n           
             type=str,\n                        default='')\n    parser.add_argument('--testModel',\n                        help='testModel',\n                        type=str,\n                        default=r'E:\\LearningCodes\\GithubRepo\\HRNet-Image-Classification\\hrnet_w18_small_model_v2.pth')\n    parser.add_argument('--testImg',\n                    help='imgs',\n                    type=str,\n                    default=r'E:\\Datasets\\tiny-imagenet-200\\tiny-imagenet-200\\val\\images\\val_41.JPEG')\n    args = parser.parse_args()\n    update_config(config, args)\n\n    return args\n\ndef main():\n    savewts = False\n    args = parse_args()\n\n    logger, final_output_dir, tb_log_dir = create_logger(\n        config, args.cfg, 'demo')\n\n    logger.info(pprint.pformat(args))\n    logger.info(pprint.pformat(config))\n\n    # cudnn related setting\n    cudnn.benchmark = config.CUDNN.BENCHMARK\n    torch.backends.cudnn.deterministic = config.CUDNN.DETERMINISTIC\n    torch.backends.cudnn.enabled = config.CUDNN.ENABLED\n\n    # eval() evaluates a string expression and returns its value.\n    model = eval('models.'+config.MODEL.NAME+'.get_cls_net')(\n        config)\n\n    model.load_state_dict(torch.load(args.testModel))\n\n    if savewts:\n        f = open('HRNetClassify.wts', 'w')\n        f.write('{}\\n'.format(len(model.state_dict().keys())))\n        for k, v in model.state_dict().items():\n            vr = v.reshape(-1).cpu().numpy()\n            f.write('{} {} '.format(k, len(vr)))\n            for vv in vr:\n                f.write(' ')\n                f.write(struct.pack('>f', float(vv)).hex())\n            f.write('\\n')\n        exit(0)\n    # load img\n    image = cv2.imread(args.testImg) #BGR 0-255 hwc\n    #im = Image.open(args.testImg)\n    #print(im.getpixel((0,0)))  ## 0-255\n    #resize\n    # config.MODEL.IMAGE_SIZE[0]\n    resized_img = cv2.resize(image, (config.MODEL.IMAGE_SIZE[0], config.MODEL.IMAGE_SIZE[1]))\n    resized_img = cv2.cvtColor(resized_img, 
cv2.COLOR_BGR2RGB) #RGB\n    # normalize\n    mean = [0.485, 0.456, 0.406]\n    std = [0.229, 0.224, 0.225]\n    inp_image = ((resized_img/255. - mean) / std).astype(np.float32) # R-0.485  B-\n    inp_image = inp_image.transpose(2, 0, 1) # chw\n    inp_image = torch.from_numpy(inp_image).unsqueeze(0) # to_tensor\n    model.eval()\n    output = model(inp_image)\n    #print(output)\n\n    _, pred = output.topk(1)\n    pred = pred.t()\n    print(pred)\nif __name__ == \"__main__\":\n    main()"
  },
  {
    "path": "hrnet/hrnet-image-classification/hrnet.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include \"common.hpp\"\n#include \"logging.h\"\n\nstatic Logger gLogger;\n#define DEVICE 0  // GPU id\n#define BATCH_SIZE 1\n\nconst char* INPUT_BLOB_NAME = \"image\";\nconst char* OUTPUT_BLOB_NAME = \"output\";\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\n// Creat the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{ 3, INPUT_H, INPUT_W });\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"E:\\\\LearningCodes\\\\GithubRepo\\\\HRNet-Image-Classification\\\\tools\\\\HRNetClassify.wts\");\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n\n    auto id_993 = convBnLeaky(network, weightMap, *data, 64, 3, 2, 1, \"conv1\", \"bn1\");  //conv1.weight \n    auto id_996 = convBnLeaky(network, weightMap, *id_993->getOutput(0), 64, 3, 2, 1, \"conv2\", \"bn2\");  //conv1.weight                                                                                 //Res\n    // IActivationLayer* ResBlock(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    auto id_1008 = ResBlock2Conv(network, weightMap, *id_996->getOutput(0), 64, 256, 1, \"layer1.0\");\n    auto id_1018 = ResBlock(network, weightMap, *id_1008->getOutput(0), 256, 64, 1, \"layer1.1\");\n\n    // transition1-1\n    auto id_1021 = convBnLeaky(network, weightMap, *id_1018->getOutput(0), 18, 3, 1, 1, \"transition1.0.0\", \"transition1.0.1\");\n    auto id_1031 = 
liteResBlock(network, weightMap, *id_1021->getOutput(0), 18, \"stage2.0.branches.0.0\");\n    auto id_1038 = liteResBlock(network, weightMap, *id_1031->getOutput(0), 18, \"stage2.0.branches.0.1\");\n    // second branch (stride 2, 36 channels)\n    auto id_1024 = convBnLeaky(network, weightMap, *id_1018->getOutput(0), 36, 3, 2, 1, \"transition1.1.0.0\", \"transition1.1.0.1\");\n    auto id_1045 = liteResBlock(network, weightMap, *id_1024->getOutput(0), 36, \"stage2.0.branches.1.0\");\n    auto id_1052 = liteResBlock(network, weightMap, *id_1045->getOutput(0), 36, \"stage2.0.branches.1.1\");\n\n    // conv+bn+upsample\n    IConvolutionLayer* id_1053 = network->addConvolutionNd(*id_1052->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage2.0.fuse_layers.0.1.0.weight\"], emptywts);\n    assert(id_1053);\n    id_1053->setStrideNd(DimsHW{ 1, 1 });\n    id_1053->setPaddingNd(DimsHW{ 0, 0 });\n\n    IScaleLayer* id_1054 = addBatchNorm2d(network, weightMap, *id_1053->getOutput(0), \"stage2.0.fuse_layers.0.1.1\", 1e-5);\n    ILayer* id_1083 = netAddUpsample(network, id_1054->getOutput(0), 18, 2);\n    IElementWiseLayer* id_1084 = network->addElementWise(*id_1083->getOutput(0), *id_1038->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1085 = network->addActivation(*id_1084->getOutput(0), ActivationType::kRELU);\n\n    // transition1-2\n    IConvolutionLayer* id_1086 = network->addConvolutionNd(*id_1038->getOutput(0), 36, DimsHW{ 3, 3 }, weightMap[\"stage2.0.fuse_layers.1.0.0.0.weight\"], emptywts);\n    assert(id_1086);\n    id_1086->setStrideNd(DimsHW{ 2, 2 });\n    id_1086->setPaddingNd(DimsHW{ 1, 1 });\n\n    IScaleLayer* id_1087 = addBatchNorm2d(network, weightMap, *id_1086->getOutput(0), \"stage2.0.fuse_layers.1.0.0.1\", 1e-5);\n    IElementWiseLayer* id_1088 = network->addElementWise(*id_1087->getOutput(0), *id_1052->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1089 = network->addActivation(*id_1088->getOutput(0), ActivationType::kRELU);\n\n    
///////////////////////////////////\n    // transition2-1  stage_3\n    auto id_1099 = liteResBlock(network, weightMap, *id_1085->getOutput(0), 18, \"stage3.0.branches.0.0\");\n    auto id_1106 = liteResBlock(network, weightMap, *id_1099->getOutput(0), 18, \"stage3.0.branches.0.1\");\n    // transition2-2  stage_3\n    auto id_1113 = liteResBlock(network, weightMap, *id_1089->getOutput(0), 36, \"stage3.0.branches.1.0\");\n    auto id_1120 = liteResBlock(network, weightMap, *id_1113->getOutput(0), 36, \"stage3.0.branches.1.1\");\n    // transition2-3  stage_3\n    auto id_1092 = convBnLeaky(network, weightMap, *id_1089->getOutput(0), 72, 3, 2, 1, \"transition2.2.0.0\", \"transition2.2.0.1\");\n    auto id_1127 = liteResBlock(network, weightMap, *id_1092->getOutput(0), 72, \"stage3.0.branches.2.0\");\n    auto id_1134 = liteResBlock(network, weightMap, *id_1127->getOutput(0), 72, \"stage3.0.branches.2.1\");\n\n    // fuse layers\n    //conv bn up\n    IConvolutionLayer* id_1135 = network->addConvolutionNd(*id_1120->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage3.0.fuse_layers.0.1.0.weight\"], emptywts);\n    assert(id_1135);\n    id_1135->setStrideNd(DimsHW{ 1, 1 });\n    id_1135->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1136 = addBatchNorm2d(network, weightMap, *id_1135->getOutput(0), \"stage3.0.fuse_layers.0.1.1\", 1e-5);\n    ILayer* id_1165 = netAddUpsample(network, id_1136->getOutput(0), 18, 2);\n    IElementWiseLayer* id_1166 = network->addElementWise(*id_1165->getOutput(0), *id_1106->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1167 = network->addConvolutionNd(*id_1134->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage3.0.fuse_layers.0.2.0.weight\"], emptywts);\n    assert(id_1167);\n    id_1167->setStrideNd(DimsHW{ 1, 1 });\n    id_1167->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1168 = addBatchNorm2d(network, weightMap, *id_1167->getOutput(0), \"stage3.0.fuse_layers.0.2.1\", 1e-5);\n    ILayer* id_1197 = 
netAddUpsample(network, id_1168->getOutput(0), 18, 4);\n    IElementWiseLayer* id_1198 = network->addElementWise(*id_1166->getOutput(0), *id_1197->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1199 = network->addActivation(*id_1198->getOutput(0), ActivationType::kRELU);\n\n    //2\n    IConvolutionLayer* id_1200 = network->addConvolutionNd(*id_1106->getOutput(0), 36, DimsHW{ 3, 3 }, weightMap[\"stage3.0.fuse_layers.1.0.0.0.weight\"], emptywts);\n    assert(id_1200);\n    id_1200->setStrideNd(DimsHW{ 2, 2 });\n    id_1200->setPaddingNd(DimsHW{ 1, 1 });\n\n    IScaleLayer* id_1201 = addBatchNorm2d(network, weightMap, *id_1200->getOutput(0), \"stage3.0.fuse_layers.1.0.0.1\", 1e-5);\n    IElementWiseLayer* id_1202 = network->addElementWise(*id_1201->getOutput(0), *id_1120->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1203 = network->addConvolutionNd(*id_1134->getOutput(0), 36, DimsHW{ 1, 1 }, weightMap[\"stage3.0.fuse_layers.1.2.0.weight\"], emptywts);\n    assert(id_1203);\n    id_1203->setStrideNd(DimsHW{ 1, 1 });\n    id_1203->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1204 = addBatchNorm2d(network, weightMap, *id_1203->getOutput(0), \"stage3.0.fuse_layers.1.2.1\", 1e-5);\n    ILayer* id_1233 = netAddUpsample(network, id_1204->getOutput(0), 36, 2);\n    IElementWiseLayer* id_1234 = network->addElementWise(*id_1202->getOutput(0), *id_1233->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1235 = network->addActivation(*id_1234->getOutput(0), ActivationType::kRELU);\n\n    // 3\n    IConvolutionLayer* id_1236 = network->addConvolutionNd(*id_1106->getOutput(0), 18, DimsHW{ 3, 3 }, weightMap[\"stage3.0.fuse_layers.2.0.0.0.weight\"], emptywts);\n    assert(id_1236);\n    id_1236->setStrideNd(DimsHW{ 2, 2 });\n    id_1236->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1237 = addBatchNorm2d(network, weightMap, *id_1236->getOutput(0), \"stage3.0.fuse_layers.2.0.0.1\", 1e-5);\n    
IActivationLayer* id_1238 = network->addActivation(*id_1237->getOutput(0), ActivationType::kRELU);\n\n    IConvolutionLayer* id_1239 = network->addConvolutionNd(*id_1238->getOutput(0), 72, DimsHW{ 3, 3 }, weightMap[\"stage3.0.fuse_layers.2.0.1.0.weight\"], emptywts);\n    assert(id_1239);\n    id_1239->setStrideNd(DimsHW{ 2, 2 });\n    id_1239->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1240 = addBatchNorm2d(network, weightMap, *id_1239->getOutput(0), \"stage3.0.fuse_layers.2.0.1.1\", 1e-5);\n\n    IConvolutionLayer* id_1241 = network->addConvolutionNd(*id_1120->getOutput(0), 72, DimsHW{ 3, 3 }, weightMap[\"stage3.0.fuse_layers.2.1.0.0.weight\"], emptywts);\n    assert(id_1241);\n    id_1241->setStrideNd(DimsHW{ 2, 2 });\n    id_1241->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1242 = addBatchNorm2d(network, weightMap, *id_1241->getOutput(0), \"stage3.0.fuse_layers.2.1.0.1\", 1e-5);\n\n    IElementWiseLayer* id_1243 = network->addElementWise(*id_1240->getOutput(0), *id_1242->getOutput(0), ElementWiseOperation::kSUM);\n    IElementWiseLayer* id_1244 = network->addElementWise(*id_1243->getOutput(0), *id_1134->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1245 = network->addActivation(*id_1244->getOutput(0), ActivationType::kRELU);\n\n    auto id_1252 = liteResBlock(network, weightMap, *id_1199->getOutput(0), 18, \"stage3.1.branches.0.0\");\n    auto id_1259 = liteResBlock(network, weightMap, *id_1252->getOutput(0), 18, \"stage3.1.branches.0.1\");\n    auto id_1266 = liteResBlock(network, weightMap, *id_1235->getOutput(0), 36, \"stage3.1.branches.1.0\");\n    auto id_1273 = liteResBlock(network, weightMap, *id_1266->getOutput(0), 36, \"stage3.1.branches.1.1\");\n    auto id_1280 = liteResBlock(network, weightMap, *id_1245->getOutput(0), 72, \"stage3.1.branches.2.0\");\n    auto id_1287 = liteResBlock(network, weightMap, *id_1280->getOutput(0), 72, \"stage3.1.branches.2.1\");\n\n    // fuse layers\n    //1: 1259+up(1273)+up(1287)\n  
  IConvolutionLayer* id_1288 = network->addConvolutionNd(*id_1273->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage3.1.fuse_layers.0.1.0.weight\"], emptywts);\n    assert(id_1288);\n    id_1288->setStrideNd(DimsHW{ 1, 1 });\n    id_1288->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1289 = addBatchNorm2d(network, weightMap, *id_1288->getOutput(0), \"stage3.1.fuse_layers.0.1.1\", 1e-5);\n    ILayer* id_1318 = netAddUpsample(network, id_1289->getOutput(0), 18, 2);\n    IElementWiseLayer* id_1319 = network->addElementWise(*id_1259->getOutput(0), *id_1318->getOutput(0), ElementWiseOperation::kSUM);\n    //1-2 up(1287)  conv bn up\n    IConvolutionLayer* id_1320 = network->addConvolutionNd(*id_1287->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage3.1.fuse_layers.0.2.0.weight\"], emptywts);\n    assert(id_1320);\n    id_1320->setStrideNd(DimsHW{ 1, 1 });\n    id_1320->setPaddingNd(DimsHW{ 0, 0 });\n\n    IScaleLayer* id_1321 = addBatchNorm2d(network, weightMap, *id_1320->getOutput(0), \"stage3.1.fuse_layers.0.2.1\", 1e-5);\n    ILayer* id_1350 = netAddUpsample(network, id_1321->getOutput(0), 18, 4);\n    IElementWiseLayer* id_1351 = network->addElementWise(*id_1319->getOutput(0), *id_1350->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1352 = network->addActivation(*id_1351->getOutput(0), ActivationType::kRELU);\n\n\n    //2: conv(1259)+1273 + up(1287)\n    IConvolutionLayer* id_1353 = network->addConvolutionNd(*id_1259->getOutput(0), 36, DimsHW{ 3, 3 }, weightMap[\"stage3.1.fuse_layers.1.0.0.0.weight\"], emptywts);\n    assert(id_1353);\n    id_1353->setStrideNd(DimsHW{ 2, 2 });\n    id_1353->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1354 = addBatchNorm2d(network, weightMap, *id_1353->getOutput(0), \"stage3.1.fuse_layers.1.0.0.1\", 1e-5);\n    IElementWiseLayer* id_1355 = network->addElementWise(*id_1354->getOutput(0), *id_1273->getOutput(0), ElementWiseOperation::kSUM);\n\n\n    IConvolutionLayer* id_1356 = 
network->addConvolutionNd(*id_1287->getOutput(0), 36, DimsHW{ 1, 1 }, weightMap[\"stage3.1.fuse_layers.1.2.0.weight\"], emptywts);\n    assert(id_1356);\n    id_1356->setStrideNd(DimsHW{ 1, 1 });\n    id_1356->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1357 = addBatchNorm2d(network, weightMap, *id_1356->getOutput(0), \"stage3.1.fuse_layers.1.2.1\", 1e-5);\n    ILayer* id_1386 = netAddUpsample(network, id_1357->getOutput(0), 36, 2);\n    IElementWiseLayer* id_1387 = network->addElementWise(*id_1355->getOutput(0), *id_1386->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1388 = network->addActivation(*id_1387->getOutput(0), ActivationType::kRELU);\n\n    //3 conv(1259)+conv(1273)+1287\n    IConvolutionLayer* id_1389 = network->addConvolutionNd(*id_1259->getOutput(0), 18, DimsHW{ 3, 3 }, weightMap[\"stage3.1.fuse_layers.2.0.0.0.weight\"], emptywts);\n    assert(id_1389);\n    id_1389->setStrideNd(DimsHW{ 2, 2 });\n    id_1389->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1390 = addBatchNorm2d(network, weightMap, *id_1389->getOutput(0), \"stage3.1.fuse_layers.2.0.0.1\", 1e-5);\n    IActivationLayer* id_1391 = network->addActivation(*id_1390->getOutput(0), ActivationType::kRELU);\n\n    IConvolutionLayer* id_1392 = network->addConvolutionNd(*id_1391->getOutput(0), 72, DimsHW{ 3, 3 }, weightMap[\"stage3.1.fuse_layers.2.0.1.0.weight\"], emptywts);\n    assert(id_1392);\n    id_1392->setStrideNd(DimsHW{ 2, 2 });\n    id_1392->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1393 = addBatchNorm2d(network, weightMap, *id_1392->getOutput(0), \"stage3.1.fuse_layers.2.0.1.1\", 1e-5);\n\n    IConvolutionLayer* id_1394 = network->addConvolutionNd(*id_1273->getOutput(0), 72, DimsHW{ 3, 3 }, weightMap[\"stage3.1.fuse_layers.2.1.0.0.weight\"], emptywts);\n    assert(id_1394);\n    id_1394->setStrideNd(DimsHW{ 2, 2 });\n    id_1394->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1395 = addBatchNorm2d(network, weightMap, *id_1394->getOutput(0), 
\"stage3.1.fuse_layers.2.1.0.1\", 1e-5);\n\n    IElementWiseLayer* id_1396 = network->addElementWise(*id_1393->getOutput(0), *id_1395->getOutput(0), ElementWiseOperation::kSUM);\n    IElementWiseLayer* id_1397 = network->addElementWise(*id_1396->getOutput(0), *id_1287->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1398 = network->addActivation(*id_1397->getOutput(0), ActivationType::kRELU);\n\n    auto id_1405 = liteResBlock(network, weightMap, *id_1352->getOutput(0), 18, \"stage3.2.branches.0.0\");\n    auto id_1412 = liteResBlock(network, weightMap, *id_1405->getOutput(0), 18, \"stage3.2.branches.0.1\");\n    auto id_1419 = liteResBlock(network, weightMap, *id_1388->getOutput(0), 36, \"stage3.2.branches.1.0\");\n    auto id_1426 = liteResBlock(network, weightMap, *id_1419->getOutput(0), 36, \"stage3.2.branches.1.1\");\n    auto id_1433 = liteResBlock(network, weightMap, *id_1398->getOutput(0), 72, \"stage3.2.branches.2.0\");\n    auto id_1440 = liteResBlock(network, weightMap, *id_1433->getOutput(0), 72, \"stage3.2.branches.2.1\");\n\n\n    // 1412 + up(1426)+up(1440) \n    IConvolutionLayer* id_1441 = network->addConvolutionNd(*id_1426->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage3.2.fuse_layers.0.1.0.weight\"], emptywts);\n    assert(id_1441);\n    id_1441->setStrideNd(DimsHW{ 1, 1 });\n    id_1441->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1442 = addBatchNorm2d(network, weightMap, *id_1441->getOutput(0), \"stage3.2.fuse_layers.0.1.1\", 1e-5);\n    ILayer* id_1471 = netAddUpsample(network, id_1442->getOutput(0), 18, 2);\n    IElementWiseLayer* id_1472 = network->addElementWise(*id_1412->getOutput(0), *id_1471->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1473 = network->addConvolutionNd(*id_1440->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage3.2.fuse_layers.0.2.0.weight\"], emptywts);\n    assert(id_1473);\n    id_1473->setStrideNd(DimsHW{ 1, 1 });\n    id_1473->setPaddingNd(DimsHW{ 0, 0 
});\n    IScaleLayer* id_1474 = addBatchNorm2d(network, weightMap, *id_1473->getOutput(0), \"stage3.2.fuse_layers.0.2.1\", 1e-5);\n    ILayer* id_1503 = netAddUpsample(network, id_1474->getOutput(0), 18, 4);\n\n    IElementWiseLayer* id_1504 = network->addElementWise(*id_1472->getOutput(0), *id_1503->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1505 = network->addActivation(*id_1504->getOutput(0), ActivationType::kRELU);\n\n    // conv(1412)+1426+up(1440)\n    IConvolutionLayer* id_1506 = network->addConvolutionNd(*id_1412->getOutput(0), 36, DimsHW{ 3, 3 }, weightMap[\"stage3.2.fuse_layers.1.0.0.0.weight\"], emptywts);\n    assert(id_1506);\n    id_1506->setStrideNd(DimsHW{ 2, 2 });\n    id_1506->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1507 = addBatchNorm2d(network, weightMap, *id_1506->getOutput(0), \"stage3.2.fuse_layers.1.0.0.1\", 1e-5);\n    IElementWiseLayer* id_1508 = network->addElementWise(*id_1507->getOutput(0), *id_1426->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1509 = network->addConvolutionNd(*id_1440->getOutput(0), 36, DimsHW{ 1, 1 }, weightMap[\"stage3.2.fuse_layers.1.2.0.weight\"], emptywts);\n    assert(id_1509);\n    id_1509->setStrideNd(DimsHW{ 1, 1 });\n    id_1509->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1510 = addBatchNorm2d(network, weightMap, *id_1509->getOutput(0), \"stage3.2.fuse_layers.1.2.1\", 1e-5);\n    ILayer* id_1539 = netAddUpsample(network, id_1510->getOutput(0), 36, 2);\n    IElementWiseLayer* id_1540 = network->addElementWise(*id_1508->getOutput(0), *id_1539->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1541 = network->addActivation(*id_1540->getOutput(0), ActivationType::kRELU);\n\n    // conv(1412)+conv(1426)+1440\n    IConvolutionLayer* id_1542 = network->addConvolutionNd(*id_1412->getOutput(0), 18, DimsHW{ 3, 3 }, weightMap[\"stage3.2.fuse_layers.2.0.0.0.weight\"], emptywts);\n    assert(id_1542);\n    
id_1542->setStrideNd(DimsHW{ 2, 2 });\n    id_1542->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1543 = addBatchNorm2d(network, weightMap, *id_1542->getOutput(0), \"stage3.2.fuse_layers.2.0.0.1\", 1e-5);\n    IActivationLayer* id_1544 = network->addActivation(*id_1543->getOutput(0), ActivationType::kRELU);\n\n    IConvolutionLayer* id_1545 = network->addConvolutionNd(*id_1544->getOutput(0), 72, DimsHW{ 3, 3 }, weightMap[\"stage3.2.fuse_layers.2.0.1.0.weight\"], emptywts);\n    assert(id_1545);\n    id_1545->setStrideNd(DimsHW{ 2, 2 });\n    id_1545->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1546 = addBatchNorm2d(network, weightMap, *id_1545->getOutput(0), \"stage3.2.fuse_layers.2.0.1.1\", 1e-5);\n\n    IConvolutionLayer* id_1547 = network->addConvolutionNd(*id_1426->getOutput(0), 72, DimsHW{ 3, 3 }, weightMap[\"stage3.2.fuse_layers.2.1.0.0.weight\"], emptywts);\n    assert(id_1547);\n    id_1547->setStrideNd(DimsHW{ 2, 2 });\n    id_1547->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1548 = addBatchNorm2d(network, weightMap, *id_1547->getOutput(0), \"stage3.2.fuse_layers.2.1.0.1\", 1e-5);\n\n    IElementWiseLayer* id_1549 = network->addElementWise(*id_1546->getOutput(0), *id_1548->getOutput(0), ElementWiseOperation::kSUM);\n    IElementWiseLayer* id_1550 = network->addElementWise(*id_1549->getOutput(0), *id_1440->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1551 = network->addActivation(*id_1550->getOutput(0), ActivationType::kRELU);\n\n    auto id_1561 = liteResBlock(network, weightMap, *id_1505->getOutput(0), 18, \"stage4.0.branches.0.0\");\n    auto id_1568 = liteResBlock(network, weightMap, *id_1561->getOutput(0), 18, \"stage4.0.branches.0.1\");\n    auto id_1575 = liteResBlock(network, weightMap, *id_1541->getOutput(0), 36, \"stage4.0.branches.1.0\");\n    auto id_1582 = liteResBlock(network, weightMap, *id_1575->getOutput(0), 36, \"stage4.0.branches.1.1\");\n    auto id_1589 = liteResBlock(network, weightMap, 
*id_1551->getOutput(0), 72, \"stage4.0.branches.2.0\");\n    auto id_1596 = liteResBlock(network, weightMap, *id_1589->getOutput(0), 72, \"stage4.0.branches.2.1\");\n\n    // transition\n    auto id_1554 = convBnLeaky(network, weightMap, *id_1551->getOutput(0), 144, 3, 2, 1, \"transition3.3.0.0\", \"transition3.3.0.1\");\n    auto id_1603 = liteResBlock(network, weightMap, *id_1554->getOutput(0), 144, \"stage4.0.branches.3.0\");\n    auto id_1610 = liteResBlock(network, weightMap, *id_1603->getOutput(0), 144, \"stage4.0.branches.3.1\");\n\n    // 1568+up(1582)+up(1596)+up(1610)\n    IConvolutionLayer* id_1611 = network->addConvolutionNd(*id_1582->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage4.0.fuse_layers.0.1.0.weight\"], emptywts);\n    assert(id_1611);\n    id_1611->setStrideNd(DimsHW{ 1, 1 });\n    id_1611->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1612 = addBatchNorm2d(network, weightMap, *id_1611->getOutput(0), \"stage4.0.fuse_layers.0.1.1\", 1e-5);\n    ILayer* id_1641 = netAddUpsample(network, id_1612->getOutput(0), 18, 2);\n    IElementWiseLayer* id_1642 = network->addElementWise(*id_1641->getOutput(0), *id_1568->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1643 = network->addConvolutionNd(*id_1596->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage4.0.fuse_layers.0.2.0.weight\"], emptywts);\n    assert(id_1643);\n    id_1643->setStrideNd(DimsHW{ 1, 1 });\n    id_1643->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1644 = addBatchNorm2d(network, weightMap, *id_1643->getOutput(0), \"stage4.0.fuse_layers.0.2.1\", 1e-5);\n    ILayer* id_1673 = netAddUpsample(network, id_1644->getOutput(0), 18, 4);\n    IElementWiseLayer* id_1674 = network->addElementWise(*id_1642->getOutput(0), *id_1673->getOutput(0), ElementWiseOperation::kSUM);\n\n    //3\n    IConvolutionLayer* id_1675 = network->addConvolutionNd(*id_1610->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage4.0.fuse_layers.0.3.0.weight\"], emptywts);\n    
assert(id_1675);\n    id_1675->setStrideNd(DimsHW{ 1, 1 });\n    id_1675->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1676 = addBatchNorm2d(network, weightMap, *id_1675->getOutput(0), \"stage4.0.fuse_layers.0.3.1\", 1e-5);\n    ILayer* id_1705 = netAddUpsample(network, id_1676->getOutput(0), 18, 8);\n    IElementWiseLayer* id_1706 = network->addElementWise(*id_1705->getOutput(0), *id_1674->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1707 = network->addActivation(*id_1706->getOutput(0), ActivationType::kRELU);\n\n    // conv(1568)+1582+up(1596)+up(1610)\n    IConvolutionLayer* id_1708 = network->addConvolutionNd(*id_1568->getOutput(0), 36, DimsHW{ 3, 3 }, weightMap[\"stage4.0.fuse_layers.1.0.0.0.weight\"], emptywts);\n    assert(id_1708);\n    id_1708->setStrideNd(DimsHW{ 2, 2 });\n    id_1708->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1709 = addBatchNorm2d(network, weightMap, *id_1708->getOutput(0), \"stage4.0.fuse_layers.1.0.0.1\", 1e-5);\n    IElementWiseLayer* id_1710 = network->addElementWise(*id_1709->getOutput(0), *id_1582->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1711 = network->addConvolutionNd(*id_1596->getOutput(0), 36, DimsHW{ 1, 1 }, weightMap[\"stage4.0.fuse_layers.1.2.0.weight\"], emptywts);\n    assert(id_1711);\n    id_1711->setStrideNd(DimsHW{ 1, 1 });\n    id_1711->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1712 = addBatchNorm2d(network, weightMap, *id_1711->getOutput(0), \"stage4.0.fuse_layers.1.2.1\", 1e-5);\n    ILayer* id_1741 = netAddUpsample(network, id_1712->getOutput(0), 36, 2);\n    IElementWiseLayer* id_1742 = network->addElementWise(*id_1741->getOutput(0), *id_1710->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1743 = network->addConvolutionNd(*id_1610->getOutput(0), 36, DimsHW{ 1, 1 }, weightMap[\"stage4.0.fuse_layers.1.3.0.weight\"], emptywts);\n    assert(id_1743);\n    id_1743->setStrideNd(DimsHW{ 1, 1 });\n    
id_1743->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1744 = addBatchNorm2d(network, weightMap, *id_1743->getOutput(0), \"stage4.0.fuse_layers.1.3.1\", 1e-5);\n    ILayer* id_1773 = netAddUpsample(network, id_1744->getOutput(0), 36, 4);\n    IElementWiseLayer* id_1774 = network->addElementWise(*id_1773->getOutput(0), *id_1742->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1775 = network->addActivation(*id_1774->getOutput(0), ActivationType::kRELU);\n\n    // conv(1568)+conv(1582)+1596+up(1610)\n    IConvolutionLayer* id_1776 = network->addConvolutionNd(*id_1568->getOutput(0), 18, DimsHW{ 3, 3 }, weightMap[\"stage4.0.fuse_layers.2.0.0.0.weight\"], emptywts);\n    assert(id_1776);\n    id_1776->setStrideNd(DimsHW{ 2, 2 });\n    id_1776->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1777 = addBatchNorm2d(network, weightMap, *id_1776->getOutput(0), \"stage4.0.fuse_layers.2.0.0.1\", 1e-5);\n    IActivationLayer* id_1778 = network->addActivation(*id_1777->getOutput(0), ActivationType::kRELU);\n\n    IConvolutionLayer* id_1779 = network->addConvolutionNd(*id_1778->getOutput(0), 72, DimsHW{ 3, 3 }, weightMap[\"stage4.0.fuse_layers.2.0.1.0.weight\"], emptywts);\n    assert(id_1779);\n    id_1779->setStrideNd(DimsHW{ 2, 2 });\n    id_1779->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1780 = addBatchNorm2d(network, weightMap, *id_1779->getOutput(0), \"stage4.0.fuse_layers.2.0.1.1\", 1e-5);\n\n    IConvolutionLayer* id_1781 = network->addConvolutionNd(*id_1582->getOutput(0), 72, DimsHW{ 3, 3 }, weightMap[\"stage4.0.fuse_layers.2.1.0.0.weight\"], emptywts);\n    assert(id_1781);\n    id_1781->setStrideNd(DimsHW{ 2, 2 });\n    id_1781->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1782 = addBatchNorm2d(network, weightMap, *id_1781->getOutput(0), \"stage4.0.fuse_layers.2.1.0.1\", 1e-5);\n\n    IElementWiseLayer* id_1783 = network->addElementWise(*id_1780->getOutput(0), *id_1782->getOutput(0), ElementWiseOperation::kSUM);\n    
IElementWiseLayer* id_1784 = network->addElementWise(*id_1783->getOutput(0), *id_1596->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1785 = network->addConvolutionNd(*id_1610->getOutput(0), 72, DimsHW{ 1, 1 }, weightMap[\"stage4.0.fuse_layers.2.3.0.weight\"], emptywts);\n    assert(id_1785);\n    id_1785->setStrideNd(DimsHW{ 1, 1 });\n    id_1785->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1786 = addBatchNorm2d(network, weightMap, *id_1785->getOutput(0), \"stage4.0.fuse_layers.2.3.1\", 1e-5);\n    ILayer* id_1815 = netAddUpsample(network, id_1786->getOutput(0), 72, 2);\n\n    IElementWiseLayer* id_1816 = network->addElementWise(*id_1784->getOutput(0), *id_1815->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1817 = network->addActivation(*id_1816->getOutput(0), ActivationType::kRELU);\n\n    // conv(1568)+conv(1582)+conv(1596)+(1610)\n    // 1568(cbr)1820(cbr)1823(cb)1825\n    IConvolutionLayer* id_1818 = network->addConvolutionNd(*id_1568->getOutput(0), 18, DimsHW{ 3, 3 }, weightMap[\"stage4.0.fuse_layers.3.0.0.0.weight\"], emptywts);\n    assert(id_1818);\n    id_1818->setStrideNd(DimsHW{ 2, 2 });\n    id_1818->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1819 = addBatchNorm2d(network, weightMap, *id_1818->getOutput(0), \"stage4.0.fuse_layers.3.0.0.1\", 1e-5);\n    IActivationLayer* id_1820 = network->addActivation(*id_1819->getOutput(0), ActivationType::kRELU);\n    IConvolutionLayer* id_1821 = network->addConvolutionNd(*id_1820->getOutput(0), 18, DimsHW{ 3, 3 }, weightMap[\"stage4.0.fuse_layers.3.0.1.0.weight\"], emptywts);\n    assert(id_1821);\n    id_1821->setStrideNd(DimsHW{ 2, 2 });\n    id_1821->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1822 = addBatchNorm2d(network, weightMap, *id_1821->getOutput(0), \"stage4.0.fuse_layers.3.0.1.1\", 1e-5);\n    IActivationLayer* id_1823 = network->addActivation(*id_1822->getOutput(0), ActivationType::kRELU);\n    IConvolutionLayer* id_1824 = 
network->addConvolutionNd(*id_1823->getOutput(0), 144, DimsHW{ 3, 3 }, weightMap[\"stage4.0.fuse_layers.3.0.2.0.weight\"], emptywts);\n    assert(id_1824);\n    id_1824->setStrideNd(DimsHW{ 2, 2 });\n    id_1824->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1825 = addBatchNorm2d(network, weightMap, *id_1824->getOutput(0), \"stage4.0.fuse_layers.3.0.2.1\", 1e-5);\n\n    // 1582(cbr)1828(cb)1830\n    IConvolutionLayer* id_1826 = network->addConvolutionNd(*id_1582->getOutput(0), 36, DimsHW{ 3, 3 }, weightMap[\"stage4.0.fuse_layers.3.1.0.0.weight\"], emptywts);\n    assert(id_1826);\n    id_1826->setStrideNd(DimsHW{ 2, 2 });\n    id_1826->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1827 = addBatchNorm2d(network, weightMap, *id_1826->getOutput(0), \"stage4.0.fuse_layers.3.1.0.1\", 1e-5);\n    IActivationLayer* id_1828 = network->addActivation(*id_1827->getOutput(0), ActivationType::kRELU);\n    IConvolutionLayer* id_1829 = network->addConvolutionNd(*id_1828->getOutput(0), 144, DimsHW{ 3, 3 }, weightMap[\"stage4.0.fuse_layers.3.1.1.0.weight\"], emptywts);\n    assert(id_1829);\n    id_1829->setStrideNd(DimsHW{ 2, 2 });\n    id_1829->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1830 = addBatchNorm2d(network, weightMap, *id_1829->getOutput(0), \"stage4.0.fuse_layers.3.1.1.1\", 1e-5);\n\n    IElementWiseLayer* id_1831 = network->addElementWise(*id_1830->getOutput(0), *id_1825->getOutput(0), ElementWiseOperation::kSUM);\n\n    // 1596(cb)1832\n    IConvolutionLayer* id_1832 = network->addConvolutionNd(*id_1596->getOutput(0), 144, DimsHW{ 3, 3 }, weightMap[\"stage4.0.fuse_layers.3.2.0.0.weight\"], emptywts);\n    assert(id_1832);\n    id_1832->setStrideNd(DimsHW{ 2, 2 });\n    id_1832->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1833 = addBatchNorm2d(network, weightMap, *id_1832->getOutput(0), \"stage4.0.fuse_layers.3.2.0.1\", 1e-5);\n\n    IElementWiseLayer* id_1834 = network->addElementWise(*id_1833->getOutput(0), *id_1831->getOutput(0), 
ElementWiseOperation::kSUM);\n    IElementWiseLayer* id_1835 = network->addElementWise(*id_1834->getOutput(0), *id_1610->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1836 = network->addActivation(*id_1835->getOutput(0), ActivationType::kRELU);\n\n    auto id_1843 = liteResBlock(network, weightMap, *id_1707->getOutput(0), 18, \"stage4.1.branches.0.0\");\n    auto id_1850 = liteResBlock(network, weightMap, *id_1843->getOutput(0), 18, \"stage4.1.branches.0.1\");\n    auto id_1857 = liteResBlock(network, weightMap, *id_1775->getOutput(0), 36, \"stage4.1.branches.1.0\");\n    auto id_1864 = liteResBlock(network, weightMap, *id_1857->getOutput(0), 36, \"stage4.1.branches.1.1\");\n    auto id_1871 = liteResBlock(network, weightMap, *id_1817->getOutput(0), 72, \"stage4.1.branches.2.0\");\n    auto id_1878 = liteResBlock(network, weightMap, *id_1871->getOutput(0), 72, \"stage4.1.branches.2.1\");\n    auto id_1885 = liteResBlock(network, weightMap, *id_1836->getOutput(0), 144, \"stage4.1.branches.3.0\");\n    auto id_1892 = liteResBlock(network, weightMap, *id_1885->getOutput(0), 144, \"stage4.1.branches.3.1\");\n\n    // 1850+up1864+up1878+up1892\n    IConvolutionLayer* id_1893 = network->addConvolutionNd(*id_1864->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage4.1.fuse_layers.0.1.0.weight\"], emptywts);\n    assert(id_1893);\n    id_1893->setStrideNd(DimsHW{ 1, 1 });\n    id_1893->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1894 = addBatchNorm2d(network, weightMap, *id_1893->getOutput(0), \"stage4.1.fuse_layers.0.1.1\", 1e-5);\n    ILayer* id_1923 = netAddUpsample(network, id_1894->getOutput(0), 18, 2);\n    IElementWiseLayer* id_1924 = network->addElementWise(*id_1850->getOutput(0), *id_1923->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1925 = network->addConvolutionNd(*id_1878->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage4.1.fuse_layers.0.2.0.weight\"], emptywts);\n    assert(id_1925);\n    
id_1925->setStrideNd(DimsHW{ 1, 1 });\n    id_1925->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1926 = addBatchNorm2d(network, weightMap, *id_1925->getOutput(0), \"stage4.1.fuse_layers.0.2.1\", 1e-5);\n    ILayer* id_1955 = netAddUpsample(network, id_1926->getOutput(0), 18, 4);\n    IElementWiseLayer* id_1956 = network->addElementWise(*id_1924->getOutput(0), *id_1955->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1957 = network->addConvolutionNd(*id_1892->getOutput(0), 18, DimsHW{ 1, 1 }, weightMap[\"stage4.1.fuse_layers.0.3.0.weight\"], emptywts);\n    assert(id_1957);\n    id_1957->setStrideNd(DimsHW{ 1, 1 });\n    id_1957->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_1958 = addBatchNorm2d(network, weightMap, *id_1957->getOutput(0), \"stage4.1.fuse_layers.0.3.1\", 1e-5);\n    ILayer* id_1987 = netAddUpsample(network, id_1958->getOutput(0), 18, 8);\n    IElementWiseLayer* id_1988 = network->addElementWise(*id_1956->getOutput(0), *id_1987->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_1989 = network->addActivation(*id_1988->getOutput(0), ActivationType::kRELU);\n\n    // conv1850+1864+up1878+up1892\n    IConvolutionLayer* id_1990 = network->addConvolutionNd(*id_1850->getOutput(0), 36, DimsHW{ 3, 3 }, weightMap[\"stage4.1.fuse_layers.1.0.0.0.weight\"], emptywts);\n    assert(id_1990);\n    id_1990->setStrideNd(DimsHW{ 2, 2 });\n    id_1990->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_1991 = addBatchNorm2d(network, weightMap, *id_1990->getOutput(0), \"stage4.1.fuse_layers.1.0.0.1\", 1e-5);\n    IElementWiseLayer* id_1992 = network->addElementWise(*id_1991->getOutput(0), *id_1864->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_1993 = network->addConvolutionNd(*id_1878->getOutput(0), 36, DimsHW{ 1, 1 }, weightMap[\"stage4.1.fuse_layers.1.2.0.weight\"], emptywts);\n    assert(id_1993);\n    id_1993->setStrideNd(DimsHW{ 1, 1 });\n    id_1993->setPaddingNd(DimsHW{ 0, 0 });\n    
IScaleLayer* id_1994 = addBatchNorm2d(network, weightMap, *id_1993->getOutput(0), \"stage4.1.fuse_layers.1.2.1\", 1e-5);\n    ILayer* id_2023 = netAddUpsample(network, id_1994->getOutput(0), 36, 2);\n    IElementWiseLayer* id_2024 = network->addElementWise(*id_1992->getOutput(0), *id_2023->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_2025 = network->addConvolutionNd(*id_1892->getOutput(0), 36, DimsHW{ 1, 1 }, weightMap[\"stage4.1.fuse_layers.1.3.0.weight\"], emptywts);\n    assert(id_2025);\n    id_2025->setStrideNd(DimsHW{ 1, 1 });\n    id_2025->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_2026 = addBatchNorm2d(network, weightMap, *id_2025->getOutput(0), \"stage4.1.fuse_layers.1.3.1\", 1e-5);\n    ILayer* id_2055 = netAddUpsample(network, id_2026->getOutput(0), 36, 4);\n    IElementWiseLayer* id_2056 = network->addElementWise(*id_2024->getOutput(0), *id_2055->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_2057 = network->addActivation(*id_2056->getOutput(0), ActivationType::kRELU);\n\n    //conv1850 + conv 1864 + 1878 + up1892\n    IConvolutionLayer* id_2058 = network->addConvolutionNd(*id_1850->getOutput(0), 18, DimsHW{ 3, 3 }, weightMap[\"stage4.1.fuse_layers.2.0.0.0.weight\"], emptywts);\n    assert(id_2058);\n    id_2058->setStrideNd(DimsHW{ 2, 2 });\n    id_2058->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_2059 = addBatchNorm2d(network, weightMap, *id_2058->getOutput(0), \"stage4.1.fuse_layers.2.0.0.1\", 1e-5);\n    IActivationLayer* id_2060 = network->addActivation(*id_2059->getOutput(0), ActivationType::kRELU);\n\n    IConvolutionLayer* id_2061 = network->addConvolutionNd(*id_2060->getOutput(0), 72, DimsHW{ 3, 3 }, weightMap[\"stage4.1.fuse_layers.2.0.1.0.weight\"], emptywts);\n    assert(id_2061);\n    id_2061->setStrideNd(DimsHW{ 2, 2 });\n    id_2061->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_2062 = addBatchNorm2d(network, weightMap, *id_2061->getOutput(0), 
\"stage4.1.fuse_layers.2.0.1.1\", 1e-5);\n\n    IConvolutionLayer* id_2063 = network->addConvolutionNd(*id_1864->getOutput(0), 72, DimsHW{ 3, 3 }, weightMap[\"stage4.1.fuse_layers.2.1.0.0.weight\"], emptywts);\n    assert(id_2063);\n    id_2063->setStrideNd(DimsHW{ 2, 2 });\n    id_2063->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_2064 = addBatchNorm2d(network, weightMap, *id_2063->getOutput(0), \"stage4.1.fuse_layers.2.1.0.1\", 1e-5);\n\n    IElementWiseLayer* id_2065 = network->addElementWise(*id_2062->getOutput(0), *id_2064->getOutput(0), ElementWiseOperation::kSUM);\n    IElementWiseLayer* id_2066 = network->addElementWise(*id_1878->getOutput(0), *id_2065->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* id_2067 = network->addConvolutionNd(*id_1892->getOutput(0), 72, DimsHW{ 1, 1 }, weightMap[\"stage4.1.fuse_layers.2.3.0.weight\"], emptywts);\n    assert(id_2067);\n    id_2067->setStrideNd(DimsHW{ 1, 1 });\n    id_2067->setPaddingNd(DimsHW{ 0, 0 });\n    IScaleLayer* id_2068 = addBatchNorm2d(network, weightMap, *id_2067->getOutput(0), \"stage4.1.fuse_layers.2.3.1\", 1e-5);\n    ILayer* id_2097 = netAddUpsample(network, id_2068->getOutput(0), 72, 2);\n\n    IElementWiseLayer* id_2098 = network->addElementWise(*id_2097->getOutput(0), *id_2066->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_2099 = network->addActivation(*id_2098->getOutput(0), ActivationType::kRELU);\n\n    // conv1850+conv1864+conv1878+1892\n    IConvolutionLayer* id_2100 = network->addConvolutionNd(*id_1850->getOutput(0), 18, DimsHW{ 3, 3 }, weightMap[\"stage4.1.fuse_layers.3.0.0.0.weight\"], emptywts);\n    assert(id_2100);\n    id_2100->setStrideNd(DimsHW{ 2, 2 });\n    id_2100->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_2101 = addBatchNorm2d(network, weightMap, *id_2100->getOutput(0), \"stage4.1.fuse_layers.3.0.0.1\", 1e-5);\n    IActivationLayer* id_2102 = network->addActivation(*id_2101->getOutput(0), ActivationType::kRELU);\n    
IConvolutionLayer* id_2103 = network->addConvolutionNd(*id_2102->getOutput(0), 18, DimsHW{ 3, 3 }, weightMap[\"stage4.1.fuse_layers.3.0.1.0.weight\"], emptywts);\n    assert(id_2103);\n    id_2103->setStrideNd(DimsHW{ 2, 2 });\n    id_2103->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_2104 = addBatchNorm2d(network, weightMap, *id_2103->getOutput(0), \"stage4.1.fuse_layers.3.0.1.1\", 1e-5);\n    IActivationLayer* id_2105 = network->addActivation(*id_2104->getOutput(0), ActivationType::kRELU);\n    IConvolutionLayer* id_2106 = network->addConvolutionNd(*id_2105->getOutput(0), 144, DimsHW{ 3, 3 }, weightMap[\"stage4.1.fuse_layers.3.0.2.0.weight\"], emptywts);\n    assert(id_2106);\n    id_2106->setStrideNd(DimsHW{ 2, 2 });\n    id_2106->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_2107 = addBatchNorm2d(network, weightMap, *id_2106->getOutput(0), \"stage4.1.fuse_layers.3.0.2.1\", 1e-5);\n\n    // \n    IConvolutionLayer* id_2108 = network->addConvolutionNd(*id_1864->getOutput(0), 36, DimsHW{ 3, 3 }, weightMap[\"stage4.1.fuse_layers.3.1.0.0.weight\"], emptywts);\n    assert(id_2108);\n    id_2108->setStrideNd(DimsHW{ 2, 2 });\n    id_2108->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_2109 = addBatchNorm2d(network, weightMap, *id_2108->getOutput(0), \"stage4.1.fuse_layers.3.1.0.1\", 1e-5);\n    IActivationLayer* id_2110 = network->addActivation(*id_2109->getOutput(0), ActivationType::kRELU);\n    IConvolutionLayer* id_2111 = network->addConvolutionNd(*id_2110->getOutput(0), 144, DimsHW{ 3, 3 }, weightMap[\"stage4.1.fuse_layers.3.1.1.0.weight\"], emptywts);\n    assert(id_2111);\n    id_2111->setStrideNd(DimsHW{ 2, 2 });\n    id_2111->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_2112 = addBatchNorm2d(network, weightMap, *id_2111->getOutput(0), \"stage4.1.fuse_layers.3.1.1.1\", 1e-5);\n\n    IElementWiseLayer* id_2113 = network->addElementWise(*id_2107->getOutput(0), *id_2112->getOutput(0), ElementWiseOperation::kSUM);\n\n    IConvolutionLayer* 
id_2114 = network->addConvolutionNd(*id_1878->getOutput(0), 144, DimsHW{ 3, 3 }, weightMap[\"stage4.1.fuse_layers.3.2.0.0.weight\"], emptywts);\n    assert(id_2114);\n    id_2114->setStrideNd(DimsHW{ 2, 2 });\n    id_2114->setPaddingNd(DimsHW{ 1, 1 });\n    IScaleLayer* id_2115 = addBatchNorm2d(network, weightMap, *id_2114->getOutput(0), \"stage4.1.fuse_layers.3.2.0.1\", 1e-5);\n\n    IElementWiseLayer* id_2116 = network->addElementWise(*id_2113->getOutput(0), *id_2115->getOutput(0), ElementWiseOperation::kSUM);\n    IElementWiseLayer* id_2117 = network->addElementWise(*id_2116->getOutput(0), *id_1892->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer* id_2118 = network->addActivation(*id_2117->getOutput(0), ActivationType::kRELU);\n\n    //res\n    auto id_2174 = ResBlock2Conv(network, weightMap, *id_2118->getOutput(0), 256, 1024, 1, \"incre_modules.3.0\");\n    auto id_2158 = ResBlock2Conv(network, weightMap, *id_2099->getOutput(0), 128, 512, 1, \"incre_modules.2.0\");\n    auto id_2142 = ResBlock2Conv(network, weightMap, *id_2057->getOutput(0), 64, 256, 1, \"incre_modules.1.0\");\n    auto id_2130 = ResBlock2Conv(network, weightMap, *id_1989->getOutput(0), 32, 128, 1, \"incre_modules.0.0\");\n\n    auto id_2145 = convBnLeaky(network, weightMap, *id_2130->getOutput(0), 256, 3, 2, 1, \"downsamp_modules.0.0\", \"downsamp_modules.0.1\", true);\n    IElementWiseLayer* id_2146 = network->addElementWise(*id_2145->getOutput(0), *id_2142->getOutput(0), ElementWiseOperation::kSUM);\n    auto id_2161 = convBnLeaky(network, weightMap, *id_2146->getOutput(0), 512, 3, 2, 1, \"downsamp_modules.1.0\", \"downsamp_modules.1.1\", true);\n    IElementWiseLayer* id_2162 = network->addElementWise(*id_2161->getOutput(0), *id_2158->getOutput(0), ElementWiseOperation::kSUM);\n    auto id_2177 = convBnLeaky(network, weightMap, *id_2162->getOutput(0), 1024, 3, 2, 1, \"downsamp_modules.2.0\", \"downsamp_modules.2.1\", true);\n    IElementWiseLayer* id_2178 = 
network->addElementWise(*id_2177->getOutput(0), *id_2174->getOutput(0), ElementWiseOperation::kSUM);\n\n    auto id_2181 = convBnLeaky(network, weightMap, *id_2178->getOutput(0), 2048, 1, 1, 0, \"final_layer.0\", \"final_layer.1\", true);\n    //   y = F.avg_pool2d(y, kernel_size=y.size()[2:]).view(y.size(0), -1)\n    auto pool = network->addPoolingNd(*id_2181->getOutput(0), PoolingType::kAVERAGE, DimsHW{ 7, 7 });\n    pool->setPaddingNd(DimsHW{ 0, 0 });\n    pool->setStrideNd(DimsHW{ 1, 1 });\n    // self.classifier = nn.Linear(2048, 1000)\n    IFullyConnectedLayer* out = network->addFullyConnected(*pool->getOutput(0), 1000, weightMap[\"classifier.weight\"], weightMap[\"classifier.bias\"]);\n    assert(out);\n    out->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*out->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize((1 << 30));  // 1G\n#ifdef USE_FP16\n    config->setFlag(BuilderFlag::kFP16);\n#endif\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid 
doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv) {\n\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{ nullptr };\n    size_t size{ 0 };\n    std::string engine_name = \"hrnet.engine\";\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{ nullptr };\n        
APIToModel(BATCH_SIZE, &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(engine_name, std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    }\n    else if (argc == 3 && std::string(argv[1]) == \"-d\") {\n        std::ifstream file(engine_name, std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n        else {\n            // Without the plan file there is nothing to deserialize\n            std::cerr << \"could not read plan file \" << engine_name << std::endl;\n            return -1;\n        }\n    }\n    else {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << argv[0] << \" -s  // serialize model to plan file\" << std::endl;\n        std::cerr << argv[0] << \" -d ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(argv[2], file_names) < 0) {\n        std::cout << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n    // prepare input data ---------------------------\n    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n    //    data[i] = 1.0;\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    /*\n    mean = [0.485, 0.456, 0.406]\n    std = [0.229, 0.224, 0.225]\n    
inp_image = ((resized_img/255. - mean) / std).astype(np.float32)\n    */\n    int fcount = 0;\n    for (int f = 0; f < (int)file_names.size(); f++) {\n        fcount++;\n        if (fcount < BATCH_SIZE && f + 1 != (int)file_names.size()) continue;\n        for (int b = 0; b < fcount; b++) {\n            cv::Mat img = cv::imread(std::string(argv[2]) + \"/\" + file_names[f - fcount + 1 + b]); // BGR\n            if (img.empty()) continue;\n            // cv::Mat pr_img = preprocess_img(img); // letterbox BGR to RGB\n            cv::Mat pr_img;\n            cv::resize(img, pr_img, cv::Size(INPUT_W, INPUT_H));\n            int i = 0;\n            for (int row = 0; row < INPUT_H; ++row) {\n                uchar* uc_pixel = pr_img.data + row * pr_img.step;\n                for (int col = 0; col < INPUT_W; ++col) {\n                    data[b * 3 * INPUT_H * INPUT_W + i] = ((float)uc_pixel[2] / 255.0 - 0.485) / 0.229; // R-0.485\n                    data[b * 3 * INPUT_H * INPUT_W + i + INPUT_H * INPUT_W] = ((float)uc_pixel[1] / 255.0 - 0.456) / 0.224;\n                    data[b * 3 * INPUT_H * INPUT_W + i + 2 * INPUT_H * INPUT_W] = ((float)uc_pixel[0] / 255.0 - 0.406) / 0.225;\n                    uc_pixel += 3;\n                    ++i;\n                }\n            }\n        }\n        // Run inference  \n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, BATCH_SIZE);\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"infer time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n        float maxp = 0;\n        int index = 0;\n        for (int b = 0; b < fcount; b++) {\n            for (int j = 0; j < 1000; ++j)\n            {\n                float p = prob[b * OUTPUT_SIZE + j];\n                if (p > maxp)\n                {\n                    maxp = p;\n                    index = j;\n                }\n            }\n        }\n   
     std::cout << \"out index: \" << index << std::endl;\n        fcount = 0;  // reset so the next batch starts counting from zero\n    }\n}"
  },
  {
    "path": "hrnet/hrnet-image-classification/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "hrnet/hrnet-semantic-segmentation/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(hrnetseg)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(hrnet ${PROJECT_SOURCE_DIR}/hrnet.cpp)\ntarget_link_libraries(hrnet nvinfer)\ntarget_link_libraries(hrnet cudart)\ntarget_link_libraries(hrnet ${OpenCV_LIBS})\n\n\nadd_executable(hrnet_ocr ${PROJECT_SOURCE_DIR}/hrnet_ocr.cpp)\ntarget_link_libraries(hrnet_ocr nvinfer)\ntarget_link_libraries(hrnet_ocr cudart)\ntarget_link_libraries(hrnet_ocr ${OpenCV_LIBS})\n\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "hrnet/hrnet-semantic-segmentation/README.md",
    "content": "# HRNet-Semantic-Segmentation\n\nThis repo implemtents [HRNet-Semantic-Segmentation-v1.1](https://github.com/HRNet/HRNet-Semantic-Segmentation/tree/pytorch-v1.1) and [HRNet-Semantic-Segmentation-OCR](https://github.com/HRNet/HRNet-Semantic-Segmentation/tree/HRNet-OCR).\n\n\n## How to Run\n### For HRNet-Semantic-Segmentation-v1.1\n1. generate .wts, use config `experiments/cityscapes/seg_hrnet_w48_train_512x1024_sgd_lr1e-2_wd5e-4_bs_12_epoch484.yaml` and pretrained weight `hrnet_w48_cityscapes_cls19_1024x2048_trainset.pth` as example. change `PRETRAINED` in `experiments/cityscapes/seg_hrnet_w48_train_512x1024_sgd_lr1e-2_wd5e-4_bs_12_epoch484.yaml` to `\"\"`.\n```\ncp gen_wts.py $HRNET--Semantic-Segmentation-PROJECT-ROOT/tools\ncd $HRNET--Semantic-Segmentation-PROJECT-ROOT\npython tools/gen_wts.py --cfg experiments/cityscapes/seg_hrnet_w48_train_512x1024_sgd_lr1e-2_wd5e-4_bs_12_epoch484.yaml --ckpt_path hrnet_w48_cityscapes_cls19_1024x2048_trainset.pth --save_path hrnet_w48.wts\ncp hrnet_w48.wts $HRNET-TENSORRT-ROOT\ncd $HRNET-TENSORRT-ROOT\n```\n2. cmake and make\n\n  ```\n  mkdir build\n  cd build\n  cmake ..\n  make\n  ```\n  first serialize model to plan file\n  ```\n  ./hrnet -s [.wts] [.engine] [small or 18 or 32 or 48] # small for W18-Small-v2, 18 for W18, etc.\n  ```\n  such as\n  ```\n  ./hrnet -s ../hrnet_w48.wts ./hrnet_w48.engine 48\n  ```\n  then deserialize plan file and run inference\n  ```\n  ./hrnet -d  [.engine] [image dir]\n  ```\n  such as \n  ```\n  ./hrnet -d  ./hrnet_w48.engine ../samples\n  ```\n### For HRNet-Semantic-Segmentation-OCR\n\n1. generate .wts, use config `experiments/cityscapes/seg_hrnet_ocr_w48_train_512x1024_sgd_lr1e-2_wd5e-4_bs_12_epoch484.yaml` and pretrained weight `hrnet_ocr_cs_8162_torch11.pth` as example. 
Change `PRETRAINED` in `experiments/cityscapes/seg_hrnet_ocr_w48_train_512x1024_sgd_lr1e-2_wd5e-4_bs_12_epoch484.yaml` to `\"\"`.\n```\ncp gen_wts.py $HRNET-OCR-PROJECT-ROOT/tools\ncd $HRNET-OCR-PROJECT-ROOT\npython tools/gen_wts.py --cfg experiments/cityscapes/seg_hrnet_ocr_w48_train_512x1024_sgd_lr1e-2_wd5e-4_bs_12_epoch484.yaml --ckpt_path hrnet_ocr_cs_8162_torch11.pth --save_path hrnet_ocr_w48.wts\ncp hrnet_ocr_w48.wts $HRNET-OCR-TENSORRT-ROOT\ncd $HRNET-OCR-TENSORRT-ROOT\n```\n2. Build with cmake and make\n\n  ```\n  mkdir build\n  cd build\n  cmake ..\n  make\n  ```\n  First, serialize the model to a plan file:\n  ```\n  ./hrnet_ocr -s [.wts] [.engine] [18 or 32 or 48]\n  ```\n  For example:\n  ```\n  ./hrnet_ocr -s ../hrnet_ocr_w48.wts ./hrnet_ocr_w48.engine 48\n  ```\n  Then deserialize the plan file and run inference:\n  ```\n  ./hrnet_ocr -d [.engine] [image dir]\n  ```\n  For example:\n  ```\n  ./hrnet_ocr -d ./hrnet_ocr_w48.engine ../samples\n  ```\n## Result\n\nTensorRT result:\n\n![trtcity](https://user-images.githubusercontent.com/20653176/103136469-a68e2080-46fb-11eb-9f05-06bad81c74b9.png)\n\nPyTorch result:\n\n![image-20201225171224159](https://user-images.githubusercontent.com/20653176/103131619-6cf9ed00-46dc-11eb-9369-4374abb65744.png)\n\n## Note\n\n* Some of the source code was changed for simplicity, but the original model can still be used.\n\n  All \"upsample\" ops in the source code are changed to `mode='bilinear', align_corners=True`.\n\n* Image preprocessing and postprocessing are built into the TensorRT engine.\n\n* Zero-copy memory is used to avoid explicit CPU/GPU memory copies.\n\n"
  },
  {
    "path": "hrnet/hrnet-semantic-segmentation/common.hpp",
    "content": "#pragma once\n\n#include <fstream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <opencv2/opencv.hpp>\n#include <dirent.h>\n#include \"NvInfer.h\"\n#include \"NvInferPlugin.h\"\n#include \"cuda_runtime_api.h\"\n\nusing namespace nvinfer1;\n\n#define CHECK(status)                                          \\\n    do                                                         \\\n    {                                                          \\\n        auto ret = (status);                                   \\\n        if (ret != 0)                                          \\\n        {                                                      \\\n            std::cerr << \"Cuda failure: \" << ret << std::endl; \\\n            abort();                                           \\\n        }                                                      \\\n    } while (0)\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nvoid debug_print(ITensor *input_tensor, std::string head)\n{\n    std::cout << head << \" : \";\n\n    for (int i = 0; i < input_tensor->getDimensions().nbDims; i++)\n    {\n        std::cout << input_tensor->getDimensions().d[i] << \" \";\n    }\n    std::cout << std::endl;\n}\nstd::map<std::string, 
Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count = 0;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob: one 32-bit word per value (sizeof(*val), not the pointer size)\n        uint32_t *val = reinterpret_cast<uint32_t *>(malloc(sizeof(*val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\ncv::Mat createLTU(int len)\n{\n    cv::Mat lookUpTable(1, 256, CV_8U);\n    uchar *p = lookUpTable.data;\n    for (int j = 0; j < 256; ++j)\n    {\n        p[j] = (j * (256 / len) > 255) ? 
uchar(255) : (uchar)(j * (256 / len));\n    }\n    return lookUpTable;\n}\nITensor *MeanStd(INetworkDefinition *network, ITensor *input, float *mean, float *std, bool div255)\n{\n    if (div255)\n    {\n        Weights Div_225{DataType::kFLOAT, nullptr, 3};\n        float *wgt = reinterpret_cast<float *>(malloc(sizeof(float) * 3));\n        for (int i = 0; i < 3; ++i)\n        {\n            wgt[i] = 255.0f;\n        }\n        Div_225.values = wgt;\n        IConstantLayer *d = network->addConstant(Dims3{3, 1, 1}, Div_225);\n        input = network->addElementWise(*input, *d->getOutput(0), ElementWiseOperation::kDIV)->getOutput(0);\n    }\n    Weights Mean{DataType::kFLOAT, nullptr, 3};\n    Mean.values = mean;\n    IConstantLayer *m = network->addConstant(Dims3{3, 1, 1}, Mean);\n    IElementWiseLayer *sub_mean = network->addElementWise(*input, *m->getOutput(0), ElementWiseOperation::kSUB);\n    if (std != nullptr)\n    {\n        Weights Std{DataType::kFLOAT, nullptr, 3};\n        Std.values = std;\n        IConstantLayer *s = network->addConstant(Dims3{3, 1, 1}, Std);\n        IElementWiseLayer *std_mean = network->addElementWise(*sub_mean->getOutput(0), *s->getOutput(0), ElementWiseOperation::kDIV);\n        return std_mean->getOutput(0);\n    }\n    else\n    {\n        return sub_mean->getOutput(0);\n    }\n}\n\nIScaleLayer *addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, std::string lname, float eps)\n{\n    float *gamma = (float *)weightMap[lname + \".weight\"].values;\n    float *beta = (float *)weightMap[lname + \".bias\"].values;\n    float *mean = (float *)weightMap[lname + \".running_mean\"].values;\n    float *var = (float *)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    //std::cout << \"len \" << len << std::endl;\n\n    float *scval = reinterpret_cast<float *>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++)\n    {\n      
  scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n\n    float *shval = reinterpret_cast<float *>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++)\n    {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float *>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++)\n    {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer *scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer *convBnRelu(INetworkDefinition *network,\n                   std::map<std::string, Weights> &weightMap,\n                   ITensor &input, int outch, int ksize, int s, int p,\n                   std::string convname, std::string bnname,\n                   bool relu = true,\n                   bool bias = false)\n{\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer *conv1;\n    //Dims dim;\n    if (!bias)\n    {\n        conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[convname + \".weight\"], emptywts);\n    }\n    else\n    {\n        conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[convname + \".weight\"], weightMap[convname + \".bias\"]);\n    }\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n    debug_print(conv1->getOutput(0), convname);\n    IScaleLayer *bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), bnname, 1e-5);\n    debug_print(bn1->getOutput(0), bnname);\n    if (relu)\n    {\n        auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n        return lr;\n  
  }\n    return bn1;\n}\n\nIActivationLayer *ResBlock2Conv(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, int inch, int outch, int stride, std::string lname)\n{\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer *conv1 = network->addConvolutionNd(input, inch, DimsHW{1, 1}, weightMap[lname + \".conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{stride, stride});\n    conv1->setPaddingNd(DimsHW{0, 0});\n    debug_print(conv1->getOutput(0), lname + \"_1\");\n    IScaleLayer *bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn1\", 1e-5);\n    IActivationLayer *relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    ///\n    IConvolutionLayer *conv2 = network->addConvolutionNd(*relu1->getOutput(0), inch, DimsHW{3, 3}, weightMap[lname + \".conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{stride, stride});\n    conv2->setPaddingNd(DimsHW{1, 1});\n    debug_print(conv2->getOutput(0), lname + \"_2\");\n    IScaleLayer *bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".bn2\", 1e-5);\n\n    IActivationLayer *relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n    //////\n    IConvolutionLayer *conv3 = network->addConvolutionNd(*relu2->getOutput(0), outch, DimsHW{1, 1}, weightMap[lname + \".conv3.weight\"], emptywts);\n    assert(conv3);\n    conv3->setStrideNd(DimsHW{stride, stride});\n    conv3->setPaddingNd(DimsHW{0, 0});\n    debug_print(conv3->getOutput(0), lname + \"_3\");\n    IScaleLayer *bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \".bn3\", 1e-5);\n\n    IElementWiseLayer *ew1;\n    if (inch != outch)\n    {\n        IConvolutionLayer *conv4 = network->addConvolutionNd(input, outch, DimsHW{1, 1}, weightMap[lname + \".downsample.0.weight\"], emptywts);\n        assert(conv4);\n        
conv4->setStrideNd(DimsHW{stride, stride});\n        conv4->setPaddingNd(DimsHW{0, 0});\n        debug_print(conv4->getOutput(0), lname + \"_4\");\n        IScaleLayer *bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \".downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn4->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    else\n    {\n        ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer *relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\nIActivationLayer *ResBlock(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, int inch, int outch, int stride, std::string lname)\n{\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    // in 256 out 64\n    IConvolutionLayer *conv1 = network->addConvolutionNd(input, outch, DimsHW{1, 1}, weightMap[lname + \".conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{stride, stride});\n    conv1->setPaddingNd(DimsHW{0, 0});\n    debug_print(conv1->getOutput(0), lname + \"_1\");\n    IScaleLayer *bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn1\", 1e-5);\n\n    IActivationLayer *relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    ///\n    IConvolutionLayer *conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{3, 3}, weightMap[lname + \".conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{stride, stride});\n    conv2->setPaddingNd(DimsHW{1, 1});\n    debug_print(conv2->getOutput(0), lname + \"_2\");\n    IScaleLayer *bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".bn2\", 1e-5);\n\n    IActivationLayer *relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n    //////\n    IConvolutionLayer 
*conv3 = network->addConvolutionNd(*relu2->getOutput(0), inch, DimsHW{1, 1}, weightMap[lname + \".conv3.weight\"], emptywts);\n    assert(conv3);\n    conv3->setStrideNd(DimsHW{stride, stride});\n    conv3->setPaddingNd(DimsHW{0, 0});\n    debug_print(conv3->getOutput(0), lname + \"_3\");\n    IScaleLayer *bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \".bn3\", 1e-5);\n\n    IElementWiseLayer *ew1;\n    ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    IActivationLayer *relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\nIActivationLayer *liteResBlock(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, int outch, std::string lname)\n{\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    // in 256 out 64\n    IConvolutionLayer *conv1 = network->addConvolutionNd(input, outch, DimsHW{3, 3}, weightMap[lname + \".conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{1, 1});\n    conv1->setPaddingNd(DimsHW{1, 1});\n    debug_print(conv1->getOutput(0), lname + \"_1\");\n    IScaleLayer *bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn1\", 1e-5);\n\n    IActivationLayer *relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    ///\n    IConvolutionLayer *conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{3, 3}, weightMap[lname + \".conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{1, 1});\n    conv2->setPaddingNd(DimsHW{1, 1});\n    debug_print(conv2->getOutput(0), lname + \"_2\");\n    IScaleLayer *bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".bn2\", 1e-5);\n\n    IElementWiseLayer *ew1;\n    ew1 = network->addElementWise(input, *bn2->getOutput(0), ElementWiseOperation::kSUM);\n    debug_print(ew1->getOutput(0), lname + 
\"_add\");\n    IActivationLayer *relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\nILayer *convBnAddRelu(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, ITensor &addinput, int outch, int ksize, int s, int p, std::string convname, std::string bnname, bool bias = false)\n{\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer *conv1;\n    //Dims dim;\n    if (!bias)\n    {\n        conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[convname + \".weight\"], emptywts);\n    }\n    else\n    {\n        conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[convname + \".weight\"], weightMap[convname + \".bias\"]);\n    }\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n    debug_print(conv1->getOutput(0), convname);\n    IScaleLayer *bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), bnname, 1e-5);\n    auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    debug_print(lr->getOutput(0), convname + \"_add\");\n    return lr;\n}\n\nILayer *netAddUpsampleBi(INetworkDefinition *network, ITensor *input, Dims outdims)\n{\n    // Bi + True\n    IResizeLayer *upSample = network->addResize(*input);\n    upSample->setResizeMode(ResizeMode::kLINEAR);\n    upSample->setOutputDimensions(outdims);\n    upSample->setAlignCorners(true); // tips!\n    return upSample;\n}\n\nIElementWiseLayer *convBnUpAdd(INetworkDefinition *network,\n                               std::map<std::string, Weights> &weightMap,\n                               ITensor &input, ITensor &addinput,\n                               int outch, int ksize, int s, int p,\n                               std::string convname,\n                               std::string bnname, bool upsample, bool bias = false)\n{\n    Weights 
emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer *conv1;\n    if (!bias)\n    {\n        conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[convname + \".weight\"], emptywts);\n    }\n    else\n    {\n        conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[convname + \".weight\"], weightMap[convname + \".bias\"]);\n    }\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n    debug_print(conv1->getOutput(0), convname + \"_1\");\n    IScaleLayer *bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), bnname, 1e-5);\n    if (!upsample)\n    {\n        IElementWiseLayer *add = network->addElementWise(*bn1->getOutput(0), addinput, ElementWiseOperation::kSUM);\n        debug_print(add->getOutput(0), convname + \"_add\");\n        return add;\n    }\n    else\n    {\n        // Resize the BN output to the spatial size of addinput before summing.\n        nvinfer1::Dims dim = addinput.getDimensions();\n        ILayer *up = netAddUpsampleBi(network, bn1->getOutput(0), dim);\n        IElementWiseLayer *add = network->addElementWise(*up->getOutput(0), addinput, ElementWiseOperation::kSUM);\n        debug_print(add->getOutput(0), convname + \"_add\");\n        //auto lr = network->addActivation(*add->getOutput(0), ActivationType::kRELU);\n        return add;\n    }\n}\n"
  },
  {
    "path": "hrnet/hrnet-semantic-segmentation/gen_wts.py",
    "content": "import argparse\nimport struct\n\nimport _init_paths\nimport models\nimport torch\nfrom config import config, update_config\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description=\"Export HRNet segmentation weights to a .wts file\")\n\n    parser.add_argument(\"--cfg\", help=\"experiment config file name\", type=str)\n    parser.add_argument(\"--ckpt_path\", help=\"checkpoint path\", required=True, type=str)\n    parser.add_argument(\"--save_path\", help=\".wts path\", required=True, type=str)\n\n    parser.add_argument(\n        \"opts\",\n        help=\"Modify config options using the command-line\",\n        default=None,\n        nargs=argparse.REMAINDER,\n    )\n\n    args = parser.parse_args()\n    update_config(config, args)\n\n    return args\n\n\ndef main():\n    args = parse_args()\n\n    # Look up the model module by name instead of calling eval() on a string.\n    model = getattr(models, config.MODEL.NAME).get_seg_model(config)\n\n    print(\"=> loading model from {}\".format(args.ckpt_path))\n    pretrained_dict = torch.load(args.ckpt_path, map_location=\"cpu\")\n    model_dict = model.state_dict()\n    # Drop the 6-character key prefix (e.g. \"model.\") and keep only keys the model knows.\n    pretrained_dict = {\n        k[6:]: v for k, v in pretrained_dict.items() if k[6:] in model_dict.keys()\n    }\n    for k, _ in pretrained_dict.items():\n        print(\"=> loading {} from pretrained model\".format(k))\n    model_dict.update(pretrained_dict)\n    model.load_state_dict(model_dict)\n\n    print(\"=> saving {} \".format(args.save_path))\n    # Format: first line is the tensor count, then one line per tensor:\n    # name, element count, and each value as big-endian float32 hex.\n    with open(args.save_path, \"w\") as f:\n        f.write(\"{}\\n\".format(len(model.state_dict().keys())))\n        for k, v in model.state_dict().items():\n            vr = v.reshape(-1).cpu().numpy()\n            f.write(\"{} {} \".format(k, len(vr)))\n            for vv in vr:\n                f.write(\" \")\n                f.write(struct.pack(\">f\", float(vv)).hex())\n            f.write(\"\\n\")\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "hrnet/hrnet-semantic-segmentation/hrnet.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include \"common.hpp\"\n#include \"logging.h\"\n\nstatic Logger gLogger;\n#define USE_FP32\n#define DEVICE 0 // GPU id\n#define BATCH_SIZE 1\n\nconst char *INPUT_BLOB_NAME = \"data\";\nconst char *OUTPUT_BLOB_NAME = \"output\";\nstatic const int INPUT_H = 512;\nstatic const int INPUT_W = 1024;\nstatic const int NUM_CLASSES = 19;\nstatic const int OUTPUT_SIZE = INPUT_H * INPUT_W;\n\n// Create the engine using only the API and not any parser.\nICudaEngine *createEngine(unsigned int maxBatchSize, IBuilder *builder, IBuilderConfig *config, DataType dt, std::string wtsPath, int width)\n{\n    INetworkDefinition *network = builder->createNetworkV2(0U);\n    // Create an HWC input tensor of shape {INPUT_H, INPUT_W, 3} with name INPUT_BLOB_NAME\n    ITensor *data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{INPUT_H, INPUT_W, 3});\n    assert(data);\n\n    // hwc to chw\n    auto ps = network->addShuffle(*data);\n    ps->setFirstTranspose(nvinfer1::Permutation{2, 0, 1});\n    float mean[3] = {0.485, 0.456, 0.406};\n    float std[3] = {0.229, 0.224, 0.225};\n    ITensor *preinput = MeanStd(network, ps->getOutput(0), mean, std, true);\n\n    std::map<std::string, Weights> weightMap = loadWeights(wtsPath);\n    auto relu_2 = convBnRelu(network, weightMap, *preinput, 64, 3, 2, 1, \"conv1\", \"bn1\");\n    auto relu_5 = convBnRelu(network, weightMap, *relu_2->getOutput(0), 64, 3, 2, 1, \"conv2\", \"bn2\");\n    auto relu_17 = ResBlock2Conv(network, weightMap, *relu_5->getOutput(0), 64, 256, 1, \"layer1.0\");\n    auto relu_27 = ResBlock(network, weightMap, *relu_17->getOutput(0), 256, 64, 1, \"layer1.1\");\n    auto relu_37 = ResBlock(network, weightMap, *relu_27->getOutput(0), 256, 64, 1, \"layer1.2\");\n    auto relu_47 = ResBlock(network, weightMap, *relu_37->getOutput(0), 256, 64, 1, \"layer1.3\");\n\n    auto relu_50 = convBnRelu(network, weightMap, 
*relu_47->getOutput(0), width, 3, 1, 1, \"transition1.0.0\", \"transition1.0.1\");\n    auto relu_60 = liteResBlock(network, weightMap, *relu_50->getOutput(0), width, \"stage2.0.branches.0.0\");\n    auto relu_67 = liteResBlock(network, weightMap, *relu_60->getOutput(0), width, \"stage2.0.branches.0.1\");\n    auto relu_74 = liteResBlock(network, weightMap, *relu_67->getOutput(0), width, \"stage2.0.branches.0.2\");\n    auto relu_81 = liteResBlock(network, weightMap, *relu_74->getOutput(0), width, \"stage2.0.branches.0.3\");\n\n    auto relu_53 = convBnRelu(network, weightMap, *relu_47->getOutput(0), width * 2, 3, 2, 1, \"transition1.1.0.0\", \"transition1.1.0.1\");\n    auto relu_88 = liteResBlock(network, weightMap, *relu_53->getOutput(0), width * 2, \"stage2.0.branches.1.0\");\n    auto relu_95 = liteResBlock(network, weightMap, *relu_88->getOutput(0), width * 2, \"stage2.0.branches.1.1\");\n    auto relu_102 = liteResBlock(network, weightMap, *relu_95->getOutput(0), width * 2, \"stage2.0.branches.1.2\");\n    auto relu_109 = liteResBlock(network, weightMap, *relu_102->getOutput(0), width * 2, \"stage2.0.branches.1.3\");\n\n    auto add_131 = convBnUpAdd(network, weightMap, *relu_109->getOutput(0), *relu_81->getOutput(0), width, 1, 1, 0, \"stage2.0.fuse_layers.0.1.0\", \"stage2.0.fuse_layers.0.1.1\", true);\n    auto relu_132 = network->addActivation(*add_131->getOutput(0), ActivationType::kRELU);\n\n    auto add_135 = convBnUpAdd(network, weightMap, *relu_81->getOutput(0), *relu_109->getOutput(0), width * 2, 3, 2, 1, \"stage2.0.fuse_layers.1.0.0.0\", \"stage2.0.fuse_layers.1.0.0.1\", false);\n    auto relu_136 = network->addActivation(*add_135->getOutput(0), ActivationType::kRELU);\n\n    auto relu_146 = liteResBlock(network, weightMap, *relu_132->getOutput(0), width, \"stage3.0.branches.0.0\");\n    auto relu_153 = liteResBlock(network, weightMap, *relu_146->getOutput(0), width, \"stage3.0.branches.0.1\");\n    auto relu_160 = liteResBlock(network, weightMap, 
*relu_153->getOutput(0), width, \"stage3.0.branches.0.2\");\n    auto relu_167 = liteResBlock(network, weightMap, *relu_160->getOutput(0), width, \"stage3.0.branches.0.3\");\n\n    auto relu_174 = liteResBlock(network, weightMap, *relu_136->getOutput(0), width * 2, \"stage3.0.branches.1.0\");\n    auto relu_181 = liteResBlock(network, weightMap, *relu_174->getOutput(0), width * 2, \"stage3.0.branches.1.1\");\n    auto relu_188 = liteResBlock(network, weightMap, *relu_181->getOutput(0), width * 2, \"stage3.0.branches.1.2\");\n    auto relu_195 = liteResBlock(network, weightMap, *relu_188->getOutput(0), width * 2, \"stage3.0.branches.1.3\");\n\n    auto relu_139 = convBnRelu(network, weightMap, *relu_136->getOutput(0), width * 4, 3, 2, 1, \"transition2.2.0.0\", \"transition2.2.0.1\");\n    auto relu_202 = liteResBlock(network, weightMap, *relu_139->getOutput(0), width * 4, \"stage3.0.branches.2.0\");\n    auto relu_209 = liteResBlock(network, weightMap, *relu_202->getOutput(0), width * 4, \"stage3.0.branches.2.1\");\n    auto relu_216 = liteResBlock(network, weightMap, *relu_209->getOutput(0), width * 4, \"stage3.0.branches.2.2\");\n    auto relu_223 = liteResBlock(network, weightMap, *relu_216->getOutput(0), width * 4, \"stage3.0.branches.2.3\");\n\n    auto add_245 = convBnUpAdd(network, weightMap, *relu_195->getOutput(0), *relu_167->getOutput(0), width, 1, 1, 0, \"stage3.0.fuse_layers.0.1.0\", \"stage3.0.fuse_layers.0.1.1\", true);\n    auto add_267 = convBnUpAdd(network, weightMap, *relu_223->getOutput(0), *add_245->getOutput(0), width, 1, 1, 0, \"stage3.0.fuse_layers.0.2.0\", \"stage3.0.fuse_layers.0.2.1\", true);\n    auto relu_268 = network->addActivation(*add_267->getOutput(0), ActivationType::kRELU);\n\n    auto add_271 = convBnUpAdd(network, weightMap, *relu_167->getOutput(0), *relu_195->getOutput(0), width * 2, 3, 2, 1, \"stage3.0.fuse_layers.1.0.0.0\", \"stage3.0.fuse_layers.1.0.0.1\", false);\n    auto add_293 = convBnUpAdd(network, weightMap, 
*relu_223->getOutput(0), *add_271->getOutput(0), width * 2, 1, 1, 0, \"stage3.0.fuse_layers.1.2.0\", \"stage3.0.fuse_layers.1.2.1\", true);\n    auto relu_294 = network->addActivation(*add_293->getOutput(0), ActivationType::kRELU);\n\n    auto relu_297 = convBnRelu(network, weightMap, *relu_167->getOutput(0), width, 3, 2, 1, \"stage3.0.fuse_layers.2.0.0.0\", \"stage3.0.fuse_layers.2.0.0.1\");\n    auto bn_299 = convBnRelu(network, weightMap, *relu_297->getOutput(0), width * 4, 3, 2, 1, \"stage3.0.fuse_layers.2.0.1.0\", \"stage3.0.fuse_layers.2.0.1.1\", false);\n    auto add_302 = convBnUpAdd(network, weightMap, *relu_195->getOutput(0), *bn_299->getOutput(0), width * 4, 3, 2, 1, \"stage3.0.fuse_layers.2.1.0.0\", \"stage3.0.fuse_layers.2.1.0.1\", false);\n    auto add_303 = network->addElementWise(*add_302->getOutput(0), *relu_223->getOutput(0), ElementWiseOperation::kSUM);\n    auto relu_304 = network->addActivation(*add_303->getOutput(0), ActivationType::kRELU);\n\n    auto relu_311 = liteResBlock(network, weightMap, *relu_268->getOutput(0), width, \"stage3.1.branches.0.0\");\n    auto relu_318 = liteResBlock(network, weightMap, *relu_311->getOutput(0), width, \"stage3.1.branches.0.1\");\n    auto relu_325 = liteResBlock(network, weightMap, *relu_318->getOutput(0), width, \"stage3.1.branches.0.2\");\n    auto relu_332 = liteResBlock(network, weightMap, *relu_325->getOutput(0), width, \"stage3.1.branches.0.3\");\n\n    auto relu_339 = liteResBlock(network, weightMap, *relu_294->getOutput(0), width * 2, \"stage3.1.branches.1.0\");\n    auto relu_346 = liteResBlock(network, weightMap, *relu_339->getOutput(0), width * 2, \"stage3.1.branches.1.1\");\n    auto relu_353 = liteResBlock(network, weightMap, *relu_346->getOutput(0), width * 2, \"stage3.1.branches.1.2\");\n    auto relu_360 = liteResBlock(network, weightMap, *relu_353->getOutput(0), width * 2, \"stage3.1.branches.1.3\");\n\n    auto relu_367 = liteResBlock(network, weightMap, *relu_304->getOutput(0), width * 
4, \"stage3.1.branches.2.0\");\n    auto relu_374 = liteResBlock(network, weightMap, *relu_367->getOutput(0), width * 4, \"stage3.1.branches.2.1\");\n    auto relu_381 = liteResBlock(network, weightMap, *relu_374->getOutput(0), width * 4, \"stage3.1.branches.2.2\");\n    auto relu_388 = liteResBlock(network, weightMap, *relu_381->getOutput(0), width * 4, \"stage3.1.branches.2.3\");\n\n    auto add_410 = convBnUpAdd(network, weightMap, *relu_360->getOutput(0), *relu_332->getOutput(0), width, 1, 1, 0, \"stage3.1.fuse_layers.0.1.0\", \"stage3.1.fuse_layers.0.1.1\", true);\n    auto add_432 = convBnUpAdd(network, weightMap, *relu_388->getOutput(0), *add_410->getOutput(0), width, 1, 1, 0, \"stage3.1.fuse_layers.0.2.0\", \"stage3.1.fuse_layers.0.2.1\", true);\n    auto relu_433 = network->addActivation(*add_432->getOutput(0), ActivationType::kRELU);\n\n    auto add_436 = convBnUpAdd(network, weightMap, *relu_332->getOutput(0), *relu_360->getOutput(0), width * 2, 3, 2, 1, \"stage3.1.fuse_layers.1.0.0.0\", \"stage3.1.fuse_layers.1.0.0.1\", false);\n    auto add_458 = convBnUpAdd(network, weightMap, *relu_388->getOutput(0), *add_436->getOutput(0), width * 2, 1, 1, 0, \"stage3.1.fuse_layers.1.2.0\", \"stage3.1.fuse_layers.1.2.1\", true);\n    auto relu_459 = network->addActivation(*add_458->getOutput(0), ActivationType::kRELU);\n\n    auto relu_462 = convBnRelu(network, weightMap, *relu_332->getOutput(0), width, 3, 2, 1, \"stage3.1.fuse_layers.2.0.0.0\", \"stage3.1.fuse_layers.2.0.0.1\");\n    auto bn_464 = convBnRelu(network, weightMap, *relu_462->getOutput(0), width * 4, 3, 2, 1, \"stage3.1.fuse_layers.2.0.1.0\", \"stage3.1.fuse_layers.2.0.1.1\", false);\n    auto add_467 = convBnUpAdd(network, weightMap, *relu_360->getOutput(0), *bn_464->getOutput(0), width * 4, 3, 2, 1, \"stage3.1.fuse_layers.2.1.0.0\", \"stage3.1.fuse_layers.2.1.0.1\", false);\n    auto add_468 = network->addElementWise(*add_467->getOutput(0), *relu_388->getOutput(0), ElementWiseOperation::kSUM);\n    
auto relu_469 = network->addActivation(*add_468->getOutput(0), ActivationType::kRELU);\n\n    auto relu_476 = liteResBlock(network, weightMap, *relu_433->getOutput(0), width, \"stage3.2.branches.0.0\");\n    auto relu_483 = liteResBlock(network, weightMap, *relu_476->getOutput(0), width, \"stage3.2.branches.0.1\");\n    auto relu_490 = liteResBlock(network, weightMap, *relu_483->getOutput(0), width, \"stage3.2.branches.0.2\");\n    auto relu_497 = liteResBlock(network, weightMap, *relu_490->getOutput(0), width, \"stage3.2.branches.0.3\");\n\n    auto relu_504 = liteResBlock(network, weightMap, *relu_459->getOutput(0), width * 2, \"stage3.2.branches.1.0\");\n    auto relu_511 = liteResBlock(network, weightMap, *relu_504->getOutput(0), width * 2, \"stage3.2.branches.1.1\");\n    auto relu_518 = liteResBlock(network, weightMap, *relu_511->getOutput(0), width * 2, \"stage3.2.branches.1.2\");\n    auto relu_525 = liteResBlock(network, weightMap, *relu_518->getOutput(0), width * 2, \"stage3.2.branches.1.3\");\n\n    auto relu_532 = liteResBlock(network, weightMap, *relu_469->getOutput(0), width * 4, \"stage3.2.branches.2.0\");\n    auto relu_539 = liteResBlock(network, weightMap, *relu_532->getOutput(0), width * 4, \"stage3.2.branches.2.1\");\n    auto relu_546 = liteResBlock(network, weightMap, *relu_539->getOutput(0), width * 4, \"stage3.2.branches.2.2\");\n    auto relu_553 = liteResBlock(network, weightMap, *relu_546->getOutput(0), width * 4, \"stage3.2.branches.2.3\");\n\n    auto add_575 = convBnUpAdd(network, weightMap, *relu_525->getOutput(0), *relu_497->getOutput(0), width, 1, 1, 0, \"stage3.2.fuse_layers.0.1.0\", \"stage3.2.fuse_layers.0.1.1\", true);\n    auto add_597 = convBnUpAdd(network, weightMap, *relu_553->getOutput(0), *add_575->getOutput(0), width, 1, 1, 0, \"stage3.2.fuse_layers.0.2.0\", \"stage3.2.fuse_layers.0.2.1\", true);\n\n    auto relu_598 = network->addActivation(*add_597->getOutput(0), ActivationType::kRELU);\n\n    auto add_601 = 
convBnUpAdd(network, weightMap, *relu_497->getOutput(0), *relu_525->getOutput(0), width * 2, 3, 2, 1, \"stage3.2.fuse_layers.1.0.0.0\", \"stage3.2.fuse_layers.1.0.0.1\", false);\n    auto add_623 = convBnUpAdd(network, weightMap, *relu_553->getOutput(0), *add_601->getOutput(0), width * 2, 1, 1, 0, \"stage3.2.fuse_layers.1.2.0\", \"stage3.2.fuse_layers.1.2.1\", true);\n    auto relu_624 = network->addActivation(*add_623->getOutput(0), ActivationType::kRELU);\n\n    auto relu_627 = convBnRelu(network, weightMap, *relu_497->getOutput(0), width, 3, 2, 1, \"stage3.2.fuse_layers.2.0.0.0\", \"stage3.2.fuse_layers.2.0.0.1\");\n    auto bn_629 = convBnRelu(network, weightMap, *relu_627->getOutput(0), width * 4, 3, 2, 1, \"stage3.2.fuse_layers.2.0.1.0\", \"stage3.2.fuse_layers.2.0.1.1\", false);\n    auto add_632 = convBnUpAdd(network, weightMap, *relu_525->getOutput(0), *bn_629->getOutput(0), width * 4, 3, 2, 1, \"stage3.2.fuse_layers.2.1.0.0\", \"stage3.2.fuse_layers.2.1.0.1\", false);\n    auto add_633 = network->addElementWise(*relu_553->getOutput(0), *add_632->getOutput(0), ElementWiseOperation::kSUM);\n    auto relu_634 = network->addActivation(*add_633->getOutput(0), ActivationType::kRELU);\n\n    auto relu_641 = liteResBlock(network, weightMap, *relu_598->getOutput(0), width, \"stage3.3.branches.0.0\");\n    auto relu_648 = liteResBlock(network, weightMap, *relu_641->getOutput(0), width, \"stage3.3.branches.0.1\");\n    auto relu_655 = liteResBlock(network, weightMap, *relu_648->getOutput(0), width, \"stage3.3.branches.0.2\");\n    auto relu_662 = liteResBlock(network, weightMap, *relu_655->getOutput(0), width, \"stage3.3.branches.0.3\");\n\n    auto relu_669 = liteResBlock(network, weightMap, *relu_624->getOutput(0), width * 2, \"stage3.3.branches.1.0\");\n    auto relu_676 = liteResBlock(network, weightMap, *relu_669->getOutput(0), width * 2, \"stage3.3.branches.1.1\");\n    auto relu_683 = liteResBlock(network, weightMap, *relu_676->getOutput(0), width * 2, 
\"stage3.3.branches.1.2\");\n    auto relu_690 = liteResBlock(network, weightMap, *relu_683->getOutput(0), width * 2, \"stage3.3.branches.1.3\");\n\n    auto relu_697 = liteResBlock(network, weightMap, *relu_634->getOutput(0), width * 4, \"stage3.3.branches.2.0\");\n    auto relu_704 = liteResBlock(network, weightMap, *relu_697->getOutput(0), width * 4, \"stage3.3.branches.2.1\");\n    auto relu_711 = liteResBlock(network, weightMap, *relu_704->getOutput(0), width * 4, \"stage3.3.branches.2.2\");\n    auto relu_718 = liteResBlock(network, weightMap, *relu_711->getOutput(0), width * 4, \"stage3.3.branches.2.3\");\n\n    auto add_740 = convBnUpAdd(network, weightMap, *relu_690->getOutput(0), *relu_662->getOutput(0), width, 1, 1, 0, \"stage3.3.fuse_layers.0.1.0\", \"stage3.3.fuse_layers.0.1.1\", true);\n    auto add_762 = convBnUpAdd(network, weightMap, *relu_718->getOutput(0), *add_740->getOutput(0), width, 1, 1, 0, \"stage3.3.fuse_layers.0.2.0\", \"stage3.3.fuse_layers.0.2.1\", true);\n    auto relu_763 = network->addActivation(*add_762->getOutput(0), ActivationType::kRELU);\n\n    auto add_766 = convBnUpAdd(network, weightMap, *relu_662->getOutput(0), *relu_690->getOutput(0), width * 2, 3, 2, 1, \"stage3.3.fuse_layers.1.0.0.0\", \"stage3.3.fuse_layers.1.0.0.1\", false);\n    auto add_788 = convBnUpAdd(network, weightMap, *relu_718->getOutput(0), *add_766->getOutput(0), width * 2, 1, 1, 0, \"stage3.3.fuse_layers.1.2.0\", \"stage3.3.fuse_layers.1.2.1\", true);\n    auto relu_789 = network->addActivation(*add_788->getOutput(0), ActivationType::kRELU);\n\n    auto relu_792 = convBnRelu(network, weightMap, *relu_662->getOutput(0), width, 3, 2, 1, \"stage3.3.fuse_layers.2.0.0.0\", \"stage3.3.fuse_layers.2.0.0.1\");\n    auto bn_794 = convBnRelu(network, weightMap, *relu_792->getOutput(0), width * 4, 3, 2, 1, \"stage3.3.fuse_layers.2.0.1.0\", \"stage3.3.fuse_layers.2.0.1.1\", false);\n    auto add_797 = convBnUpAdd(network, weightMap, *relu_690->getOutput(0), 
*bn_794->getOutput(0), width * 4, 3, 2, 1, \"stage3.3.fuse_layers.2.1.0.0\", \"stage3.3.fuse_layers.2.1.0.1\", false);\n    auto add_798 = network->addElementWise(*relu_718->getOutput(0), *add_797->getOutput(0), ElementWiseOperation::kSUM);\n    auto relu_799 = network->addActivation(*add_798->getOutput(0), ActivationType::kRELU);\n\n    auto relu_809 = liteResBlock(network, weightMap, *relu_763->getOutput(0), width, \"stage4.0.branches.0.0\");\n    auto relu_816 = liteResBlock(network, weightMap, *relu_809->getOutput(0), width, \"stage4.0.branches.0.1\");\n    auto relu_823 = liteResBlock(network, weightMap, *relu_816->getOutput(0), width, \"stage4.0.branches.0.2\");\n    auto relu_830 = liteResBlock(network, weightMap, *relu_823->getOutput(0), width, \"stage4.0.branches.0.3\");\n\n    auto relu_837 = liteResBlock(network, weightMap, *relu_789->getOutput(0), width * 2, \"stage4.0.branches.1.0\");\n    auto relu_844 = liteResBlock(network, weightMap, *relu_837->getOutput(0), width * 2, \"stage4.0.branches.1.1\");\n    auto relu_851 = liteResBlock(network, weightMap, *relu_844->getOutput(0), width * 2, \"stage4.0.branches.1.2\");\n    auto relu_858 = liteResBlock(network, weightMap, *relu_851->getOutput(0), width * 2, \"stage4.0.branches.1.3\");\n\n    auto relu_865 = liteResBlock(network, weightMap, *relu_799->getOutput(0), width * 4, \"stage4.0.branches.2.0\");\n    auto relu_872 = liteResBlock(network, weightMap, *relu_865->getOutput(0), width * 4, \"stage4.0.branches.2.1\");\n    auto relu_879 = liteResBlock(network, weightMap, *relu_872->getOutput(0), width * 4, \"stage4.0.branches.2.2\");\n    auto relu_886 = liteResBlock(network, weightMap, *relu_879->getOutput(0), width * 4, \"stage4.0.branches.2.3\"); //========\n\n    auto relu_802 = convBnRelu(network, weightMap, *relu_799->getOutput(0), width * 8, 3, 2, 1, \"transition3.3.0.0\", \"transition3.3.0.1\");\n    auto relu_893 = liteResBlock(network, weightMap, *relu_802->getOutput(0), width * 8, 
\"stage4.0.branches.3.0\");\n    auto relu_900 = liteResBlock(network, weightMap, *relu_893->getOutput(0), width * 8, \"stage4.0.branches.3.1\");\n    auto relu_907 = liteResBlock(network, weightMap, *relu_900->getOutput(0), width * 8, \"stage4.0.branches.3.2\");\n    auto relu_914 = liteResBlock(network, weightMap, *relu_907->getOutput(0), width * 8, \"stage4.0.branches.3.3\");\n\n    auto add_936 = convBnUpAdd(network, weightMap, *relu_858->getOutput(0), *relu_830->getOutput(0), width, 1, 1, 0, \"stage4.0.fuse_layers.0.1.0\", \"stage4.0.fuse_layers.0.1.1\", true);\n    auto add_958 = convBnUpAdd(network, weightMap, *relu_886->getOutput(0), *add_936->getOutput(0), width, 1, 1, 0, \"stage4.0.fuse_layers.0.2.0\", \"stage4.0.fuse_layers.0.2.1\", true);\n    auto add_980 = convBnUpAdd(network, weightMap, *relu_914->getOutput(0), *add_958->getOutput(0), width, 1, 1, 0, \"stage4.0.fuse_layers.0.3.0\", \"stage4.0.fuse_layers.0.3.1\", true);\n    auto relu_981 = network->addActivation(*add_980->getOutput(0), ActivationType::kRELU);\n\n    auto add_984 = convBnUpAdd(network, weightMap, *relu_830->getOutput(0), *relu_858->getOutput(0), width * 2, 3, 2, 1, \"stage4.0.fuse_layers.1.0.0.0\", \"stage4.0.fuse_layers.1.0.0.1\", false);\n    auto add_1006 = convBnUpAdd(network, weightMap, *relu_886->getOutput(0), *add_984->getOutput(0), width * 2, 1, 1, 0, \"stage4.0.fuse_layers.1.2.0\", \"stage4.0.fuse_layers.1.2.1\", true);\n    auto add_1028 = convBnUpAdd(network, weightMap, *relu_914->getOutput(0), *add_1006->getOutput(0), width * 2, 1, 1, 0, \"stage4.0.fuse_layers.1.3.0\", \"stage4.0.fuse_layers.1.3.1\", true);\n    auto relu_1029 = network->addActivation(*add_1028->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1032 = convBnRelu(network, weightMap, *relu_830->getOutput(0), width, 3, 2, 1, \"stage4.0.fuse_layers.2.0.0.0\", \"stage4.0.fuse_layers.2.0.0.1\");\n    auto bn_1034 = convBnRelu(network, weightMap, *relu_1032->getOutput(0), width * 4, 3, 2, 1, 
\"stage4.0.fuse_layers.2.0.1.0\", \"stage4.0.fuse_layers.2.0.1.1\", false);\n\n    auto add_1037 = convBnUpAdd(network, weightMap, *relu_858->getOutput(0), *bn_1034->getOutput(0), width * 4, 3, 2, 1,\n                                \"stage4.0.fuse_layers.2.1.0.0\", \"stage4.0.fuse_layers.2.1.0.1\", false);\n    auto add_1038 = network->addElementWise(*relu_886->getOutput(0), *add_1037->getOutput(0), ElementWiseOperation::kSUM);\n    auto add_1060 = convBnUpAdd(network, weightMap, *relu_914->getOutput(0), *add_1038->getOutput(0), width * 4, 1, 1, 0,\n                                \"stage4.0.fuse_layers.2.3.0\", \"stage4.0.fuse_layers.2.3.1\", true);\n    auto relu_1061 = network->addActivation(*add_1060->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1064 = convBnRelu(network, weightMap, *relu_830->getOutput(0), width, 3, 2, 1, \"stage4.0.fuse_layers.3.0.0.0\", \"stage4.0.fuse_layers.3.0.0.1\");\n    auto relu_1067 = convBnRelu(network, weightMap, *relu_1064->getOutput(0), width, 3, 2, 1, \"stage4.0.fuse_layers.3.0.1.0\", \"stage4.0.fuse_layers.3.0.1.1\");\n    auto bn_1069 = convBnRelu(network, weightMap, *relu_1067->getOutput(0), width * 8, 3, 2, 1, \"stage4.0.fuse_layers.3.0.2.0\", \"stage4.0.fuse_layers.3.0.2.1\", false);\n    auto relu_1072 = convBnRelu(network, weightMap, *relu_858->getOutput(0), width * 2, 3, 2, 1, \"stage4.0.fuse_layers.3.1.0.0\", \"stage4.0.fuse_layers.3.1.0.1\");\n    auto add_1075 = convBnUpAdd(network, weightMap, *relu_1072->getOutput(0), *bn_1069->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.0.fuse_layers.3.1.1.0\", \"stage4.0.fuse_layers.3.1.1.1\", false);\n    auto add_1078 = convBnUpAdd(network, weightMap, *relu_886->getOutput(0), *add_1075->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.0.fuse_layers.3.2.0.0\", \"stage4.0.fuse_layers.3.2.0.1\", false);\n    auto add_1079 = network->addElementWise(*relu_914->getOutput(0), *add_1078->getOutput(0), 
ElementWiseOperation::kSUM);\n    auto relu_1080 = network->addActivation(*add_1079->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1087 = liteResBlock(network, weightMap, *relu_981->getOutput(0), width, \"stage4.1.branches.0.0\");\n    auto relu_1094 = liteResBlock(network, weightMap, *relu_1087->getOutput(0), width, \"stage4.1.branches.0.1\");\n    auto relu_1101 = liteResBlock(network, weightMap, *relu_1094->getOutput(0), width, \"stage4.1.branches.0.2\");\n    auto relu_1108 = liteResBlock(network, weightMap, *relu_1101->getOutput(0), width, \"stage4.1.branches.0.3\");\n\n    auto relu_1115 = liteResBlock(network, weightMap, *relu_1029->getOutput(0), width * 2, \"stage4.1.branches.1.0\");\n    auto relu_1122 = liteResBlock(network, weightMap, *relu_1115->getOutput(0), width * 2, \"stage4.1.branches.1.1\");\n    auto relu_1129 = liteResBlock(network, weightMap, *relu_1122->getOutput(0), width * 2, \"stage4.1.branches.1.2\");\n    auto relu_1136 = liteResBlock(network, weightMap, *relu_1129->getOutput(0), width * 2, \"stage4.1.branches.1.3\");\n\n    auto relu_1143 = liteResBlock(network, weightMap, *relu_1061->getOutput(0), width * 4, \"stage4.1.branches.2.0\");\n    auto relu_1150 = liteResBlock(network, weightMap, *relu_1143->getOutput(0), width * 4, \"stage4.1.branches.2.1\");\n    auto relu_1157 = liteResBlock(network, weightMap, *relu_1150->getOutput(0), width * 4, \"stage4.1.branches.2.2\");\n    auto relu_1164 = liteResBlock(network, weightMap, *relu_1157->getOutput(0), width * 4, \"stage4.1.branches.2.3\");\n\n    auto relu_1171 = liteResBlock(network, weightMap, *relu_1080->getOutput(0), width * 8, \"stage4.1.branches.3.0\");\n    auto relu_1178 = liteResBlock(network, weightMap, *relu_1171->getOutput(0), width * 8, \"stage4.1.branches.3.1\");\n    auto relu_1185 = liteResBlock(network, weightMap, *relu_1178->getOutput(0), width * 8, \"stage4.1.branches.3.2\");\n    auto relu_1192 = liteResBlock(network, weightMap, *relu_1185->getOutput(0), 
width * 8, \"stage4.1.branches.3.3\");\n\n    auto add_1214 = convBnUpAdd(network, weightMap, *relu_1136->getOutput(0), *relu_1108->getOutput(0), width, 1, 1, 0,\n                                \"stage4.1.fuse_layers.0.1.0\", \"stage4.1.fuse_layers.0.1.1\", true);\n    auto add_1236 = convBnUpAdd(network, weightMap, *relu_1164->getOutput(0), *add_1214->getOutput(0), width, 1, 1, 0,\n                                \"stage4.1.fuse_layers.0.2.0\", \"stage4.1.fuse_layers.0.2.1\", true);\n    auto add_1258 = convBnUpAdd(network, weightMap, *relu_1192->getOutput(0), *add_1236->getOutput(0), width, 1, 1, 0,\n                                \"stage4.1.fuse_layers.0.3.0\", \"stage4.1.fuse_layers.0.3.1\", true);\n    auto relu_1259 = network->addActivation(*add_1258->getOutput(0), ActivationType::kRELU);\n\n    auto add_1262 = convBnUpAdd(network, weightMap, *relu_1108->getOutput(0), *relu_1136->getOutput(0), width * 2, 3, 2, 1,\n                                \"stage4.1.fuse_layers.1.0.0.0\", \"stage4.1.fuse_layers.1.0.0.1\", false);\n    auto add_1284 = convBnUpAdd(network, weightMap, *relu_1164->getOutput(0), *add_1262->getOutput(0), width * 2, 1, 1, 0,\n                                \"stage4.1.fuse_layers.1.2.0\", \"stage4.1.fuse_layers.1.2.1\", true);\n    auto add_1306 = convBnUpAdd(network, weightMap, *relu_1192->getOutput(0), *add_1284->getOutput(0), width * 2, 1, 1, 0,\n                                \"stage4.1.fuse_layers.1.3.0\", \"stage4.1.fuse_layers.1.3.1\", true);\n    auto relu_1307 = network->addActivation(*add_1306->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1310 = convBnRelu(network, weightMap, *relu_1108->getOutput(0), width, 3, 2, 1, \"stage4.1.fuse_layers.2.0.0.0\", \"stage4.1.fuse_layers.2.0.0.1\");\n    auto bn_1312 = convBnRelu(network, weightMap, *relu_1310->getOutput(0), width * 4, 3, 2, 1, \"stage4.1.fuse_layers.2.0.1.0\", \"stage4.1.fuse_layers.2.0.1.1\", false);\n    auto add_1315 = convBnUpAdd(network, weightMap, 
*relu_1136->getOutput(0), *bn_1312->getOutput(0), width * 4, 3, 2, 1,\n                                \"stage4.1.fuse_layers.2.1.0.0\", \"stage4.1.fuse_layers.2.1.0.1\", false);\n    auto add_1316 = network->addElementWise(*relu_1164->getOutput(0), *add_1315->getOutput(0), ElementWiseOperation::kSUM);\n    auto add_1338 = convBnUpAdd(network, weightMap, *relu_1192->getOutput(0), *add_1316->getOutput(0), width * 4, 1, 1, 0,\n                                \"stage4.1.fuse_layers.2.3.0\", \"stage4.1.fuse_layers.2.3.1\", true);\n    auto relu_1339 = network->addActivation(*add_1338->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1342 = convBnRelu(network, weightMap, *relu_1108->getOutput(0), width, 3, 2, 1, \"stage4.1.fuse_layers.3.0.0.0\", \"stage4.1.fuse_layers.3.0.0.1\");\n    auto relu_1345 = convBnRelu(network, weightMap, *relu_1342->getOutput(0), width, 3, 2, 1, \"stage4.1.fuse_layers.3.0.1.0\", \"stage4.1.fuse_layers.3.0.1.1\");\n    auto bn_1347 = convBnRelu(network, weightMap, *relu_1345->getOutput(0), width * 8, 3, 2, 1, \"stage4.1.fuse_layers.3.0.2.0\", \"stage4.1.fuse_layers.3.0.2.1\", false);\n    auto relu_1350 = convBnRelu(network, weightMap, *relu_1136->getOutput(0), width * 2, 3, 2, 1, \"stage4.1.fuse_layers.3.1.0.0\", \"stage4.1.fuse_layers.3.1.0.1\");\n    auto add_1353 = convBnUpAdd(network, weightMap, *relu_1350->getOutput(0), *bn_1347->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.1.fuse_layers.3.1.1.0\", \"stage4.1.fuse_layers.3.1.1.1\", false);\n    auto add_1356 = convBnUpAdd(network, weightMap, *relu_1164->getOutput(0), *add_1353->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.1.fuse_layers.3.2.0.0\", \"stage4.1.fuse_layers.3.2.0.1\", false);\n    auto add_1357 = network->addElementWise(*relu_1192->getOutput(0), *add_1356->getOutput(0), ElementWiseOperation::kSUM);\n    auto relu_1358 = network->addActivation(*add_1357->getOutput(0), ActivationType::kRELU);\n\n    auto 
relu_1365 = liteResBlock(network, weightMap, *relu_1259->getOutput(0), width, \"stage4.2.branches.0.0\");\n    auto relu_1372 = liteResBlock(network, weightMap, *relu_1365->getOutput(0), width, \"stage4.2.branches.0.1\");\n    auto relu_1379 = liteResBlock(network, weightMap, *relu_1372->getOutput(0), width, \"stage4.2.branches.0.2\");\n    auto relu_1386 = liteResBlock(network, weightMap, *relu_1379->getOutput(0), width, \"stage4.2.branches.0.3\");\n\n    auto relu_1393 = liteResBlock(network, weightMap, *relu_1307->getOutput(0), width * 2, \"stage4.2.branches.1.0\");\n    auto relu_1400 = liteResBlock(network, weightMap, *relu_1393->getOutput(0), width * 2, \"stage4.2.branches.1.1\");\n    auto relu_1407 = liteResBlock(network, weightMap, *relu_1400->getOutput(0), width * 2, \"stage4.2.branches.1.2\");\n    auto relu_1414 = liteResBlock(network, weightMap, *relu_1407->getOutput(0), width * 2, \"stage4.2.branches.1.3\");\n\n    auto relu_1421 = liteResBlock(network, weightMap, *relu_1339->getOutput(0), width * 4, \"stage4.2.branches.2.0\");\n    auto relu_1428 = liteResBlock(network, weightMap, *relu_1421->getOutput(0), width * 4, \"stage4.2.branches.2.1\");\n    auto relu_1435 = liteResBlock(network, weightMap, *relu_1428->getOutput(0), width * 4, \"stage4.2.branches.2.2\");\n    auto relu_1442 = liteResBlock(network, weightMap, *relu_1435->getOutput(0), width * 4, \"stage4.2.branches.2.3\");\n\n    auto relu_1449 = liteResBlock(network, weightMap, *relu_1358->getOutput(0), width * 8, \"stage4.2.branches.3.0\");\n    auto relu_1456 = liteResBlock(network, weightMap, *relu_1449->getOutput(0), width * 8, \"stage4.2.branches.3.1\");\n    auto relu_1463 = liteResBlock(network, weightMap, *relu_1456->getOutput(0), width * 8, \"stage4.2.branches.3.2\");\n    auto relu_1470 = liteResBlock(network, weightMap, *relu_1463->getOutput(0), width * 8, \"stage4.2.branches.3.3\");\n\n    auto add_1492 = convBnUpAdd(network, weightMap, *relu_1414->getOutput(0), 
*relu_1386->getOutput(0), width, 1, 1, 0,\n                                \"stage4.2.fuse_layers.0.1.0\", \"stage4.2.fuse_layers.0.1.1\", true);\n    auto add_1514 = convBnUpAdd(network, weightMap, *relu_1442->getOutput(0), *add_1492->getOutput(0), width, 1, 1, 0,\n                                \"stage4.2.fuse_layers.0.2.0\", \"stage4.2.fuse_layers.0.2.1\", true);\n\n    auto add_1536 = convBnUpAdd(network, weightMap, *relu_1470->getOutput(0), *add_1514->getOutput(0), width, 1, 1, 0,\n                                \"stage4.2.fuse_layers.0.3.0\", \"stage4.2.fuse_layers.0.3.1\", true);\n    auto relu_1537 = network->addActivation(*add_1536->getOutput(0), ActivationType::kRELU);\n\n    auto add_1540 = convBnUpAdd(network, weightMap, *relu_1386->getOutput(0), *relu_1414->getOutput(0),\n                                width * 2, 3, 2, 1, \"stage4.2.fuse_layers.1.0.0.0\", \"stage4.2.fuse_layers.1.0.0.1\", false);\n    auto add_1562 = convBnUpAdd(network, weightMap, *relu_1442->getOutput(0), *add_1540->getOutput(0),\n                                width * 2, 1, 1, 0, \"stage4.2.fuse_layers.1.2.0\", \"stage4.2.fuse_layers.1.2.1\", true);\n    auto add_1584 = convBnUpAdd(network, weightMap, *relu_1470->getOutput(0), *add_1562->getOutput(0),\n                                width * 2, 1, 1, 0, \"stage4.2.fuse_layers.1.3.0\", \"stage4.2.fuse_layers.1.3.1\", true);\n    auto relu_1585 = network->addActivation(*add_1584->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1588 = convBnRelu(network, weightMap, *relu_1386->getOutput(0), width, 3, 2, 1, \"stage4.2.fuse_layers.2.0.0.0\", \"stage4.2.fuse_layers.2.0.0.1\");\n    auto bn_1590 = convBnRelu(network, weightMap, *relu_1588->getOutput(0), width * 4, 3, 2, 1, \"stage4.2.fuse_layers.2.0.1.0\", \"stage4.2.fuse_layers.2.0.1.1\", false);\n    auto add_1593 = convBnUpAdd(network, weightMap, *relu_1414->getOutput(0), *bn_1590->getOutput(0), width * 4, 3, 2, 1,\n                                
\"stage4.2.fuse_layers.2.1.0.0\", \"stage4.2.fuse_layers.2.1.0.1\", false);\n    auto add_1594 = network->addElementWise(*relu_1442->getOutput(0), *add_1593->getOutput(0), ElementWiseOperation::kSUM);\n    auto add_1616 = convBnUpAdd(network, weightMap, *relu_1470->getOutput(0), *add_1594->getOutput(0), width * 4, 1, 1, 0,\n                                \"stage4.2.fuse_layers.2.3.0\", \"stage4.2.fuse_layers.2.3.1\", true);\n    auto relu_1617 = network->addActivation(*add_1616->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1620 = convBnRelu(network, weightMap, *relu_1386->getOutput(0), width, 3, 2, 1, \"stage4.2.fuse_layers.3.0.0.0\", \"stage4.2.fuse_layers.3.0.0.1\");\n    auto relu_1623 = convBnRelu(network, weightMap, *relu_1620->getOutput(0), width, 3, 2, 1, \"stage4.2.fuse_layers.3.0.1.0\", \"stage4.2.fuse_layers.3.0.1.1\");\n    auto bn_1625 = convBnRelu(network, weightMap, *relu_1623->getOutput(0), width * 8, 3, 2, 1, \"stage4.2.fuse_layers.3.0.2.0\", \"stage4.2.fuse_layers.3.0.2.1\", false);\n    auto relu_1628 = convBnRelu(network, weightMap, *relu_1414->getOutput(0), width * 2, 3, 2, 1, \"stage4.2.fuse_layers.3.1.0.0\", \"stage4.2.fuse_layers.3.1.0.1\");\n    auto add_1631 = convBnUpAdd(network, weightMap, *relu_1628->getOutput(0), *bn_1625->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.2.fuse_layers.3.1.1.0\", \"stage4.2.fuse_layers.3.1.1.1\", false);\n    auto add_1634 = convBnUpAdd(network, weightMap, *relu_1442->getOutput(0), *add_1631->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.2.fuse_layers.3.2.0.0\", \"stage4.2.fuse_layers.3.2.0.1\", false);\n    auto add_1635 = network->addElementWise(*relu_1470->getOutput(0), *add_1634->getOutput(0), ElementWiseOperation::kSUM);\n    auto relu_1636 = network->addActivation(*add_1635->getOutput(0), ActivationType::kRELU);\n\n    nvinfer1::Dims dim = relu_1537->getOutput(0)->getDimensions();\n    dim.d[0] = 
relu_1585->getOutput(0)->getDimensions().d[0];\n    auto resize_1655 = netAddUpsampleBi(network, relu_1585->getOutput(0), dim);\n    dim.d[0] = relu_1617->getOutput(0)->getDimensions().d[0];\n    auto resize_1668 = netAddUpsampleBi(network, relu_1617->getOutput(0), dim);\n    dim.d[0] = relu_1636->getOutput(0)->getDimensions().d[0];\n    auto resize_1681 = netAddUpsampleBi(network, relu_1636->getOutput(0), dim);\n\n    ITensor *concatTensors[] = {relu_1537->getOutput(0), resize_1655->getOutput(0), resize_1668->getOutput(0), resize_1681->getOutput(0)};\n    auto concat_1682 = network->addConcatenation(concatTensors, 4);\n    concat_1682->setAxis(0);\n    auto relu_1685 = convBnRelu(network, weightMap, *concat_1682->getOutput(0), width * 15, 1, 1, 0, \"last_layer.0\", \"last_layer.1\", true, true);\n    auto conv_1686 = network->addConvolutionNd(*relu_1685->getOutput(0), NUM_CLASSES, DimsHW{1, 1}, weightMap[\"last_layer.3.weight\"], weightMap[\"last_layer.3.bias\"]);\n    conv_1686->setStrideNd(DimsHW{1, 1});\n    conv_1686->setPaddingNd(DimsHW{0, 0});\n    debug_print(conv_1686->getOutput(0), \"conv_1686\");\n    dim.d[0] = NUM_CLASSES;\n    dim.d[1] = INPUT_H;\n    dim.d[2] = INPUT_W;\n    auto feature_map = netAddUpsampleBi(network, conv_1686->getOutput(0), dim);\n    debug_print(feature_map->getOutput(0), \"feature_map\");\n    auto topk = network->addTopK(*feature_map->getOutput(0), TopKOperation::kMAX, 1, 0X01);\n    debug_print(topk->getOutput(0), \"topk\");\n    std::cout << \"set name out\" << std::endl;\n    // topk->getOutput(1) 1 is index\n    topk->getOutput(1)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*topk->getOutput(1));\n\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize((1 << 30)); // 1G\n#ifdef USE_FP16\n    std::cout << \"use fp16\" << std::endl;\n    config->setFlag(BuilderFlag::kFP16);\n#endif\n    ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build success!\" 
<< std::endl;\n    network->destroy();\n    for (auto &mem : weightMap)\n    {\n        free((void *)(mem.second.values));\n    }\n    return engine;\n}\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory **modelStream, std::string wtsPath, int width)\n{\n    IBuilder *builder = createInferBuilder(gLogger);\n    IBuilderConfig *config = builder->createBuilderConfig();\n    ICudaEngine *engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT, wtsPath, width);\n    assert(engine != nullptr);\n    (*modelStream) = engine->serialize();\n    engine->destroy();\n    builder->destroy();\n}\n\nbool parse_args(int argc, char **argv, std::string &wts, std::string &engine, int &width, std::string &img_dir)\n{\n    // argv[1] is only valid when at least one argument was passed\n    if (argc < 2)\n    {\n        return false;\n    }\n    if (std::string(argv[1]) == \"-s\" && argc == 5)\n    {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        width = std::stoi(argv[4]);\n    }\n    else if (std::string(argv[1]) == \"-d\" && argc == 4)\n    {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n    }\n    else\n    {\n        return false;\n    }\n    return true;\n}\nvoid doInference(IExecutionContext &context, cudaStream_t &stream, void **buffers, int batchSize)\n{\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    cudaStreamSynchronize(stream);\n    cudaDeviceSynchronize();\n}\n\nint main(int argc, char **argv)\n{\n    cudaSetDevice(DEVICE);\n    std::string wtsPath = \"\";\n    std::string engine_name = \"\";\n    int width;\n    std::string img_dir;\n    // parse args\n    if (!parse_args(argc, argv, wtsPath, engine_name, width, img_dir))\n    {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./hrnet -s [.wts] [.engine] [18 or 32 or 48]  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./hrnet -d [.engine] ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n    // create a model using the API 
directly and serialize it to a stream\n    if (!wtsPath.empty())\n    {\n        IHostMemory *modelStream{nullptr};\n        APIToModel(BATCH_SIZE, &modelStream, wtsPath, width);\n        assert(modelStream != nullptr);\n        std::ofstream p(engine_name, std::ios::binary);\n        if (!p)\n        {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char *>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    }\n\n    // deserialize the .engine and run inference\n    char *trtModelStream{nullptr};\n    size_t size{0};\n    std::ifstream file(engine_name, std::ios::binary);\n    if (file.good())\n    {\n        file.seekg(0, file.end);\n        size = file.tellg();\n        file.seekg(0, file.beg);\n        trtModelStream = new char[size];\n        assert(trtModelStream);\n        file.read(trtModelStream, size);\n        file.close();\n    }\n    else\n    {\n        // without a plan file there is nothing to deserialize, so stop here\n        std::cerr << \"could not open plan file\" << std::endl;\n        return -1;\n    }\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0)\n    {\n        std::cout << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n    // prepare input data ---------------------------\n    cudaSetDeviceFlags(cudaDeviceMapHost);\n    float *data;\n    int *prob; // using int. 
output is index\n    CHECK(cudaHostAlloc((void **)&data, BATCH_SIZE * 3 * INPUT_H * INPUT_W * sizeof(float), cudaHostAllocMapped));\n    CHECK(cudaHostAlloc((void **)&prob, BATCH_SIZE * OUTPUT_SIZE * sizeof(int), cudaHostAllocMapped));\n\n    IRuntime *runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext *context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n    void *buffers[2];\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    for (int f = 0; f < (int)file_names.size(); f++)\n    {\n        std::cout << file_names[f] << std::endl;\n        cv::Mat pr_img;\n        cv::Mat img_BGR = cv::imread(img_dir + \"/\" + file_names[f], 1); // BGR\n        // skip unreadable files before cvtColor, which fails on an empty Mat\n        if (img_BGR.empty())\n            continue;\n        cv::Mat img;\n        cv::cvtColor(img_BGR, img, cv::COLOR_BGR2RGB);\n        cv::resize(img, pr_img, cv::Size(INPUT_W, INPUT_H));\n        img = pr_img.clone(); // for img show\n        pr_img.convertTo(pr_img, CV_32FC3);\n        if (!pr_img.isContinuous())\n        {\n            pr_img = pr_img.clone();\n        }\n        std::memcpy(data, pr_img.data, BATCH_SIZE * 3 * INPUT_W * INPUT_H * sizeof(float));\n\n        cudaHostGetDevicePointer((void **)&buffers[inputIndex], (void *)data, 0);  // buffers[inputIndex]-->data\n        cudaHostGetDevicePointer((void **)&buffers[outputIndex], (void *)prob, 0); // buffers[outputIndex] --> prob\n\n        // Run inference\n        
auto start = std::chrono::high_resolution_clock::now();\n        doInference(*context, stream, buffers, BATCH_SIZE);\n        auto end = std::chrono::high_resolution_clock::now();\n        std::cout << \"infer time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n        cv::Mat outimg(INPUT_H, INPUT_W, CV_8UC1);\n        for (int row = 0; row < INPUT_H; ++row)\n        {\n            uchar *uc_pixel = outimg.data + row * outimg.step;\n            for (int col = 0; col < INPUT_W; ++col)\n            {\n                uc_pixel[col] = (uchar)prob[row * INPUT_W + col];\n            }\n        }\n        cv::Mat im_color;\n        cv::cvtColor(outimg, im_color, cv::COLOR_GRAY2RGB);\n        cv::Mat lut = createLTU(NUM_CLASSES);\n        cv::LUT(im_color, lut, im_color);\n        // false color\n        cv::cvtColor(im_color, im_color, cv::COLOR_RGB2GRAY);\n        cv::applyColorMap(im_color, im_color, cv::COLORMAP_HOT);\n        // cv::imshow(\"False Color Map\", im_color);\n        cv::imwrite(std::to_string(f) + \"_false_color_map.png\", im_color);\n        //fusion\n        cv::Mat fusionImg;\n        cv::addWeighted(img, 1, im_color, 0.8, 1, fusionImg);\n        // cv::imshow(\"Fusion Img\", fusionImg);\n        // cv::waitKey(0);\n        cv::imwrite(std::to_string(f) + \"_fusion_img.png\", fusionImg);\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    // free the host pointers returned by cudaHostAlloc; buffers[] is unset when no image was processed\n    CHECK(cudaFreeHost(data));\n    CHECK(cudaFreeHost(prob));\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n    return 0;\n}\n"
  },
  {
    "path": "hrnet/hrnet-semantic-segmentation/hrnet_ocr.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include \"common.hpp\"\n#include \"logging.h\"\n\nstatic Logger gLogger;\n#define USE_FP32\n#define DEVICE 0     // GPU id\n#define BATCH_SIZE 1 //\n\nconst char *INPUT_BLOB_NAME = \"data\";\nconst char *OUTPUT_BLOB_NAME = \"output\";\nstatic const int INPUT_H = 512;\nstatic const int INPUT_W = 1024;\nstatic const int NUM_CLASSES = 19;\nstatic const int OUTPUT_SIZE = INPUT_H * INPUT_W;\n\n// Creat the engine using only the API and not any parser.\nICudaEngine *createEngine(unsigned int maxBatchSize, IBuilder *builder, IBuilderConfig *config, DataType dt, std::string wtsPath, int width)\n{\n    INetworkDefinition *network = builder->createNetworkV2(0U);\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\n    ITensor *data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{INPUT_H, INPUT_W, 3});\n    assert(data);\n\n    // hwc to chw\n    auto ps = network->addShuffle(*data);\n    ps->setFirstTranspose(nvinfer1::Permutation{2, 0, 1});\n    float mean[3] = {0.485, 0.456, 0.406};\n    float std[3] = {0.229, 0.224, 0.225};\n    ITensor *preinput = MeanStd(network, ps->getOutput(0), mean, std, true);\n\n    std::map<std::string, Weights> weightMap = loadWeights(wtsPath);\n    auto relu_2 = convBnRelu(network, weightMap, *preinput, 64, 3, 2, 1, \"conv1\", \"bn1\");\n    auto relu_5 = convBnRelu(network, weightMap, *relu_2->getOutput(0), 64, 3, 2, 1, \"conv2\", \"bn2\");\n    auto relu_17 = ResBlock2Conv(network, weightMap, *relu_5->getOutput(0), 64, 256, 1, \"layer1.0\");\n    auto relu_27 = ResBlock(network, weightMap, *relu_17->getOutput(0), 256, 64, 1, \"layer1.1\");\n    auto relu_37 = ResBlock(network, weightMap, *relu_27->getOutput(0), 256, 64, 1, \"layer1.2\");\n    auto relu_47 = ResBlock(network, weightMap, *relu_37->getOutput(0), 256, 64, 1, \"layer1.3\");\n\n    auto relu_50 = convBnRelu(network, 
weightMap, *relu_47->getOutput(0), width, 3, 1, 1, \"transition1.0.0\", \"transition1.0.1\");\n    auto relu_60 = liteResBlock(network, weightMap, *relu_50->getOutput(0), width, \"stage2.0.branches.0.0\");\n    auto relu_67 = liteResBlock(network, weightMap, *relu_60->getOutput(0), width, \"stage2.0.branches.0.1\");\n    auto relu_74 = liteResBlock(network, weightMap, *relu_67->getOutput(0), width, \"stage2.0.branches.0.2\");\n    auto relu_81 = liteResBlock(network, weightMap, *relu_74->getOutput(0), width, \"stage2.0.branches.0.3\");\n\n    auto relu_53 = convBnRelu(network, weightMap, *relu_47->getOutput(0), width * 2, 3, 2, 1, \"transition1.1.0.0\", \"transition1.1.0.1\");\n    auto relu_88 = liteResBlock(network, weightMap, *relu_53->getOutput(0), width * 2, \"stage2.0.branches.1.0\");\n    auto relu_95 = liteResBlock(network, weightMap, *relu_88->getOutput(0), width * 2, \"stage2.0.branches.1.1\");\n    auto relu_102 = liteResBlock(network, weightMap, *relu_95->getOutput(0), width * 2, \"stage2.0.branches.1.2\");\n    auto relu_109 = liteResBlock(network, weightMap, *relu_102->getOutput(0), width * 2, \"stage2.0.branches.1.3\");\n\n    auto add_131 = convBnUpAdd(network, weightMap, *relu_109->getOutput(0), *relu_81->getOutput(0), width, 1, 1, 0, \"stage2.0.fuse_layers.0.1.0\", \"stage2.0.fuse_layers.0.1.1\", true);\n    auto relu_132 = network->addActivation(*add_131->getOutput(0), ActivationType::kRELU);\n\n    auto add_135 = convBnUpAdd(network, weightMap, *relu_81->getOutput(0), *relu_109->getOutput(0), width * 2, 3, 2, 1, \"stage2.0.fuse_layers.1.0.0.0\", \"stage2.0.fuse_layers.1.0.0.1\", false);\n    auto relu_136 = network->addActivation(*add_135->getOutput(0), ActivationType::kRELU);\n\n    auto relu_146 = liteResBlock(network, weightMap, *relu_132->getOutput(0), width, \"stage3.0.branches.0.0\");\n    auto relu_153 = liteResBlock(network, weightMap, *relu_146->getOutput(0), width, \"stage3.0.branches.0.1\");\n    auto relu_160 = liteResBlock(network, 
weightMap, *relu_153->getOutput(0), width, \"stage3.0.branches.0.2\");\n    auto relu_167 = liteResBlock(network, weightMap, *relu_160->getOutput(0), width, \"stage3.0.branches.0.3\");\n\n    auto relu_174 = liteResBlock(network, weightMap, *relu_136->getOutput(0), width * 2, \"stage3.0.branches.1.0\");\n    auto relu_181 = liteResBlock(network, weightMap, *relu_174->getOutput(0), width * 2, \"stage3.0.branches.1.1\");\n    auto relu_188 = liteResBlock(network, weightMap, *relu_181->getOutput(0), width * 2, \"stage3.0.branches.1.2\");\n    auto relu_195 = liteResBlock(network, weightMap, *relu_188->getOutput(0), width * 2, \"stage3.0.branches.1.3\");\n\n    auto relu_139 = convBnRelu(network, weightMap, *relu_136->getOutput(0), width * 4, 3, 2, 1, \"transition2.2.0.0\", \"transition2.2.0.1\");\n    auto relu_202 = liteResBlock(network, weightMap, *relu_139->getOutput(0), width * 4, \"stage3.0.branches.2.0\");\n    auto relu_209 = liteResBlock(network, weightMap, *relu_202->getOutput(0), width * 4, \"stage3.0.branches.2.1\");\n    auto relu_216 = liteResBlock(network, weightMap, *relu_209->getOutput(0), width * 4, \"stage3.0.branches.2.2\");\n    auto relu_223 = liteResBlock(network, weightMap, *relu_216->getOutput(0), width * 4, \"stage3.0.branches.2.3\");\n\n    auto add_245 = convBnUpAdd(network, weightMap, *relu_195->getOutput(0), *relu_167->getOutput(0), width, 1, 1, 0, \"stage3.0.fuse_layers.0.1.0\", \"stage3.0.fuse_layers.0.1.1\", true);\n    auto add_267 = convBnUpAdd(network, weightMap, *relu_223->getOutput(0), *add_245->getOutput(0), width, 1, 1, 0, \"stage3.0.fuse_layers.0.2.0\", \"stage3.0.fuse_layers.0.2.1\", true);\n    auto relu_268 = network->addActivation(*add_267->getOutput(0), ActivationType::kRELU);\n\n    auto add_271 = convBnUpAdd(network, weightMap, *relu_167->getOutput(0), *relu_195->getOutput(0), width * 2, 3, 2, 1, \"stage3.0.fuse_layers.1.0.0.0\", \"stage3.0.fuse_layers.1.0.0.1\", false);\n    auto add_293 = convBnUpAdd(network, weightMap, 
*relu_223->getOutput(0), *add_271->getOutput(0), width * 2, 1, 1, 0, \"stage3.0.fuse_layers.1.2.0\", \"stage3.0.fuse_layers.1.2.1\", true);\n    auto relu_294 = network->addActivation(*add_293->getOutput(0), ActivationType::kRELU);\n\n    auto relu_297 = convBnRelu(network, weightMap, *relu_167->getOutput(0), width, 3, 2, 1, \"stage3.0.fuse_layers.2.0.0.0\", \"stage3.0.fuse_layers.2.0.0.1\");\n    auto bn_299 = convBnRelu(network, weightMap, *relu_297->getOutput(0), width * 4, 3, 2, 1, \"stage3.0.fuse_layers.2.0.1.0\", \"stage3.0.fuse_layers.2.0.1.1\", false);\n    auto add_302 = convBnUpAdd(network, weightMap, *relu_195->getOutput(0), *bn_299->getOutput(0), width * 4, 3, 2, 1, \"stage3.0.fuse_layers.2.1.0.0\", \"stage3.0.fuse_layers.2.1.0.1\", false);\n    auto add_303 = network->addElementWise(*add_302->getOutput(0), *relu_223->getOutput(0), ElementWiseOperation::kSUM);\n    auto relu_304 = network->addActivation(*add_303->getOutput(0), ActivationType::kRELU);\n\n    auto relu_311 = liteResBlock(network, weightMap, *relu_268->getOutput(0), width, \"stage3.1.branches.0.0\");\n    auto relu_318 = liteResBlock(network, weightMap, *relu_311->getOutput(0), width, \"stage3.1.branches.0.1\");\n    auto relu_325 = liteResBlock(network, weightMap, *relu_318->getOutput(0), width, \"stage3.1.branches.0.2\");\n    auto relu_332 = liteResBlock(network, weightMap, *relu_325->getOutput(0), width, \"stage3.1.branches.0.3\");\n\n    auto relu_339 = liteResBlock(network, weightMap, *relu_294->getOutput(0), width * 2, \"stage3.1.branches.1.0\");\n    auto relu_346 = liteResBlock(network, weightMap, *relu_339->getOutput(0), width * 2, \"stage3.1.branches.1.1\");\n    auto relu_353 = liteResBlock(network, weightMap, *relu_346->getOutput(0), width * 2, \"stage3.1.branches.1.2\");\n    auto relu_360 = liteResBlock(network, weightMap, *relu_353->getOutput(0), width * 2, \"stage3.1.branches.1.3\");\n\n    auto relu_367 = liteResBlock(network, weightMap, *relu_304->getOutput(0), width * 
4, \"stage3.1.branches.2.0\");\n    auto relu_374 = liteResBlock(network, weightMap, *relu_367->getOutput(0), width * 4, \"stage3.1.branches.2.1\");\n    auto relu_381 = liteResBlock(network, weightMap, *relu_374->getOutput(0), width * 4, \"stage3.1.branches.2.2\");\n    auto relu_388 = liteResBlock(network, weightMap, *relu_381->getOutput(0), width * 4, \"stage3.1.branches.2.3\");\n\n    auto add_410 = convBnUpAdd(network, weightMap, *relu_360->getOutput(0), *relu_332->getOutput(0), width, 1, 1, 0, \"stage3.1.fuse_layers.0.1.0\", \"stage3.1.fuse_layers.0.1.1\", true);\n    auto add_432 = convBnUpAdd(network, weightMap, *relu_388->getOutput(0), *add_410->getOutput(0), width, 1, 1, 0, \"stage3.1.fuse_layers.0.2.0\", \"stage3.1.fuse_layers.0.2.1\", true);\n    auto relu_433 = network->addActivation(*add_432->getOutput(0), ActivationType::kRELU);\n\n    auto add_436 = convBnUpAdd(network, weightMap, *relu_332->getOutput(0), *relu_360->getOutput(0), width * 2, 3, 2, 1, \"stage3.1.fuse_layers.1.0.0.0\", \"stage3.1.fuse_layers.1.0.0.1\", false);\n    auto add_458 = convBnUpAdd(network, weightMap, *relu_388->getOutput(0), *add_436->getOutput(0), width * 2, 1, 1, 0, \"stage3.1.fuse_layers.1.2.0\", \"stage3.1.fuse_layers.1.2.1\", true);\n    auto relu_459 = network->addActivation(*add_458->getOutput(0), ActivationType::kRELU);\n\n    auto relu_462 = convBnRelu(network, weightMap, *relu_332->getOutput(0), width, 3, 2, 1, \"stage3.1.fuse_layers.2.0.0.0\", \"stage3.1.fuse_layers.2.0.0.1\");\n    auto bn_464 = convBnRelu(network, weightMap, *relu_462->getOutput(0), width * 4, 3, 2, 1, \"stage3.1.fuse_layers.2.0.1.0\", \"stage3.1.fuse_layers.2.0.1.1\", false);\n    auto add_467 = convBnUpAdd(network, weightMap, *relu_360->getOutput(0), *bn_464->getOutput(0), width * 4, 3, 2, 1, \"stage3.1.fuse_layers.2.1.0.0\", \"stage3.1.fuse_layers.2.1.0.1\", false);\n    auto add_468 = network->addElementWise(*add_467->getOutput(0), *relu_388->getOutput(0), ElementWiseOperation::kSUM);\n    
auto relu_469 = network->addActivation(*add_468->getOutput(0), ActivationType::kRELU);\n\n    auto relu_476 = liteResBlock(network, weightMap, *relu_433->getOutput(0), width, \"stage3.2.branches.0.0\");\n    auto relu_483 = liteResBlock(network, weightMap, *relu_476->getOutput(0), width, \"stage3.2.branches.0.1\");\n    auto relu_490 = liteResBlock(network, weightMap, *relu_483->getOutput(0), width, \"stage3.2.branches.0.2\");\n    auto relu_497 = liteResBlock(network, weightMap, *relu_490->getOutput(0), width, \"stage3.2.branches.0.3\");\n\n    auto relu_504 = liteResBlock(network, weightMap, *relu_459->getOutput(0), width * 2, \"stage3.2.branches.1.0\");\n    auto relu_511 = liteResBlock(network, weightMap, *relu_504->getOutput(0), width * 2, \"stage3.2.branches.1.1\");\n    auto relu_518 = liteResBlock(network, weightMap, *relu_511->getOutput(0), width * 2, \"stage3.2.branches.1.2\");\n    auto relu_525 = liteResBlock(network, weightMap, *relu_518->getOutput(0), width * 2, \"stage3.2.branches.1.3\");\n\n    auto relu_532 = liteResBlock(network, weightMap, *relu_469->getOutput(0), width * 4, \"stage3.2.branches.2.0\");\n    auto relu_539 = liteResBlock(network, weightMap, *relu_532->getOutput(0), width * 4, \"stage3.2.branches.2.1\");\n    auto relu_546 = liteResBlock(network, weightMap, *relu_539->getOutput(0), width * 4, \"stage3.2.branches.2.2\");\n    auto relu_553 = liteResBlock(network, weightMap, *relu_546->getOutput(0), width * 4, \"stage3.2.branches.2.3\");\n\n    auto add_575 = convBnUpAdd(network, weightMap, *relu_525->getOutput(0), *relu_497->getOutput(0), width, 1, 1, 0, \"stage3.2.fuse_layers.0.1.0\", \"stage3.2.fuse_layers.0.1.1\", true);\n    auto add_597 = convBnUpAdd(network, weightMap, *relu_553->getOutput(0), *add_575->getOutput(0), width, 1, 1, 0, \"stage3.2.fuse_layers.0.2.0\", \"stage3.2.fuse_layers.0.2.1\", true);\n\n    auto relu_598 = network->addActivation(*add_597->getOutput(0), ActivationType::kRELU);\n\n    auto add_601 = 
convBnUpAdd(network, weightMap, *relu_497->getOutput(0), *relu_525->getOutput(0), width * 2, 3, 2, 1, \"stage3.2.fuse_layers.1.0.0.0\", \"stage3.2.fuse_layers.1.0.0.1\", false);\n    auto add_623 = convBnUpAdd(network, weightMap, *relu_553->getOutput(0), *add_601->getOutput(0), width * 2, 1, 1, 0, \"stage3.2.fuse_layers.1.2.0\", \"stage3.2.fuse_layers.1.2.1\", true);\n    auto relu_624 = network->addActivation(*add_623->getOutput(0), ActivationType::kRELU);\n\n    auto relu_627 = convBnRelu(network, weightMap, *relu_497->getOutput(0), width, 3, 2, 1, \"stage3.2.fuse_layers.2.0.0.0\", \"stage3.2.fuse_layers.2.0.0.1\");\n    auto bn_629 = convBnRelu(network, weightMap, *relu_627->getOutput(0), width * 4, 3, 2, 1, \"stage3.2.fuse_layers.2.0.1.0\", \"stage3.2.fuse_layers.2.0.1.1\", false);\n    auto add_632 = convBnUpAdd(network, weightMap, *relu_525->getOutput(0), *bn_629->getOutput(0), width * 4, 3, 2, 1, \"stage3.2.fuse_layers.2.1.0.0\", \"stage3.2.fuse_layers.2.1.0.1\", false);\n    auto add_633 = network->addElementWise(*relu_553->getOutput(0), *add_632->getOutput(0), ElementWiseOperation::kSUM);\n    auto relu_634 = network->addActivation(*add_633->getOutput(0), ActivationType::kRELU);\n\n    auto relu_641 = liteResBlock(network, weightMap, *relu_598->getOutput(0), width, \"stage3.3.branches.0.0\");\n    auto relu_648 = liteResBlock(network, weightMap, *relu_641->getOutput(0), width, \"stage3.3.branches.0.1\");\n    auto relu_655 = liteResBlock(network, weightMap, *relu_648->getOutput(0), width, \"stage3.3.branches.0.2\");\n    auto relu_662 = liteResBlock(network, weightMap, *relu_655->getOutput(0), width, \"stage3.3.branches.0.3\");\n\n    auto relu_669 = liteResBlock(network, weightMap, *relu_624->getOutput(0), width * 2, \"stage3.3.branches.1.0\");\n    auto relu_676 = liteResBlock(network, weightMap, *relu_669->getOutput(0), width * 2, \"stage3.3.branches.1.1\");\n    auto relu_683 = liteResBlock(network, weightMap, *relu_676->getOutput(0), width * 2, 
\"stage3.3.branches.1.2\");\n    auto relu_690 = liteResBlock(network, weightMap, *relu_683->getOutput(0), width * 2, \"stage3.3.branches.1.3\");\n\n    auto relu_697 = liteResBlock(network, weightMap, *relu_634->getOutput(0), width * 4, \"stage3.3.branches.2.0\");\n    auto relu_704 = liteResBlock(network, weightMap, *relu_697->getOutput(0), width * 4, \"stage3.3.branches.2.1\");\n    auto relu_711 = liteResBlock(network, weightMap, *relu_704->getOutput(0), width * 4, \"stage3.3.branches.2.2\");\n    auto relu_718 = liteResBlock(network, weightMap, *relu_711->getOutput(0), width * 4, \"stage3.3.branches.2.3\");\n\n    auto add_740 = convBnUpAdd(network, weightMap, *relu_690->getOutput(0), *relu_662->getOutput(0), width, 1, 1, 0, \"stage3.3.fuse_layers.0.1.0\", \"stage3.3.fuse_layers.0.1.1\", true);\n    auto add_762 = convBnUpAdd(network, weightMap, *relu_718->getOutput(0), *add_740->getOutput(0), width, 1, 1, 0, \"stage3.3.fuse_layers.0.2.0\", \"stage3.3.fuse_layers.0.2.1\", true);\n    auto relu_763 = network->addActivation(*add_762->getOutput(0), ActivationType::kRELU);\n\n    auto add_766 = convBnUpAdd(network, weightMap, *relu_662->getOutput(0), *relu_690->getOutput(0), width * 2, 3, 2, 1, \"stage3.3.fuse_layers.1.0.0.0\", \"stage3.3.fuse_layers.1.0.0.1\", false);\n    auto add_788 = convBnUpAdd(network, weightMap, *relu_718->getOutput(0), *add_766->getOutput(0), width * 2, 1, 1, 0, \"stage3.3.fuse_layers.1.2.0\", \"stage3.3.fuse_layers.1.2.1\", true);\n    auto relu_789 = network->addActivation(*add_788->getOutput(0), ActivationType::kRELU);\n\n    auto relu_792 = convBnRelu(network, weightMap, *relu_662->getOutput(0), width, 3, 2, 1, \"stage3.3.fuse_layers.2.0.0.0\", \"stage3.3.fuse_layers.2.0.0.1\");\n    auto bn_794 = convBnRelu(network, weightMap, *relu_792->getOutput(0), width * 4, 3, 2, 1, \"stage3.3.fuse_layers.2.0.1.0\", \"stage3.3.fuse_layers.2.0.1.1\", false);\n    auto add_797 = convBnUpAdd(network, weightMap, *relu_690->getOutput(0), 
*bn_794->getOutput(0), width * 4, 3, 2, 1, \"stage3.3.fuse_layers.2.1.0.0\", \"stage3.3.fuse_layers.2.1.0.1\", false);\n    auto add_798 = network->addElementWise(*relu_718->getOutput(0), *add_797->getOutput(0), ElementWiseOperation::kSUM);\n    auto relu_799 = network->addActivation(*add_798->getOutput(0), ActivationType::kRELU);\n\n    auto relu_809 = liteResBlock(network, weightMap, *relu_763->getOutput(0), width, \"stage4.0.branches.0.0\");\n    auto relu_816 = liteResBlock(network, weightMap, *relu_809->getOutput(0), width, \"stage4.0.branches.0.1\");\n    auto relu_823 = liteResBlock(network, weightMap, *relu_816->getOutput(0), width, \"stage4.0.branches.0.2\");\n    auto relu_830 = liteResBlock(network, weightMap, *relu_823->getOutput(0), width, \"stage4.0.branches.0.3\");\n\n    auto relu_837 = liteResBlock(network, weightMap, *relu_789->getOutput(0), width * 2, \"stage4.0.branches.1.0\");\n    auto relu_844 = liteResBlock(network, weightMap, *relu_837->getOutput(0), width * 2, \"stage4.0.branches.1.1\");\n    auto relu_851 = liteResBlock(network, weightMap, *relu_844->getOutput(0), width * 2, \"stage4.0.branches.1.2\");\n    auto relu_858 = liteResBlock(network, weightMap, *relu_851->getOutput(0), width * 2, \"stage4.0.branches.1.3\");\n\n    auto relu_865 = liteResBlock(network, weightMap, *relu_799->getOutput(0), width * 4, \"stage4.0.branches.2.0\");\n    auto relu_872 = liteResBlock(network, weightMap, *relu_865->getOutput(0), width * 4, \"stage4.0.branches.2.1\");\n    auto relu_879 = liteResBlock(network, weightMap, *relu_872->getOutput(0), width * 4, \"stage4.0.branches.2.2\");\n    auto relu_886 = liteResBlock(network, weightMap, *relu_879->getOutput(0), width * 4, \"stage4.0.branches.2.3\"); //========\n\n    auto relu_802 = convBnRelu(network, weightMap, *relu_799->getOutput(0), width * 8, 3, 2, 1, \"transition3.3.0.0\", \"transition3.3.0.1\");\n    auto relu_893 = liteResBlock(network, weightMap, *relu_802->getOutput(0), width * 8, 
\"stage4.0.branches.3.0\");\n    auto relu_900 = liteResBlock(network, weightMap, *relu_893->getOutput(0), width * 8, \"stage4.0.branches.3.1\");\n    auto relu_907 = liteResBlock(network, weightMap, *relu_900->getOutput(0), width * 8, \"stage4.0.branches.3.2\");\n    auto relu_914 = liteResBlock(network, weightMap, *relu_907->getOutput(0), width * 8, \"stage4.0.branches.3.3\");\n\n    auto add_936 = convBnUpAdd(network, weightMap, *relu_858->getOutput(0), *relu_830->getOutput(0), width, 1, 1, 0, \"stage4.0.fuse_layers.0.1.0\", \"stage4.0.fuse_layers.0.1.1\", true);\n    auto add_958 = convBnUpAdd(network, weightMap, *relu_886->getOutput(0), *add_936->getOutput(0), width, 1, 1, 0, \"stage4.0.fuse_layers.0.2.0\", \"stage4.0.fuse_layers.0.2.1\", true);\n    auto add_980 = convBnUpAdd(network, weightMap, *relu_914->getOutput(0), *add_958->getOutput(0), width, 1, 1, 0, \"stage4.0.fuse_layers.0.3.0\", \"stage4.0.fuse_layers.0.3.1\", true);\n    auto relu_981 = network->addActivation(*add_980->getOutput(0), ActivationType::kRELU);\n\n    auto add_984 = convBnUpAdd(network, weightMap, *relu_830->getOutput(0), *relu_858->getOutput(0), width * 2, 3, 2, 1, \"stage4.0.fuse_layers.1.0.0.0\", \"stage4.0.fuse_layers.1.0.0.1\", false);\n    auto add_1006 = convBnUpAdd(network, weightMap, *relu_886->getOutput(0), *add_984->getOutput(0), width * 2, 1, 1, 0, \"stage4.0.fuse_layers.1.2.0\", \"stage4.0.fuse_layers.1.2.1\", true);\n    auto add_1028 = convBnUpAdd(network, weightMap, *relu_914->getOutput(0), *add_1006->getOutput(0), width * 2, 1, 1, 0, \"stage4.0.fuse_layers.1.3.0\", \"stage4.0.fuse_layers.1.3.1\", true);\n    auto relu_1029 = network->addActivation(*add_1028->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1032 = convBnRelu(network, weightMap, *relu_830->getOutput(0), width, 3, 2, 1, \"stage4.0.fuse_layers.2.0.0.0\", \"stage4.0.fuse_layers.2.0.0.1\");\n    auto bn_1034 = convBnRelu(network, weightMap, *relu_1032->getOutput(0), width * 4, 3, 2, 1, 
\"stage4.0.fuse_layers.2.0.1.0\", \"stage4.0.fuse_layers.2.0.1.1\", false);\n\n    auto add_1037 = convBnUpAdd(network, weightMap, *relu_858->getOutput(0), *bn_1034->getOutput(0), width * 4, 3, 2, 1,\n                                \"stage4.0.fuse_layers.2.1.0.0\", \"stage4.0.fuse_layers.2.1.0.1\", false);\n    auto add_1038 = network->addElementWise(*relu_886->getOutput(0), *add_1037->getOutput(0), ElementWiseOperation::kSUM);\n    auto add_1060 = convBnUpAdd(network, weightMap, *relu_914->getOutput(0), *add_1038->getOutput(0), width * 4, 1, 1, 0,\n                                \"stage4.0.fuse_layers.2.3.0\", \"stage4.0.fuse_layers.2.3.1\", true);\n    auto relu_1061 = network->addActivation(*add_1060->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1064 = convBnRelu(network, weightMap, *relu_830->getOutput(0), width, 3, 2, 1, \"stage4.0.fuse_layers.3.0.0.0\", \"stage4.0.fuse_layers.3.0.0.1\");\n    auto relu_1067 = convBnRelu(network, weightMap, *relu_1064->getOutput(0), width, 3, 2, 1, \"stage4.0.fuse_layers.3.0.1.0\", \"stage4.0.fuse_layers.3.0.1.1\");\n    auto bn_1069 = convBnRelu(network, weightMap, *relu_1067->getOutput(0), width * 8, 3, 2, 1, \"stage4.0.fuse_layers.3.0.2.0\", \"stage4.0.fuse_layers.3.0.2.1\", false);\n    auto relu_1072 = convBnRelu(network, weightMap, *relu_858->getOutput(0), width * 2, 3, 2, 1, \"stage4.0.fuse_layers.3.1.0.0\", \"stage4.0.fuse_layers.3.1.0.1\");\n    auto add_1075 = convBnUpAdd(network, weightMap, *relu_1072->getOutput(0), *bn_1069->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.0.fuse_layers.3.1.1.0\", \"stage4.0.fuse_layers.3.1.1.1\", false);\n    auto add_1078 = convBnUpAdd(network, weightMap, *relu_886->getOutput(0), *add_1075->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.0.fuse_layers.3.2.0.0\", \"stage4.0.fuse_layers.3.2.0.1\", false);\n    auto add_1079 = network->addElementWise(*relu_914->getOutput(0), *add_1078->getOutput(0), 
ElementWiseOperation::kSUM);\n    auto relu_1080 = network->addActivation(*add_1079->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1087 = liteResBlock(network, weightMap, *relu_981->getOutput(0), width, \"stage4.1.branches.0.0\");\n    auto relu_1094 = liteResBlock(network, weightMap, *relu_1087->getOutput(0), width, \"stage4.1.branches.0.1\");\n    auto relu_1101 = liteResBlock(network, weightMap, *relu_1094->getOutput(0), width, \"stage4.1.branches.0.2\");\n    auto relu_1108 = liteResBlock(network, weightMap, *relu_1101->getOutput(0), width, \"stage4.1.branches.0.3\");\n\n    auto relu_1115 = liteResBlock(network, weightMap, *relu_1029->getOutput(0), width * 2, \"stage4.1.branches.1.0\");\n    auto relu_1122 = liteResBlock(network, weightMap, *relu_1115->getOutput(0), width * 2, \"stage4.1.branches.1.1\");\n    auto relu_1129 = liteResBlock(network, weightMap, *relu_1122->getOutput(0), width * 2, \"stage4.1.branches.1.2\");\n    auto relu_1136 = liteResBlock(network, weightMap, *relu_1129->getOutput(0), width * 2, \"stage4.1.branches.1.3\");\n\n    auto relu_1143 = liteResBlock(network, weightMap, *relu_1061->getOutput(0), width * 4, \"stage4.1.branches.2.0\");\n    auto relu_1150 = liteResBlock(network, weightMap, *relu_1143->getOutput(0), width * 4, \"stage4.1.branches.2.1\");\n    auto relu_1157 = liteResBlock(network, weightMap, *relu_1150->getOutput(0), width * 4, \"stage4.1.branches.2.2\");\n    auto relu_1164 = liteResBlock(network, weightMap, *relu_1157->getOutput(0), width * 4, \"stage4.1.branches.2.3\");\n\n    auto relu_1171 = liteResBlock(network, weightMap, *relu_1080->getOutput(0), width * 8, \"stage4.1.branches.3.0\");\n    auto relu_1178 = liteResBlock(network, weightMap, *relu_1171->getOutput(0), width * 8, \"stage4.1.branches.3.1\");\n    auto relu_1185 = liteResBlock(network, weightMap, *relu_1178->getOutput(0), width * 8, \"stage4.1.branches.3.2\");\n    auto relu_1192 = liteResBlock(network, weightMap, *relu_1185->getOutput(0), 
width * 8, \"stage4.1.branches.3.3\");\n\n    auto add_1214 = convBnUpAdd(network, weightMap, *relu_1136->getOutput(0), *relu_1108->getOutput(0), width, 1, 1, 0,\n                                \"stage4.1.fuse_layers.0.1.0\", \"stage4.1.fuse_layers.0.1.1\", true);\n    auto add_1236 = convBnUpAdd(network, weightMap, *relu_1164->getOutput(0), *add_1214->getOutput(0), width, 1, 1, 0,\n                                \"stage4.1.fuse_layers.0.2.0\", \"stage4.1.fuse_layers.0.2.1\", true);\n    auto add_1258 = convBnUpAdd(network, weightMap, *relu_1192->getOutput(0), *add_1236->getOutput(0), width, 1, 1, 0,\n                                \"stage4.1.fuse_layers.0.3.0\", \"stage4.1.fuse_layers.0.3.1\", true);\n    auto relu_1259 = network->addActivation(*add_1258->getOutput(0), ActivationType::kRELU);\n\n    auto add_1262 = convBnUpAdd(network, weightMap, *relu_1108->getOutput(0), *relu_1136->getOutput(0), width * 2, 3, 2, 1,\n                                \"stage4.1.fuse_layers.1.0.0.0\", \"stage4.1.fuse_layers.1.0.0.1\", false);\n    auto add_1284 = convBnUpAdd(network, weightMap, *relu_1164->getOutput(0), *add_1262->getOutput(0), width * 2, 1, 1, 0,\n                                \"stage4.1.fuse_layers.1.2.0\", \"stage4.1.fuse_layers.1.2.1\", true);\n    auto add_1306 = convBnUpAdd(network, weightMap, *relu_1192->getOutput(0), *add_1284->getOutput(0), width * 2, 1, 1, 0,\n                                \"stage4.1.fuse_layers.1.3.0\", \"stage4.1.fuse_layers.1.3.1\", true);\n    auto relu_1307 = network->addActivation(*add_1306->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1310 = convBnRelu(network, weightMap, *relu_1108->getOutput(0), width, 3, 2, 1, \"stage4.1.fuse_layers.2.0.0.0\", \"stage4.1.fuse_layers.2.0.0.1\");\n    auto bn_1312 = convBnRelu(network, weightMap, *relu_1310->getOutput(0), width * 4, 3, 2, 1, \"stage4.1.fuse_layers.2.0.1.0\", \"stage4.1.fuse_layers.2.0.1.1\", false);\n    auto add_1315 = convBnUpAdd(network, weightMap, 
*relu_1136->getOutput(0), *bn_1312->getOutput(0), width * 4, 3, 2, 1,\n                                \"stage4.1.fuse_layers.2.1.0.0\", \"stage4.1.fuse_layers.2.1.0.1\", false);\n    auto add_1316 = network->addElementWise(*relu_1164->getOutput(0), *add_1315->getOutput(0), ElementWiseOperation::kSUM);\n    auto add_1338 = convBnUpAdd(network, weightMap, *relu_1192->getOutput(0), *add_1316->getOutput(0), width * 4, 1, 1, 0,\n                                \"stage4.1.fuse_layers.2.3.0\", \"stage4.1.fuse_layers.2.3.1\", true);\n    auto relu_1339 = network->addActivation(*add_1338->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1342 = convBnRelu(network, weightMap, *relu_1108->getOutput(0), width, 3, 2, 1, \"stage4.1.fuse_layers.3.0.0.0\", \"stage4.1.fuse_layers.3.0.0.1\");\n    auto relu_1345 = convBnRelu(network, weightMap, *relu_1342->getOutput(0), width, 3, 2, 1, \"stage4.1.fuse_layers.3.0.1.0\", \"stage4.1.fuse_layers.3.0.1.1\");\n    auto bn_1347 = convBnRelu(network, weightMap, *relu_1345->getOutput(0), width * 8, 3, 2, 1, \"stage4.1.fuse_layers.3.0.2.0\", \"stage4.1.fuse_layers.3.0.2.1\", false);\n    auto relu_1350 = convBnRelu(network, weightMap, *relu_1136->getOutput(0), width * 2, 3, 2, 1, \"stage4.1.fuse_layers.3.1.0.0\", \"stage4.1.fuse_layers.3.1.0.1\");\n    auto add_1353 = convBnUpAdd(network, weightMap, *relu_1350->getOutput(0), *bn_1347->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.1.fuse_layers.3.1.1.0\", \"stage4.1.fuse_layers.3.1.1.1\", false);\n    auto add_1356 = convBnUpAdd(network, weightMap, *relu_1164->getOutput(0), *add_1353->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.1.fuse_layers.3.2.0.0\", \"stage4.1.fuse_layers.3.2.0.1\", false);\n    auto add_1357 = network->addElementWise(*relu_1192->getOutput(0), *add_1356->getOutput(0), ElementWiseOperation::kSUM);\n    auto relu_1358 = network->addActivation(*add_1357->getOutput(0), ActivationType::kRELU);\n\n    auto 
relu_1365 = liteResBlock(network, weightMap, *relu_1259->getOutput(0), width, \"stage4.2.branches.0.0\");\n    auto relu_1372 = liteResBlock(network, weightMap, *relu_1365->getOutput(0), width, \"stage4.2.branches.0.1\");\n    auto relu_1379 = liteResBlock(network, weightMap, *relu_1372->getOutput(0), width, \"stage4.2.branches.0.2\");\n    auto relu_1386 = liteResBlock(network, weightMap, *relu_1379->getOutput(0), width, \"stage4.2.branches.0.3\");\n\n    auto relu_1393 = liteResBlock(network, weightMap, *relu_1307->getOutput(0), width * 2, \"stage4.2.branches.1.0\");\n    auto relu_1400 = liteResBlock(network, weightMap, *relu_1393->getOutput(0), width * 2, \"stage4.2.branches.1.1\");\n    auto relu_1407 = liteResBlock(network, weightMap, *relu_1400->getOutput(0), width * 2, \"stage4.2.branches.1.2\");\n    auto relu_1414 = liteResBlock(network, weightMap, *relu_1407->getOutput(0), width * 2, \"stage4.2.branches.1.3\");\n\n    auto relu_1421 = liteResBlock(network, weightMap, *relu_1339->getOutput(0), width * 4, \"stage4.2.branches.2.0\");\n    auto relu_1428 = liteResBlock(network, weightMap, *relu_1421->getOutput(0), width * 4, \"stage4.2.branches.2.1\");\n    auto relu_1435 = liteResBlock(network, weightMap, *relu_1428->getOutput(0), width * 4, \"stage4.2.branches.2.2\");\n    auto relu_1442 = liteResBlock(network, weightMap, *relu_1435->getOutput(0), width * 4, \"stage4.2.branches.2.3\");\n\n    auto relu_1449 = liteResBlock(network, weightMap, *relu_1358->getOutput(0), width * 8, \"stage4.2.branches.3.0\");\n    auto relu_1456 = liteResBlock(network, weightMap, *relu_1449->getOutput(0), width * 8, \"stage4.2.branches.3.1\");\n    auto relu_1463 = liteResBlock(network, weightMap, *relu_1456->getOutput(0), width * 8, \"stage4.2.branches.3.2\");\n    auto relu_1470 = liteResBlock(network, weightMap, *relu_1463->getOutput(0), width * 8, \"stage4.2.branches.3.3\");\n\n    auto add_1492 = convBnUpAdd(network, weightMap, *relu_1414->getOutput(0), 
*relu_1386->getOutput(0), width, 1, 1, 0,\n                                \"stage4.2.fuse_layers.0.1.0\", \"stage4.2.fuse_layers.0.1.1\", true);\n    auto add_1514 = convBnUpAdd(network, weightMap, *relu_1442->getOutput(0), *add_1492->getOutput(0), width, 1, 1, 0,\n                                \"stage4.2.fuse_layers.0.2.0\", \"stage4.2.fuse_layers.0.2.1\", true);\n\n    auto add_1536 = convBnUpAdd(network, weightMap, *relu_1470->getOutput(0), *add_1514->getOutput(0), width, 1, 1, 0,\n                                \"stage4.2.fuse_layers.0.3.0\", \"stage4.2.fuse_layers.0.3.1\", true);\n    auto relu_1537 = network->addActivation(*add_1536->getOutput(0), ActivationType::kRELU);\n\n    auto add_1540 = convBnUpAdd(network, weightMap, *relu_1386->getOutput(0), *relu_1414->getOutput(0),\n                                width * 2, 3, 2, 1, \"stage4.2.fuse_layers.1.0.0.0\", \"stage4.2.fuse_layers.1.0.0.1\", false);\n    auto add_1562 = convBnUpAdd(network, weightMap, *relu_1442->getOutput(0), *add_1540->getOutput(0),\n                                width * 2, 1, 1, 0, \"stage4.2.fuse_layers.1.2.0\", \"stage4.2.fuse_layers.1.2.1\", true);\n    auto add_1584 = convBnUpAdd(network, weightMap, *relu_1470->getOutput(0), *add_1562->getOutput(0),\n                                width * 2, 1, 1, 0, \"stage4.2.fuse_layers.1.3.0\", \"stage4.2.fuse_layers.1.3.1\", true);\n    auto relu_1585 = network->addActivation(*add_1584->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1588 = convBnRelu(network, weightMap, *relu_1386->getOutput(0), width, 3, 2, 1, \"stage4.2.fuse_layers.2.0.0.0\", \"stage4.2.fuse_layers.2.0.0.1\");\n    auto bn_1590 = convBnRelu(network, weightMap, *relu_1588->getOutput(0), width * 4, 3, 2, 1, \"stage4.2.fuse_layers.2.0.1.0\", \"stage4.2.fuse_layers.2.0.1.1\", false);\n    auto add_1593 = convBnUpAdd(network, weightMap, *relu_1414->getOutput(0), *bn_1590->getOutput(0), width * 4, 3, 2, 1,\n                                
\"stage4.2.fuse_layers.2.1.0.0\", \"stage4.2.fuse_layers.2.1.0.1\", false);\n    auto add_1594 = network->addElementWise(*relu_1442->getOutput(0), *add_1593->getOutput(0), ElementWiseOperation::kSUM);\n    auto add_1616 = convBnUpAdd(network, weightMap, *relu_1470->getOutput(0), *add_1594->getOutput(0), width * 4, 1, 1, 0,\n                                \"stage4.2.fuse_layers.2.3.0\", \"stage4.2.fuse_layers.2.3.1\", true);\n    auto relu_1617 = network->addActivation(*add_1616->getOutput(0), ActivationType::kRELU);\n\n    auto relu_1620 = convBnRelu(network, weightMap, *relu_1386->getOutput(0), width, 3, 2, 1, \"stage4.2.fuse_layers.3.0.0.0\", \"stage4.2.fuse_layers.3.0.0.1\");\n    auto relu_1623 = convBnRelu(network, weightMap, *relu_1620->getOutput(0), width, 3, 2, 1, \"stage4.2.fuse_layers.3.0.1.0\", \"stage4.2.fuse_layers.3.0.1.1\");\n    auto bn_1625 = convBnRelu(network, weightMap, *relu_1623->getOutput(0), width * 8, 3, 2, 1, \"stage4.2.fuse_layers.3.0.2.0\", \"stage4.2.fuse_layers.3.0.2.1\", false);\n    auto relu_1628 = convBnRelu(network, weightMap, *relu_1414->getOutput(0), width * 2, 3, 2, 1, \"stage4.2.fuse_layers.3.1.0.0\", \"stage4.2.fuse_layers.3.1.0.1\");\n    auto add_1631 = convBnUpAdd(network, weightMap, *relu_1628->getOutput(0), *bn_1625->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.2.fuse_layers.3.1.1.0\", \"stage4.2.fuse_layers.3.1.1.1\", false);\n    auto add_1634 = convBnUpAdd(network, weightMap, *relu_1442->getOutput(0), *add_1631->getOutput(0), width * 8, 3, 2, 1,\n                                \"stage4.2.fuse_layers.3.2.0.0\", \"stage4.2.fuse_layers.3.2.0.1\", false);\n    auto add_1635 = network->addElementWise(*relu_1470->getOutput(0), *add_1634->getOutput(0), ElementWiseOperation::kSUM);\n    auto relu_1636 = network->addActivation(*add_1635->getOutput(0), ActivationType::kRELU);\n\n    nvinfer1::Dims dim = relu_1537->getOutput(0)->getDimensions();\n    dim.d[0] = 
relu_1585->getOutput(0)->getDimensions().d[0];\n    auto resize_1655 = netAddUpsampleBi(network, relu_1585->getOutput(0), dim);\n    dim.d[0] = relu_1617->getOutput(0)->getDimensions().d[0];\n    auto resize_1668 = netAddUpsampleBi(network, relu_1617->getOutput(0), dim);\n    dim.d[0] = relu_1636->getOutput(0)->getDimensions().d[0];\n    auto resize_1681 = netAddUpsampleBi(network, relu_1636->getOutput(0), dim);\n\n    ITensor *concatTensors[] = {relu_1537->getOutput(0), resize_1655->getOutput(0), resize_1668->getOutput(0), resize_1681->getOutput(0)};\n    auto concat_1682 = network->addConcatenation(concatTensors, 4);\n    concat_1682->setAxis(0);\n    auto relu_1685 = convBnRelu(network, weightMap, *concat_1682->getOutput(0), width * 15, 1, 1, 0, \"aux_head.0\", \"aux_head.1\", true, true);\n    auto conv_1686 = network->addConvolutionNd(*relu_1685->getOutput(0), NUM_CLASSES, DimsHW{1, 1}, weightMap[\"aux_head.3.weight\"], weightMap[\"aux_head.3.bias\"]);\n    conv_1686->setStrideNd(DimsHW{1, 1});\n    conv_1686->setPaddingNd(DimsHW{0, 0});\n    auto reshape_1701 = network->addShuffle(*conv_1686->getOutput(0));\n    nvinfer1::Dims reshape_dim;\n    reshape_dim.nbDims = 2;\n    reshape_dim.d[0] = NUM_CLASSES;\n    reshape_dim.d[1] = -1;\n    reshape_1701->setReshapeDimensions(reshape_dim);\n\n    auto softmax_1714 = network->addSoftMax(*reshape_1701->getOutput(0));\n    softmax_1714->setAxes(2);\n\n    auto relu_1689 = convBnRelu(network, weightMap, *concat_1682->getOutput(0), 512, 3, 1, 1, \"conv3x3_ocr.0\", \"conv3x3_ocr.1\", true, true);\n\n    auto reshape_1710 = network->addShuffle(*relu_1689->getOutput(0));\n    nvinfer1::Dims reshape_dim1;\n    reshape_dim1.nbDims = 2;\n    reshape_dim1.d[0] = 512;\n    reshape_dim1.d[1] = -1;\n    reshape_1710->setReshapeDimensions(reshape_dim1);\n    nvinfer1::Permutation permutation1;\n    permutation1.order[0] = 1;\n    permutation1.order[1] = 0;\n    reshape_1710->setSecondTranspose(permutation1);\n\n    auto 
matmul_1715 = network->addMatrixMultiply(*softmax_1714->getOutput(0), MatrixOperation::kNONE,\n                                                  *reshape_1710->getOutput(0), MatrixOperation::kNONE);\n\n    auto transpose_1716 = network->addShuffle(*matmul_1715->getOutput(0));\n    nvinfer1::Permutation permutation2;\n    permutation2.order[0] = 1;\n    permutation2.order[1] = 0;\n    transpose_1716->setFirstTranspose(permutation2);\n\n    auto unsqueeze_1717 = network->addShuffle(*transpose_1716->getOutput(0));\n    nvinfer1::Dims reshape_dim3;\n    reshape_dim3.nbDims = 3;\n    reshape_dim3.d[0] = 512;\n    reshape_dim3.d[1] = NUM_CLASSES;\n    reshape_dim3.d[2] = 1;\n    unsqueeze_1717->setReshapeDimensions(reshape_dim3);\n\n    auto relu_1737 = convBnRelu(network, weightMap, *unsqueeze_1717->getOutput(0), 256, 1, 1, 0, \"ocr_distri_head.object_context_block.f_object.0\", \"ocr_distri_head.object_context_block.f_object.1.0\", true, true);\n\n    auto relu_1740 = convBnRelu(network, weightMap, *relu_1737->getOutput(0), 256, 1, 1, 0, \"ocr_distri_head.object_context_block.f_object.2\", \"ocr_distri_head.object_context_block.f_object.3.0\", true, true);\n\n    auto reshape_1747 = network->addShuffle(*relu_1740->getOutput(0));\n    nvinfer1::Dims reshape_dim4;\n    reshape_dim4.nbDims = 2;\n    reshape_dim4.d[0] = 256;\n    reshape_dim4.d[1] = -1;\n    reshape_1747->setReshapeDimensions(reshape_dim4);\n\n    auto relu_1723 = convBnRelu(network, weightMap, *relu_1689->getOutput(0), 256, 1, 1, 0, \"ocr_distri_head.object_context_block.f_pixel.0\", \"ocr_distri_head.object_context_block.f_pixel.1.0\", true, true);\n    auto relu_1726 = convBnRelu(network, weightMap, *relu_1723->getOutput(0), 256, 1, 1, 0, \"ocr_distri_head.object_context_block.f_pixel.2\", \"ocr_distri_head.object_context_block.f_pixel.3.0\", true, true);\n\n    auto reshape_1733 = network->addShuffle(*relu_1726->getOutput(0));\n    nvinfer1::Dims reshape_dim5;\n    reshape_dim5.nbDims = 2;\n    
reshape_dim5.d[0] = 256;\n    reshape_dim5.d[1] = -1;\n    reshape_1733->setReshapeDimensions(reshape_dim5);\n    nvinfer1::Permutation permutation3;\n    permutation3.order[0] = 1;\n    permutation3.order[1] = 0;\n    reshape_1733->setSecondTranspose(permutation3);\n\n    auto matmul_1759 = network->addMatrixMultiply(*reshape_1733->getOutput(0), MatrixOperation::kNONE, *reshape_1747->getOutput(0), MatrixOperation::kNONE);\n    nvinfer1::Dims constant_dim;\n    constant_dim.nbDims = 2;\n    int allNum = INPUT_H * INPUT_W / 16;\n    constant_dim.d[0] = INPUT_H * INPUT_W / 16;\n    constant_dim.d[1] = 1;\n    Weights wgt{DataType::kFLOAT, nullptr, allNum};\n    float *w = new float[allNum];\n    for (int i = 0; i < allNum; i++)\n    {\n        w[i] = 0.0625;\n    }\n    wgt.values = w;\n    auto constant_1761 = network->addConstant(constant_dim, wgt);\n\n    auto mul_1761 = network->addElementWise(*constant_1761->getOutput(0), *matmul_1759->getOutput(0), ElementWiseOperation::kPROD);\n\n    auto softmax_1762 = network->addSoftMax(*mul_1761->getOutput(0));\n    softmax_1762->setAxes(2);\n\n    auto relu_1750 = convBnRelu(network, weightMap, *unsqueeze_1717->getOutput(0), 256, 1, 1, 0, \"ocr_distri_head.object_context_block.f_down.0\", \"ocr_distri_head.object_context_block.f_down.1.0\", true, true);\n\n    auto reshape_1757 = network->addShuffle(*relu_1750->getOutput(0));\n    nvinfer1::Dims reshape_dim6;\n    reshape_dim6.nbDims = 2;\n    reshape_dim6.d[0] = 256;\n    reshape_dim6.d[1] = -1;\n    reshape_1757->setReshapeDimensions(reshape_dim6);\n    nvinfer1::Permutation permutation4;\n    permutation4.order[0] = 1;\n    permutation4.order[1] = 0;\n    reshape_1757->setSecondTranspose(permutation4);\n\n    auto matmul_1763 = network->addMatrixMultiply(*softmax_1762->getOutput(0), MatrixOperation::kNONE, *reshape_1757->getOutput(0), MatrixOperation::kNONE);\n\n    auto reshape_1777 = network->addShuffle(*matmul_1763->getOutput(0));\n    nvinfer1::Dims reshape_dim7;\n 
   reshape_dim7.nbDims = 3;\n    reshape_dim7.d[0] = 256;\n    reshape_dim7.d[1] = INPUT_H / 4;\n    reshape_dim7.d[2] = INPUT_W / 4;\n    reshape_1777->setReshapeDimensions(reshape_dim7);\n    nvinfer1::Permutation permutation5;\n    permutation5.order[0] = 1;\n    permutation5.order[1] = 0;\n    reshape_1777->setFirstTranspose(permutation5);\n\n    auto relu_1780 = convBnRelu(network, weightMap, *reshape_1777->getOutput(0), 512, 1, 1, 0, \"ocr_distri_head.object_context_block.f_up.0\", \"ocr_distri_head.object_context_block.f_up.1.0\", true, true);\n\n    ITensor *concatTensors1[] = {relu_1780->getOutput(0), relu_1689->getOutput(0)};\n    auto concat_1781 = network->addConcatenation(concatTensors1, 2);\n\n    auto relu_1784 = convBnRelu(network, weightMap, *concat_1781->getOutput(0), 512, 1, 1, 0, \"ocr_distri_head.conv_bn_dropout.0\", \"ocr_distri_head.conv_bn_dropout.1.0\", true, true);\n\n    auto conv_1785 = network->addConvolutionNd(*relu_1784->getOutput(0), NUM_CLASSES, DimsHW{1, 1}, weightMap[\"cls_head.weight\"], weightMap[\"cls_head.bias\"]);\n    debug_print(conv_1785->getOutput(0), \"cls_head\");\n    dim.nbDims = 3;\n    dim.d[0] = NUM_CLASSES;\n    dim.d[1] = INPUT_H;\n    dim.d[2] = INPUT_W;\n    auto feature_map = netAddUpsampleBi(network, conv_1785->getOutput(0), dim);\n    debug_print(feature_map->getOutput(0), \"upsample\");\n    auto topk = network->addTopK(*feature_map->getOutput(0), TopKOperation::kMAX, 1, 0X01);\n\n    debug_print(topk->getOutput(0), \"topk\");\n\n    std::cout << \"set name out\" << std::endl;\n    topk->getOutput(1)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*topk->getOutput(1));\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize((1 << 30)); // 1G\n#ifdef USE_FP16\n    std::cout << \"use fp16\" << std::endl;\n    config->setFlag(BuilderFlag::kFP16);\n#endif\n    ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build success!\" << std::endl;\n  
  network->destroy();\n    for (auto &mem : weightMap)\n    {\n        free((void *)(mem.second.values));\n    }\n    return engine;\n}\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory **modelStream, std::string wtsPath, int width)\n{\n    IBuilder *builder = createInferBuilder(gLogger);\n    IBuilderConfig *config = builder->createBuilderConfig();\n    ICudaEngine *engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT, wtsPath, width);\n    assert(engine != nullptr);\n    (*modelStream) = engine->serialize();\n    engine->destroy();\n    builder->destroy();\n}\n\nbool parse_args(int argc, char **argv, std::string &wts, std::string &engine, int &width, std::string &img_dir)\n{\n    // Check argc first so that argv[1] is never read when no arguments were given.\n    if (argc == 5 && std::string(argv[1]) == \"-s\")\n    {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        width = std::stoi(argv[4]);\n    }\n    else if (argc == 4 && std::string(argv[1]) == \"-d\")\n    {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n    }\n    else\n    {\n        return false;\n    }\n    return true;\n}\nvoid doInference(IExecutionContext &context, cudaStream_t &stream, void **buffers, int batchSize)\n{\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    cudaStreamSynchronize(stream);\n    cudaDeviceSynchronize();\n}\n\nint main(int argc, char **argv)\n{\n    cudaSetDevice(DEVICE);\n    std::string wtsPath = \"\";\n    std::string engine_name = \"\";\n    int width = 0; // only set by the -s (serialize) mode\n    std::string img_dir;\n    // parse args\n    if (!parse_args(argc, argv, wtsPath, engine_name, width, img_dir))\n    {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./hrnet_ocr -s [.wts] [.engine] [18 or 32 or 48]  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./hrnet_ocr -d [.engine] ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n    // create a model using the API directly and 
serialize it to a stream\n    if (!wtsPath.empty())\n    {\n        IHostMemory *modelStream{nullptr};\n        APIToModel(BATCH_SIZE, &modelStream, wtsPath, width);\n        assert(modelStream != nullptr);\n        std::ofstream p(engine_name, std::ios::binary);\n        if (!p)\n        {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char *>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    }\n\n    // deserialize the .engine and run inference\n    char *trtModelStream{nullptr};\n    size_t size{0};\n    std::ifstream file(engine_name, std::ios::binary);\n    if (file.good())\n    {\n        file.seekg(0, file.end);\n        size = file.tellg();\n        file.seekg(0, file.beg);\n        trtModelStream = new char[size];\n        assert(trtModelStream);\n        file.read(trtModelStream, size);\n        file.close();\n    }\n    else\n    {\n        std::cerr << \"could not open plan file\" << std::endl;\n        return -1; // nothing to deserialize without the plan file\n    }\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0)\n    {\n        std::cout << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n    // prepare input data ---------------------------\n    cudaSetDeviceFlags(cudaDeviceMapHost);\n    float *data;\n    int *prob; // using int. 
output is index\n    CHECK(cudaHostAlloc((void **)&data, BATCH_SIZE * 3 * INPUT_H * INPUT_W * sizeof(float), cudaHostAllocMapped));\n    CHECK(cudaHostAlloc((void **)&prob, BATCH_SIZE * OUTPUT_SIZE * sizeof(int), cudaHostAllocMapped));\n\n    IRuntime *runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext *context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n    void *buffers[2];\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than ICudaEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    for (int f = 0; f < (int)file_names.size(); f++)\n    {\n        std::cout << file_names[f] << std::endl;\n        cv::Mat pr_img;\n        cv::Mat img_BGR = cv::imread(img_dir + \"/\" + file_names[f], 1); // BGR\n        // skip unreadable files before cvtColor, which fails on an empty Mat\n        if (img_BGR.empty())\n            continue;\n        cv::Mat img;\n        cv::cvtColor(img_BGR, img, cv::COLOR_BGR2RGB);\n        cv::resize(img, pr_img, cv::Size(INPUT_W, INPUT_H));\n        img = pr_img.clone(); // for img show\n        pr_img.convertTo(pr_img, CV_32FC3);\n        if (!pr_img.isContinuous())\n        {\n            pr_img = pr_img.clone();\n        }\n        std::memcpy(data, pr_img.data, BATCH_SIZE * 3 * INPUT_W * INPUT_H * sizeof(float));\n\n        cudaHostGetDevicePointer((void **)&buffers[inputIndex], (void *)data, 0);  // buffers[inputIndex]-->data\n        cudaHostGetDevicePointer((void **)&buffers[outputIndex], (void *)prob, 0); // buffers[outputIndex] --> prob\n\n        // Run inference\n        
auto start = std::chrono::high_resolution_clock::now();\n        doInference(*context, stream, buffers, BATCH_SIZE);\n        auto end = std::chrono::high_resolution_clock::now();\n        std::cout << \"infer time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n        cv::Mat outimg(INPUT_H, INPUT_W, CV_8UC1);\n        for (int row = 0; row < INPUT_H; ++row)\n        {\n            uchar *uc_pixel = outimg.data + row * outimg.step;\n            for (int col = 0; col < INPUT_W; ++col)\n            {\n                uc_pixel[col] = (uchar)prob[row * INPUT_W + col];\n            }\n        }\n        cv::Mat im_color;\n        cv::cvtColor(outimg, im_color, cv::COLOR_GRAY2RGB);\n        cv::Mat lut = createLTU(NUM_CLASSES);\n        cv::LUT(im_color, lut, im_color);\n        // false color\n        cv::cvtColor(im_color, im_color, cv::COLOR_RGB2GRAY);\n        cv::applyColorMap(im_color, im_color, cv::COLORMAP_HOT);\n        // cv::imshow(\"False Color Map\", im_color);\n        cv::imwrite(std::to_string(f) + \"_false_color_map.png\", im_color);\n        // fusion\n        cv::Mat fusionImg;\n        cv::addWeighted(img, 1, im_color, 0.8, 1, fusionImg);\n        // cv::imshow(\"Fusion Img\", fusionImg);\n        // cv::waitKey(0);\n        cv::imwrite(std::to_string(f) + \"_fusion_img.png\", fusionImg);\n    }\n\n    // Release stream and the mapped host buffers allocated with cudaHostAlloc\n    cudaStreamDestroy(stream);\n    CHECK(cudaFreeHost(data));\n    CHECK(cudaFreeHost(prob));\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n    return 0;\n}\n"
  },
  {
    "path": "hrnet/hrnet-semantic-segmentation/hrnet_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences for hrnet.\n\"\"\"\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\nfrom imgaug import augmenters as iaa\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\nclass Hrnet_TRT(object):\n    \"\"\"\n    description: A Hrnet class that warps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.cfx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        runtime = trt.Runtime(trt.Logger(trt.Logger.INFO))\n        assert runtime\n        \n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if 
engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-2]\n                self.input_h = engine.get_binding_shape(binding)[-3]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n    def infer(self, image_raw):\n        # Make self the active context, pushing it on top of the context stack.\n        self.cfx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        engine = self.engine\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        print('ori_shape: ', image_raw.shape)\n        # if image_raw is constant, image_raw.shape[1] != self.input_w\n        w_ori, h_ori = image_raw.shape[1], image_raw.shape[0]\n        # Do image preprocess\n        input_image = self.preprocess_image(image_raw)\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], input_image.ravel())\n        start = time.time()\n        # Transfer input data to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the 
stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.cfx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        # Do postprocess\n        output = output.reshape(self.input_h, self.input_w).astype('uint8')\n        print('output_shape: ', output.shape)\n        output = cv2.resize(output, (w_ori, h_ori))\n        return output, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.cfx.pop()\n\n    def preprocess_image(self, image_raw):\n        \"\"\"\n        description: Convert a raw BGR image to RGB, resize it to the\n                    engine input size and cast it to float32.\n        param:\n            image_raw: numpy, raw image\n        return:\n            image:  the processed image\n        \"\"\"\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        resize = iaa.Resize({\n            'width': self.input_w,\n            'height': self.input_h\n        })\n        image = resize.augment_image(image)\n        print('resized', image.shape, image.dtype)\n        image = image.astype(np.float32)\n        return image\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read the first image of the batch from its path\n        \"\"\"\n        for img_path in image_path_batch:\n            return cv2.imread(img_path)\n    \n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            return np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, hrnet_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.hrnet_wrapper = hrnet_wrapper\n        self.image_path_batch = 
image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.hrnet_wrapper.infer(self.hrnet_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw*255)\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, hrnet_wrapper):\n        threading.Thread.__init__(self)\n        self.hrnet_wrapper = hrnet_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.hrnet_wrapper.infer(self.hrnet_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\n\nif __name__ == \"__main__\":\n    # load custom engine\n    engine_file_path = \"build/hrnet.engine\"  # the generated engine file\n    \n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a hrnet instance\n    hrnet_wrapper = Hrnet_TRT(engine_file_path)\n    try:\n        print('batch size is', hrnet_wrapper.batch_size)  # batch size is set to 1!\n        \n        image_dir = \"samples/\"\n        image_path_batches = get_img_path_batches(hrnet_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(hrnet_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(hrnet_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        
hrnet_wrapper.destroy()\n"
  },
  {
    "path": "hrnet/hrnet-semantic-segmentation/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "ibnnet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(IBNNet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -pthread -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nfile(GLOB SOURCE_FILES \"*.h\" \"*.cpp\")\n\nadd_executable(ibnnet ${SOURCE_FILES})\ntarget_link_libraries(ibnnet nvinfer)\ntarget_link_libraries(ibnnet cudart)\ntarget_link_libraries(ibnnet ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "ibnnet/InferenceEngine.cpp",
    "content": "#include \"InferenceEngine.h\"\n\nnamespace trt {\n\n   InferenceEngine::InferenceEngine(const EngineConfig &enginecfg): _engineCfg(enginecfg) { \n\n        assert(_engineCfg.max_batch_size > 0);\n\n        CHECK(cudaSetDevice(_engineCfg.device_id));\n\n        _runtime = make_holder(nvinfer1::createInferRuntime(gLogger));\n        assert(_runtime);\n\n        _engine = make_holder(_runtime->deserializeCudaEngine(_engineCfg.trtModelStream.get(), _engineCfg.stream_size)); \n        assert(_engine);\n\n        _context = make_holder(_engine->createExecutionContext());\n        assert(_context);\n\n        _inputSize = _engineCfg.max_batch_size * 3 * _engineCfg.input_h * _engineCfg.input_w * _depth;\n        _outputSize = _engineCfg.max_batch_size * _engineCfg.output_size * _depth; \n\n        CHECK(cudaMallocHost((void**)&_data, _inputSize));\n        CHECK(cudaMallocHost((void**)&_prob, _outputSize));\n\n        _streamptr = std::shared_ptr<cudaStream_t>( new cudaStream_t, \n            [](cudaStream_t* ptr){ \n                cudaStreamDestroy(*ptr);\n                if(ptr != nullptr){ \n                    delete ptr;\n                } \n            });\n\n        CHECK(cudaStreamCreate(&*_streamptr.get()));\n\n        // Pointers to input and output device buffers to pass to engine.\n        // Engine requires exactly IEngine::getNbBindings() number of buffers.\n        assert(_engine->getNbBindings() == 2);\n\n        // In order to bind the buffers, we need to know the names of the input and output tensors.\n        // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n        _inputIndex = _engine->getBindingIndex(_engineCfg.input_name);\n        _outputIndex = _engine->getBindingIndex(_engineCfg.output_name);\n        \n        // Create GPU buffers on device\n        CHECK(cudaMalloc(&_buffers[_inputIndex], _inputSize));\n        CHECK(cudaMalloc(&_buffers[_outputIndex], _outputSize));\n\n        _inputSize /= 
_engineCfg.max_batch_size;\n        _outputSize /= _engineCfg.max_batch_size; \n\n    }\n\n    bool InferenceEngine::doInference(const int inference_batch_size, std::function<void(float*)> preprocessing) {\n        assert(inference_batch_size <= _engineCfg.max_batch_size);\n        preprocessing(_data);\n        CHECK(cudaSetDevice(_engineCfg.device_id));\n        CHECK(cudaMemcpyAsync(_buffers[_inputIndex], _data, inference_batch_size * _inputSize, cudaMemcpyHostToDevice, *_streamptr));\n        auto status = _context->enqueue(inference_batch_size, _buffers, *_streamptr, nullptr);\n        CHECK(cudaMemcpyAsync(_prob, _buffers[_outputIndex], inference_batch_size * _outputSize, cudaMemcpyDeviceToHost, *_streamptr));\n        CHECK(cudaStreamSynchronize(*_streamptr));\n        return status;\n    }\n\n    InferenceEngine::InferenceEngine(InferenceEngine &&other) noexcept: \n        _engineCfg(other._engineCfg)\n        , _data(other._data)\n        , _prob(other._prob)\n        , _inputIndex(other._inputIndex) \n        , _outputIndex(other._outputIndex)\n        , _inputSize(other._inputSize) \n        , _outputSize(other._outputSize)\n        , _runtime(std::move(other._runtime))\n        , _engine(std::move(other._engine))\n        , _context(std::move(other._context))\n        , _streamptr(other._streamptr) { \n\n        _buffers[0] = other._buffers[0];\n        _buffers[1] = other._buffers[1];\n        other._streamptr.reset();\n        other._data = nullptr;\n        other._prob = nullptr;\n        other._buffers[0] = nullptr; \n        other._buffers[1] = nullptr; \n    } \n\n    InferenceEngine::~InferenceEngine() {  \n        CHECK(cudaFreeHost(_data));\n        CHECK(cudaFreeHost(_prob));\n        CHECK(cudaFree(_buffers[_inputIndex]));\n        CHECK(cudaFree(_buffers[_outputIndex]));\n    }\n}"
  },
  {
    "path": "ibnnet/InferenceEngine.h",
    "content": "/**************************************************************************\n * Handle memory pre-alloc\n * both on host(pinned memory, allow CUDA DMA) & device\n*************************************************************************/\n\n#pragma once\n\n#include <thread>\n#include <chrono>\n#include <memory>\n#include <functional>\n#include <opencv2/opencv.hpp>\n\n#include \"utils.h\"\n#include \"holder.h\"\n#include \"logging.h\"\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\nstatic Logger gLogger;\n\nnamespace trt {\n\n    struct EngineConfig {\n        const char* input_name;\n        const char* output_name; \n        std::shared_ptr<char> trtModelStream;\n        int max_batch_size; /* create engine */\n        int input_h;  \n        int input_w;\n        int output_size;\n        int stream_size;\n        int device_id;\n    };\n\n    class InferenceEngine {\n\n    public:\n        InferenceEngine(const EngineConfig &enginecfg);\n        InferenceEngine(InferenceEngine &&other) noexcept;\n        ~InferenceEngine();\n\n        InferenceEngine(const InferenceEngine &) = delete;\n        InferenceEngine& operator=(const InferenceEngine &) = delete;\n        InferenceEngine& operator=(InferenceEngine && other) = delete;\n\n        bool doInference(const int inference_batch_size, std::function<void(float*)> preprocessing);\n        float* getOutput() { return _prob; }\n        std::thread::id getThreadID() { return std::this_thread::get_id(); }\n\n    private:\n        EngineConfig _engineCfg;\n        float* _data{nullptr};\n        float* _prob{nullptr};\n\n        // Pointers to input and output device buffers to pass to engine.\n        // Engine requires exactly IEngine::getNbBindings() number of buffers.\n        void* _buffers[2];\n\n        // In order to bind the buffers, we need to know the names of the input and output tensors.\n        // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n        
int _inputIndex;\n        int _outputIndex;\n\n        int _inputSize;\n        int _outputSize;\n\n        static constexpr std::size_t _depth{sizeof(float)};\n\n        TensorRTHolder<nvinfer1::IRuntime> _runtime{nullptr};\n        TensorRTHolder<nvinfer1::ICudaEngine> _engine{nullptr};\n        TensorRTHolder<nvinfer1::IExecutionContext> _context{nullptr};\n        std::shared_ptr<cudaStream_t> _streamptr;\n    };\n\n}\n\n"
  },
  {
    "path": "ibnnet/README.md",
    "content": "# IBN-Net\n\nAn implementation of IBN-Net, proposed in [\"Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net\"](https://arxiv.org/abs/1807.09441) (ECCV 2018) by Xingang Pan, Ping Luo, Jianping Shi, and Xiaoou Tang.\n\nFor the PyTorch implementation, refer to [IBN-Net](https://github.com/XingangPan/IBN-Net).\n\n## Features\n- InstanceNorm2d\n- bottleneck_ibn\n- ResNet50-IBNA\n- ResNet50-IBNB\n- Multi-thread inference\n\n## How to Run\n\n* 1. Generate the .wts file\n\n  For IBN-a:\n  ```\n  python gen_wts.py a\n  ```\n  A file 'resnet50-ibna.wts' will be generated.\n\n  For IBN-b:\n  ```\n  python gen_wts.py b\n  ```\n  A file 'resnet50-ibnb.wts' will be generated.\n* 2. Run cmake and make\n\n  ```\n  mkdir build\n  cd build\n  cmake ..\n  make\n  ```\n* 3. Build the engine and run classification\n\n  Put resnet50-ibna.wts/resnet50-ibnb.wts into tensorrtx/ibnnet.\n\n  Go to tensorrtx/ibnnet, then run:\n  ```\n  ./ibnnet -s  // serialize model to plan file\n  ./ibnnet -d  // deserialize plan file and run inference\n  ```\n"
  },
  {
    "path": "ibnnet/gen_wts.py",
    "content": "import struct\nimport sys\n\nimport torch\n\n# Usage: python gen_wts.py a|b\nassert len(sys.argv) == 2 and sys.argv[1] in (\"a\", \"b\"), \"usage: python gen_wts.py a|b\"\nmodel_name = \"resnet50_ibn_\" + sys.argv[1]\n\nnet = torch.hub.load('XingangPan/IBN-Net', model_name, pretrained=True).to('cuda:0').eval()\n\n#verify\n#input = torch.ones(1, 3, 224, 224).to('cuda:0')\n#pixel_mean = torch.tensor([0.485, 0.456, 0.406]).view(1, -1, 1, 1).to('cuda:0')\n#pixel_std = torch.tensor([0.229, 0.224, 0.225]).view(1, -1, 1, 1).to('cuda:0')\n#input.sub_(pixel_mean).div_(pixel_std)\n#out = net(input)\n#print(out)\n\n# Write 'resnet50-ibna.wts' / 'resnet50-ibnb.wts', the file names that README.md and ibnnet.cpp expect.\nwts_path = \"resnet50-ibn\" + sys.argv[1] + \".wts\"\nstate_dict = net.state_dict()\nwith open(wts_path, 'w') as f:\n    # First line: number of tensors; then one line per tensor: name, element count, values.\n    f.write(\"{}\\n\".format(len(state_dict)))\n    for k, v in state_dict.items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write(\"{} {}\".format(k, len(vr)))\n        for vv in vr:\n            f.write(\" \")\n            # Each value is stored as the hex of a big-endian float32.\n            f.write(struct.pack(\">f\", float(vv)).hex())\n        f.write(\"\\n\")\n"
  },
  {
    "path": "ibnnet/holder.h",
    "content": "#pragma once\n\n// Move-only RAII wrapper for TensorRT objects that are released with destroy().\ntemplate <typename T>\nclass TensorRTHolder {\n    T* holder;\npublic:\n    explicit TensorRTHolder(T* holder_) : holder(holder_) {}\n    ~TensorRTHolder() {\n        if (holder)\n            holder->destroy();\n    }\n    TensorRTHolder(const TensorRTHolder&) = delete;\n    TensorRTHolder& operator=(const TensorRTHolder&) = delete;\n    TensorRTHolder(TensorRTHolder&& rhs) noexcept : holder(rhs.holder) {\n        rhs.holder = nullptr;\n    }\n    TensorRTHolder& operator=(TensorRTHolder&& rhs) noexcept {\n        if (this == &rhs) {\n            return *this;\n        }\n        if (holder) holder->destroy();\n        holder = rhs.holder;\n        rhs.holder = nullptr;\n        return *this;\n    }\n    // Observers do not modify the holder, so they are const-qualified.\n    T* operator->() const noexcept { return holder; }\n    T* get() const noexcept { return holder; }\n    explicit operator bool() const noexcept { return holder != nullptr; }\n    T& operator*() const noexcept { return *holder; }\n};\n\ntemplate <typename T>\nTensorRTHolder<T> make_holder(T* holder) {\n    return TensorRTHolder<T>(holder);\n}\n\ntemplate <typename T>\nusing TensorRTNonHolder = T*;"
  },
  {
    "path": "ibnnet/ibnnet.cpp",
    "content": "#include \"ibnnet.h\"\n\n//#define USE_FP16\n\nnamespace trt {\n\n    IBNNet::IBNNet(trt::EngineConfig &enginecfg, const IBN ibn) : _engineCfg(enginecfg) {\n        switch(ibn) {\n            case IBN::A:\n                _ibn = \"a\"; \n                break;\n            case IBN::B:\n                _ibn = \"b\"; \n                break;\n            case IBN::NONE:\n            default:\n                _ibn = \"\";\n                break;\n        }\n    }\n\n    // create the engine using only the API and not any parser.\n    ICudaEngine *IBNNet::createEngine(IBuilder* builder, IBuilderConfig* config) {\n        // resnet50-ibna, resnet50-ibnb, resnet50\n        assert(_ibn == \"a\" or _ibn == \"b\" or _ibn == \"\");\n        INetworkDefinition* network = builder->createNetworkV2(0U);\n\n        // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n        ITensor* data = network->addInput(_engineCfg.input_name, _dt, Dims3{3, _engineCfg.input_h, _engineCfg.input_w});\n        assert(data);\n\n        std::string path;\n        if(_ibn == \"\") {\n            path = \"../resnet50.wts\";\n        } else {\n            path = \"../resnet50-ibn\" + _ibn + \".wts\";\n        }\n\n        std::map<std::string, Weights> weightMap = loadWeights(path);\n        Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n        std::map<std::string, std::vector<std::string>> ibn_layers{ \n            { \"a\", {\"a\", \"a\", \"a\", \"a\", \"a\", \"a\", \"a\", \"a\", \"a\", \"a\", \"a\", \"a\", \"a\", \"\", \"\", \"\"}},\n            { \"b\", {\"\", \"\", \"b\", \"\", \"\", \"\",\"b\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\",}},\n            { \"\", {16, \"\"}}};\n\n        const float mean[3] = {0.485, 0.456, 0.406}; // rgb\n        const float std[3] = {0.229, 0.224, 0.225};\n        ITensor* pre_input = MeanStd(network, weightMap, data, \"\", mean, std, false);\n\n        IConvolutionLayer* conv1 = 
network->addConvolutionNd(*pre_input, 64, DimsHW{7, 7}, weightMap[\"conv1.weight\"], emptywts);\n        assert(conv1);\n        conv1->setStrideNd(DimsHW{2, 2});\n        conv1->setPaddingNd(DimsHW{3, 3});\n\n        IActivationLayer* relu1{nullptr};\n        if (_ibn == \"b\") {\n            IScaleLayer* bn1 = addInstanceNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\n            relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n        } else {\n            IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\n            relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n        }\n        assert(relu1);\n\n        // Add max pooling layer with a 3x3 kernel, stride of 2x2 and padding of 1x1.\n        IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n        assert(pool1);\n        pool1->setStrideNd(DimsHW{2, 2});\n        pool1->setPaddingNd(DimsHW{1, 1});\n\n        IActivationLayer* x = bottleneck_ibn(network, weightMap, *pool1->getOutput(0), 64, 64, 1, \"layer1.0.\", ibn_layers[_ibn][0]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 256, 64, 1, \"layer1.1.\", ibn_layers[_ibn][1]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 256, 64, 1, \"layer1.2.\", ibn_layers[_ibn][2]);\n\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 256, 128, 2, \"layer2.0.\", ibn_layers[_ibn][3]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.1.\", ibn_layers[_ibn][4]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.2.\", ibn_layers[_ibn][5]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.3.\", ibn_layers[_ibn][6]);\n\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 512, 256, 2, \"layer3.0.\", ibn_layers[_ibn][7]);\n        x = 
bottleneck_ibn(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.1.\", ibn_layers[_ibn][8]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.2.\", ibn_layers[_ibn][9]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.3.\", ibn_layers[_ibn][10]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.4.\", ibn_layers[_ibn][11]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.5.\", ibn_layers[_ibn][12]);\n\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 1024, 512, 2, \"layer4.0.\", ibn_layers[_ibn][13]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"layer4.1.\", ibn_layers[_ibn][14]);\n        x = bottleneck_ibn(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"layer4.2.\", ibn_layers[_ibn][15]);\n\n        IPoolingLayer* pool2 = network->addPoolingNd(*x->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n        assert(pool2);\n        pool2->setStrideNd(DimsHW{1, 1});\n        \n        IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 1000, weightMap[\"fc.weight\"], weightMap[\"fc.bias\"]);\n        assert(fc1);\n\n        fc1->getOutput(0)->setName(_engineCfg.output_name);\n        std::cout << \"set name out\" << std::endl;\n        network->markOutput(*fc1->getOutput(0));\n\n        // Build engine\n        builder->setMaxBatchSize(_engineCfg.max_batch_size);\n        config->setMaxWorkspaceSize(1 << 20);\n\n    #ifdef USE_FP16\n        config->setFlag(BuilderFlag::kFP16);\n    #endif\n        ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n        std::cout << \"build out\" << std::endl;\n\n        // Don't need the network any more\n        network->destroy();\n\n        // Release host memory\n        for (auto& mem : weightMap) {\n            free((void*) (mem.second.values));\n   
     }\n\n        return engine;\n    }\n\n    bool IBNNet::serializeEngine() {\n        // Create builder\n        auto builder = make_holder(createInferBuilder(gLogger));\n        auto config = make_holder(builder->createBuilderConfig());\n        // Create model to populate the network, then set the outputs and create an engine\n        ICudaEngine *engine = createEngine(builder.get(), config.get());\n        assert(engine);\n\n        // Serialize the engine\n        TensorRTHolder<IHostMemory> modelStream = make_holder(engine->serialize());\n        assert(modelStream);\n\n        std::ofstream p(\"./ibnnet.engine\", std::ios::binary | std::ios::out);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return false;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n\n        return true;\n    }\n\n    bool IBNNet::deserializeEngine() {\n        std::ifstream file(\"./ibnnet.engine\", std::ios::binary | std::ios::in);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            _engineCfg.stream_size = file.tellg();\n            file.seekg(0, file.beg);\n            _engineCfg.trtModelStream = std::shared_ptr<char>( new char[_engineCfg.stream_size], []( char* ptr ){ delete [] ptr; } );\n            assert(_engineCfg.trtModelStream.get());\n            file.read(_engineCfg.trtModelStream.get(), _engineCfg.stream_size);\n            file.close();\n    \n            _inferEngine = make_unique<trt::InferenceEngine>(_engineCfg);\n            return true;\n        }\n        return false;\n    }\n\n    void IBNNet::preprocessing(const cv::Mat& img, float* const data, const std::size_t stride) {\n        for (std::size_t i = 0; i < stride; ++i) { \n            data[i] = img.at<cv::Vec3b>(i)[2] / 255.0; \n            data[i + stride] = img.at<cv::Vec3b>(i)[1] / 255.0;\n            data[i + (stride<<1)] = img.at<cv::Vec3b>(i)[0] / 255.0;\n       
 }\n    }\n\n    bool IBNNet::inference(std::vector<cv::Mat> &input) {\n        if(_inferEngine != nullptr) {\n            const std::size_t stride = _engineCfg.input_w * _engineCfg.input_h;\n            return _inferEngine.get()->doInference(input.size(), \n                [&](float* data) {\n                    for(const auto &img : input) {\n                        preprocessing(img, data, stride);\n                        data += 3 * stride;\n                    }\n                }\n            );\n        } else {\n            return false;\n        }\n    }\n\n    float* IBNNet::getOutput() { \n        if(_inferEngine != nullptr) \n            return _inferEngine.get()->getOutput(); \n        return nullptr;\n    }\n\n    int IBNNet::getDeviceID() { \n        return _engineCfg.device_id; \n    }\n\n}"
  },
  {
    "path": "ibnnet/ibnnet.h",
    "content": "#pragma once\n\n#include \"utils.h\"\n#include \"holder.h\"\n#include \"layers.h\"\n#include \"InferenceEngine.h\"\n#include <memory>\n#include <vector>\n#include <chrono>\n#include <opencv2/opencv.hpp>\nextern Logger gLogger;\nusing namespace trtxapi;\n\nnamespace trt {\n\n    enum IBN {\n        A, // resnet50-ibna,\n        B, // resnet50-ibnb,\n        NONE // resnet50\n    };\n\n    class IBNNet {\n    public:\n        IBNNet(trt::EngineConfig &enginecfg, const IBN ibn);\n        ~IBNNet() {};\n\n        bool serializeEngine(); /* create & serializeEngine */ \n        bool deserializeEngine();\n        bool inference(std::vector<cv::Mat> &input); /* support batch inference */\n\n        float* getOutput(); \n        int getDeviceID(); /* cuda deviceid */ \n\n    private:\n        ICudaEngine *createEngine(IBuilder *builder, IBuilderConfig *config);\n        void preprocessing(const cv::Mat& img, float* const data, const std::size_t stride);\n\n    private:\n        trt::EngineConfig _engineCfg;\n        std::unique_ptr<trt::InferenceEngine> _inferEngine{nullptr};\n        std::string _ibn;\n        DataType _dt{DataType::kFLOAT};\n    };\n\n}"
  },
  {
    "path": "ibnnet/layers.cpp",
    "content": "#include \"layers.h\"\n\nnamespace trtxapi {\n\n    ITensor* MeanStd(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor* input, const std::string lname, const float* mean, const float* std, const bool div255) {\n        if(div255) {\n            Weights Div_225{ DataType::kFLOAT, nullptr, 3 };\n            float *wgt = reinterpret_cast<float*>(malloc(sizeof(float) * 3));\n            std::fill_n(wgt, 3, 255.0f); \n            Div_225.values = wgt;\n            weightMap[lname + \".div\"] = Div_225;\n            IConstantLayer* d = network->addConstant(Dims3{ 3, 1, 1 }, Div_225);\n            input = network->addElementWise(*input, *d->getOutput(0), ElementWiseOperation::kDIV)->getOutput(0);\n        }\n        Weights Mean{ DataType::kFLOAT, nullptr, 3 };\n        Mean.values = mean;\n        IConstantLayer* m = network->addConstant(Dims3{ 3, 1, 1 }, Mean);\n        IElementWiseLayer* sub_mean = network->addElementWise(*input, *m->getOutput(0), ElementWiseOperation::kSUB);\n        if (std != nullptr) {\n            Weights Std{ DataType::kFLOAT, nullptr, 3 };\n            Std.values = std;\n            IConstantLayer* s = network->addConstant(Dims3{ 3, 1, 1 }, Std);\n            IElementWiseLayer* std_mean = network->addElementWise(*sub_mean->getOutput(0), *s->getOutput(0), ElementWiseOperation::kDIV);\n            return std_mean->getOutput(0);\n        } else {\n            return sub_mean->getOutput(0);\n        }\n    }\n\n    IScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, const std::string lname, const float eps) {\n        float *gamma = (float*)weightMap[lname + \".weight\"].values;\n        float *beta = (float*)weightMap[lname + \".bias\"].values;\n        float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n        float *var = (float*)weightMap[lname + \".running_var\"].values;\n        int len = weightMap[lname + 
\".running_var\"].count;\n\n        float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n        for (int i = 0; i < len; i++) {\n            scval[i] = gamma[i] / sqrt(var[i] + eps);\n        }\n        Weights wscale{DataType::kFLOAT, scval, len};\n\n        float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n        for (int i = 0; i < len; i++) {\n            shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n        }\n        Weights wshift{DataType::kFLOAT, shval, len};\n\n        float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n        for (int i = 0; i < len; i++) {\n            pval[i] = 1.0;\n        }\n        Weights wpower{DataType::kFLOAT, pval, len};\n\n        weightMap[lname + \".scale\"] = wscale;\n        weightMap[lname + \".shift\"] = wshift;\n        weightMap[lname + \".power\"] = wpower;\n        IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, wshift, wscale, wpower);\n        assert(scale_1);\n        return scale_1;\n    }\n\n    IScaleLayer* addInstanceNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, const std::string lname, const float eps) {\n\n        int len = weightMap[lname + \".weight\"].count;\n\n        IReduceLayer* reduce1 = network->addReduce(input, \n            ReduceOperation::kAVG,\n            6, \n            true);\n        assert(reduce1);\n\n        IElementWiseLayer* ew1 = network->addElementWise(input, \n            *reduce1->getOutput(0),\n            ElementWiseOperation::kSUB);  \n        assert(ew1);\n\n        const static float pval1[3]{0.0, 1.0, 2.0};   \n        Weights wshift1{DataType::kFLOAT, pval1, 1};\n        Weights wscale1{DataType::kFLOAT, pval1+1, 1};\n        Weights wpower1{DataType::kFLOAT, pval1+2, 1};\n\n        IScaleLayer* scale1 = network->addScale(\n            *ew1->getOutput(0), \n            ScaleMode::kUNIFORM,\n            wshift1,  \n            
wscale1,  \n            wpower1); \n        assert(scale1);\n\n        IReduceLayer* reduce2 = network->addReduce(\n            *scale1->getOutput(0), \n            ReduceOperation::kAVG,\n            6, \n            true);\n        assert(reduce2);\n\n        const static float pval2[3]{eps, 1.0, 0.5}; \n        Weights wshift2{DataType::kFLOAT, pval2, 1};\n        Weights wscale2{DataType::kFLOAT, pval2+1, 1};\n        Weights wpower2{DataType::kFLOAT, pval2+2, 1};\n        \n        IScaleLayer* scale2 = network->addScale(\n            *reduce2->getOutput(0), \n            ScaleMode::kUNIFORM,\n            wshift2,  \n            wscale2,  \n            wpower2);\n        assert(scale2);\n\n        IElementWiseLayer* ew2 = network->addElementWise(*ew1->getOutput(0), \n            *scale2->getOutput(0),\n            ElementWiseOperation::kDIV); \n        assert(ew2);\n\n        float* pval3 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n        std::fill_n(pval3, len, 1.0); \n        Weights wpower3{DataType::kFLOAT, pval3, len};\n        weightMap[lname + \".power3\"] = wpower3;\n\n        IScaleLayer* scale3 = network->addScale(\n            *ew2->getOutput(0), \n            ScaleMode::kCHANNEL,\n            weightMap[lname + \".bias\"], \n            weightMap[lname + \".weight\"],  \n            wpower3); \n        assert(scale3);\n        return scale3;\n    }\n\n    IConcatenationLayer* addIBN(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, const std::string lname) {\n        Dims spliteDims = input.getDimensions();\n        ISliceLayer *split1 = network->addSlice(input, \n            Dims3{0, 0, 0}, \n            Dims3{spliteDims.d[0]/2, spliteDims.d[1], spliteDims.d[2]}, \n            Dims3{1, 1, 1});\n        assert(split1);\n\n        ISliceLayer *split2 = network->addSlice(input, \n            Dims3{spliteDims.d[0]/2, 0, 0}, \n            Dims3{spliteDims.d[0]/2, spliteDims.d[1], spliteDims.d[2]}, \n  
          Dims3{1, 1, 1});\n        assert(split2);\n\n        auto in1 = addInstanceNorm2d(network, weightMap, *split1->getOutput(0), lname + \"IN\", 1e-5);\n        auto bn1 = addBatchNorm2d(network, weightMap, *split2->getOutput(0), lname + \"BN\", 1e-5);\n\n        ITensor* tensor1[] = {in1->getOutput(0), bn1->getOutput(0)};\n        auto cat1 = network->addConcatenation(tensor1, 2);\n        assert(cat1);\n        return cat1;\n    }\n\n    IActivationLayer* bottleneck_ibn(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, const int inch, const int outch, const int stride, const std::string lname, const std::string ibn) {\n        Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n        IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{1, 1}, weightMap[lname + \"conv1.weight\"], emptywts);\n        assert(conv1);\n\n        IActivationLayer* relu1{nullptr};\n        if (ibn == \"a\") {\n            IConcatenationLayer* bn1 = addIBN(network, weightMap, *conv1->getOutput(0), lname + \"bn1.\");\n            relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n            assert(relu1);\n        } else {\n            IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"bn1\", 1e-5);\n            relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n            assert(relu1);\n        }\n\n        IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{3, 3}, weightMap[lname + \"conv2.weight\"], emptywts);\n        assert(conv2);\n        conv2->setStrideNd(DimsHW{stride, stride});\n        conv2->setPaddingNd(DimsHW{1, 1});\n\n        IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n        IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n        assert(relu2);\n\n        IConvolutionLayer* conv3 
= network->addConvolutionNd(*relu2->getOutput(0), outch * 4, DimsHW{1, 1}, weightMap[lname + \"conv3.weight\"], emptywts);\n        assert(conv3);\n\n        IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"bn3\", 1e-5);\n\n        IElementWiseLayer* ew1;\n        if (stride != 1 || inch != outch * 4) {\n            IConvolutionLayer* conv4 = network->addConvolutionNd(input, outch * 4, DimsHW{1, 1}, weightMap[lname + \"downsample.0.weight\"], emptywts);\n            assert(conv4);\n            conv4->setStrideNd(DimsHW{stride, stride});\n\n            IScaleLayer* bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \"downsample.1\", 1e-5);\n            ew1 = network->addElementWise(*bn4->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\n        } else {\n            ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n        }\n    \n        IActivationLayer* relu3{nullptr};\n        if (ibn == \"b\") {\n            IScaleLayer* in1 = addInstanceNorm2d(network, weightMap, *ew1->getOutput(0), lname + \"IN\", 1e-5);\n            relu3 = network->addActivation(*in1->getOutput(0), ActivationType::kRELU);\n        } else {\n            relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n        }\n\n        assert(relu3);\n        return relu3;\n    }\n\n}"
  },
  {
    "path": "ibnnet/layers.h",
    "content": "#pragma once\n\n#include <map>\n#include <math.h>\n#include <assert.h>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\nusing namespace nvinfer1;\n\nnamespace trtxapi {\n\n    ITensor* MeanStd(INetworkDefinition *network, \n        std::map<std::string, Weights>& weightMap, \n        ITensor* input, \n        const std::string lname,\n        const float* mean, \n        const float* std, \n        const bool div255);\n\n    IScaleLayer* addBatchNorm2d(INetworkDefinition *network, \n        std::map<std::string, Weights>& weightMap, \n        ITensor& input, \n        const std::string lname, \n        const float eps);\n\n    IScaleLayer* addInstanceNorm2d(INetworkDefinition *network, \n        std::map<std::string, Weights>& weightMap, \n        ITensor& input, \n        const std::string lname, \n        const float eps);\n\n    IConcatenationLayer* addIBN(INetworkDefinition *network, \n        std::map<std::string, Weights>& weightMap, \n        ITensor& input, \n        const std::string lname);\n\n    IActivationLayer* bottleneck_ibn(INetworkDefinition *network, \n        std::map<std::string, Weights>& weightMap, \n        ITensor& input, \n        const int inch, \n        const int outch,\n        const int stride, \n        const std::string lname, \n        const std::string ibn);\n\n}"
  },
  {
    "path": "ibnnet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer1::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "ibnnet/main.cpp",
    "content": "#include <iostream>\n#include <memory>\n#include <string>\n#include <thread>\n#include <vector>\n#include \"ibnnet.h\"\n#include \"InferenceEngine.h\"\n\n// stuff we know about the network and the input/output blobs\nstatic const int MAX_BATCH_SIZE = 4;\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\nstatic const int DEVICE_ID = 0;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nextern Logger gLogger;\n\nvoid run_infer(std::shared_ptr<trt::IBNNet> model) {\n\n    CHECK(cudaSetDevice(model->getDeviceID()));\n\n    if(!model->deserializeEngine()) {\n        std::cout << \"DeserializeEngine Failed.\" << std::endl;\n        return;\n    }\n\n    /* support batch input data */\n    std::vector<cv::Mat> input;\n    input.emplace_back( cv::Mat(INPUT_H, INPUT_W, CV_8UC3, cv::Scalar(255,255,255)) ) ;\n\n    /* run inference */\n    model->inference(input);\n\n    /* get output data from cudaMalloc */\n    float* prob = model->getOutput();\n\n    /* print output */\n    std::cout << \"\\nOutput from thread_id: \" << std::this_thread::get_id() << std::endl;\n    if( prob != nullptr ) {\n        for (size_t batch_idx = 0; batch_idx < input.size(); ++batch_idx) {\n            for (int p = 0; p < OUTPUT_SIZE; ++p) {\n                /* each batch item owns OUTPUT_SIZE consecutive values */\n                std::cout<< prob[batch_idx * OUTPUT_SIZE + p] << \" \";\n                if ((p+1) % 10 == 0) {\n                    std::cout << std::endl;\n                }\n            }\n        }\n    }\n}\n\nint main(int argc, char** argv) {\n\n    trt::EngineConfig engineCfg {\n        INPUT_BLOB_NAME,\n        OUTPUT_BLOB_NAME,\n        nullptr,\n        MAX_BATCH_SIZE,\n        INPUT_H,\n        INPUT_W,\n        OUTPUT_SIZE,\n        0,\n        DEVICE_ID};\n\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\n        std::cout << \"Serializing Engine\" << std::endl;\n        trt::IBNNet ibnnet{engineCfg, trt::IBN::A};\n        ibnnet.serializeEngine();\n        return 0;\n    } else if (argc 
== 2 && std::string(argv[1]) == \"-d\") {\n\n        /* \n         * Support multi thread inference (mthreads>1)\n         * Each thread holds their own CudaEngine\n         * They can run on different cuda device through trt::EngineConfig setting\n        */\n        int mthreads = 1; \n        std::vector<std::thread> workers;\n        std::vector<std::shared_ptr<trt::IBNNet>> models;\n\n        for(int i = 0; i < mthreads; ++i) {\n            models.emplace_back( std::make_shared<trt::IBNNet>(engineCfg, trt::IBN::A) ); // For IBNB: trt::IBN::B\n        }\n\n        for(int i = 0; i < mthreads; ++i) {\n            workers.emplace_back( std::thread(run_infer, models[i]) );\n        }\n\n        for(auto & worker : workers) {\n            worker.join();\n        } \n\n        return 0;\n    } else {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./ibnnet -s  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./ibnnet -d  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n}\n"
  },
  {
    "path": "ibnnet/utils.cpp",
    "content": "#include \"utils.h\"\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [name] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob: one 32-bit word per weight (sizeof(val) would be the pointer size)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n"
  },
  {
    "path": "ibnnet/utils.h",
    "content": "#pragma once\n\n#include <cassert>\n#include <cstdlib>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <memory>\n#include <string>\n#include <utility>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n\nusing namespace nvinfer1;\n\n// Abort with a message on stderr if a CUDA runtime call returns a non-zero status\n#define CHECK(status)                                          \\\n    do                                                         \\\n    {                                                          \\\n        auto ret = (status);                                   \\\n        if (ret != 0)                                          \\\n        {                                                      \\\n            std::cerr << \"Cuda failure: \" << ret << std::endl; \\\n            abort();                                           \\\n        }                                                      \\\n    } while (0)\n\n// C++11 stand-in for std::make_unique\ntemplate<typename T, typename... Args>\nstd::unique_ptr<T> make_unique(Args&&... args) {\n    return std::unique_ptr<T>(new T(std::forward<Args>(args)...));\n}\n\nstd::map<std::string, Weights> loadWeights(const std::string file);\n"
  },
  {
    "path": "inception/inceptionv3/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(inception)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use the static CUDA runtime\" OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt; adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nadd_executable(inception ${PROJECT_SOURCE_DIR}/inception_v3.cpp)\ntarget_link_libraries(inception nvinfer)\ntarget_link_libraries(inception cudart)\n\nadd_definitions(-O2 -pthread)\n"
  },
  {
    "path": "inception/inceptionv3/README.md",
    "content": "# Inception v3\n\nInception v3 model architecture from \"Rethinking the Inception Architecture for Computer Vision\" <http://arxiv.org/abs/1512.00567>.\n\nFor details, refer to [pytorchx/inception](https://github.com/wang-xinyu/pytorchx/tree/master/inception).\n\nThe following tricks are used in this implementation:\n\n- For pooling layers with padding, check whether the padded elements are included or excluded when the average is computed. PyTorch includes padding in avgPool by default, but TensorRT doesn't, so pooling layers with padding need `setAverageCountExcludesPadding(false)` in TensorRT.\n- The batchnorm layer is implemented with a scale layer.\n\n```\n// 1. generate inception.wts from [pytorchx/inception](https://github.com/wang-xinyu/pytorchx/tree/master/inception)\n\n// 2. put inception.wts into tensorrtx/inception\n\n// 3. build and run\n\ncd tensorrtx/inception\n\nmkdir build\n\ncd build\n\ncmake ..\n\nmake\n\nsudo ./inception -s   // serialize model to plan file, i.e. 'inception.engine'\n\nsudo ./inception -d   // deserialize plan file and run inference\n\n// 4. check that the output matches pytorchx/inception\n```\n"
  },
  {
    "path": "inception/inceptionv3/inception_v3.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <cassert>\n#include <cstdlib>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <string>\n#include <vector>\n#include <chrono>\n#include <cmath>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 299;\nstatic const int INPUT_W = 299;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [name] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob: one 32-bit word per weight (sizeof(val) would be the pointer size)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* 
addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    std::cout << \"len \" << len << std::endl;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIActivationLayer* basicConv2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,  int outch, DimsHW ksize, int s, DimsHW p, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, ksize, weightMap[lname + \"conv.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(p);\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + 
\"bn\", 1e-3);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    return relu1;\n}\n\nIConcatenationLayer* inceptionA(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname,\n    int pool_proj) {\n    IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 64, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch1x1.\");\n\n    IActivationLayer* relu2 = basicConv2d(network, weightMap, input, 48, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch5x5_1.\");\n    relu2 = basicConv2d(network, weightMap, *relu2->getOutput(0), 64, DimsHW{5, 5}, 1, DimsHW{2, 2}, lname + \"branch5x5_2.\");\n\n    IActivationLayer* relu3 = basicConv2d(network, weightMap, input, 64, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch3x3dbl_1.\");\n    relu3 = basicConv2d(network, weightMap, *relu3->getOutput(0), 96, DimsHW{3, 3}, 1, DimsHW{1, 1}, lname + \"branch3x3dbl_2.\");\n    relu3 = basicConv2d(network, weightMap, *relu3->getOutput(0), 96, DimsHW{3, 3}, 1, DimsHW{1, 1}, lname + \"branch3x3dbl_3.\");\n\n    IPoolingLayer* pool1 = network->addPoolingNd(input, PoolingType::kAVERAGE, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{1, 1});\n    pool1->setPaddingNd(DimsHW{1, 1});\n    pool1->setAverageCountExcludesPadding(false);\n    IActivationLayer* relu4 = basicConv2d(network, weightMap, *pool1->getOutput(0), pool_proj, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch_pool.\");\n\n    ITensor* inputTensors[] = {relu1->getOutput(0), relu2->getOutput(0), relu3->getOutput(0), relu4->getOutput(0)};\n    IConcatenationLayer* cat1 = network->addConcatenation(inputTensors, 4);\n    assert(cat1);\n    return cat1;\n}\n\nIConcatenationLayer* inceptionB(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname) {\n    IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 384, DimsHW{3, 3}, 2, DimsHW{0, 
0}, lname + \"branch3x3.\");\n\n    IActivationLayer* relu2 = basicConv2d(network, weightMap, input, 64, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch3x3dbl_1.\");\n    relu2 = basicConv2d(network, weightMap, *relu2->getOutput(0), 96, DimsHW{3, 3}, 1, DimsHW{1, 1}, lname + \"branch3x3dbl_2.\");\n    relu2 = basicConv2d(network, weightMap, *relu2->getOutput(0), 96, DimsHW{3, 3}, 2, DimsHW{0, 0}, lname + \"branch3x3dbl_3.\");\n\n    IPoolingLayer* pool1 = network->addPoolingNd(input, PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    ITensor* inputTensors[] = {relu1->getOutput(0), relu2->getOutput(0), pool1->getOutput(0)};\n    IConcatenationLayer* cat1 = network->addConcatenation(inputTensors, 3);\n    assert(cat1);\n    return cat1;\n}\n\nIConcatenationLayer* inceptionC(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname,\n    int c7) {\n    IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 192, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch1x1.\");\n\n    IActivationLayer* relu2 = basicConv2d(network, weightMap, input, c7, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch7x7_1.\");\n    relu2 = basicConv2d(network, weightMap, *relu2->getOutput(0), c7, DimsHW{1, 7}, 1, DimsHW{0, 3}, lname + \"branch7x7_2.\");\n    relu2 = basicConv2d(network, weightMap, *relu2->getOutput(0), 192, DimsHW{7, 1}, 1, DimsHW{3, 0}, lname + \"branch7x7_3.\");\n\n    IActivationLayer* relu3 = basicConv2d(network, weightMap, input, c7, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch7x7dbl_1.\");\n    relu3 = basicConv2d(network, weightMap, *relu3->getOutput(0), c7, DimsHW{7, 1}, 1, DimsHW{3, 0}, lname + \"branch7x7dbl_2.\");\n    relu3 = basicConv2d(network, weightMap, *relu3->getOutput(0), c7, DimsHW{1, 7}, 1, DimsHW{0, 3}, lname + \"branch7x7dbl_3.\");\n    relu3 = basicConv2d(network, weightMap, *relu3->getOutput(0), c7, DimsHW{7, 1}, 1, DimsHW{3, 0}, lname + 
\"branch7x7dbl_4.\");\n    relu3 = basicConv2d(network, weightMap, *relu3->getOutput(0), 192, DimsHW{1, 7}, 1, DimsHW{0, 3}, lname + \"branch7x7dbl_5.\");\n\n    IPoolingLayer* pool1 = network->addPoolingNd(input, PoolingType::kAVERAGE, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{1, 1});\n    pool1->setPaddingNd(DimsHW{1, 1});\n    pool1->setAverageCountExcludesPadding(false);\n    IActivationLayer* relu4 = basicConv2d(network, weightMap, *pool1->getOutput(0), 192, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch_pool.\");\n\n    ITensor* inputTensors[] = {relu1->getOutput(0), relu2->getOutput(0), relu3->getOutput(0), relu4->getOutput(0)};\n    IConcatenationLayer* cat1 = network->addConcatenation(inputTensors, 4);\n    assert(cat1);\n    return cat1;\n}\n\nIConcatenationLayer* inceptionD(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname) {\n    IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 192, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch3x3_1.\");\n    relu1 = basicConv2d(network, weightMap, *relu1->getOutput(0), 320, DimsHW{3, 3}, 2, DimsHW{0, 0}, lname + \"branch3x3_2.\");\n\n    IActivationLayer* relu2 = basicConv2d(network, weightMap, input, 192, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch7x7x3_1.\");\n    relu2 = basicConv2d(network, weightMap, *relu2->getOutput(0), 192, DimsHW{1, 7}, 1, DimsHW{0, 3}, lname + \"branch7x7x3_2.\");\n    relu2 = basicConv2d(network, weightMap, *relu2->getOutput(0), 192, DimsHW{7, 1}, 1, DimsHW{3, 0}, lname + \"branch7x7x3_3.\");\n    relu2 = basicConv2d(network, weightMap, *relu2->getOutput(0), 192, DimsHW{3, 3}, 2, DimsHW{0, 0}, lname + \"branch7x7x3_4.\");\n\n    IPoolingLayer* pool1 = network->addPoolingNd(input, PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    ITensor* inputTensors[] = {relu1->getOutput(0), relu2->getOutput(0), pool1->getOutput(0)};\n    
IConcatenationLayer* cat1 = network->addConcatenation(inputTensors, 3);\n    assert(cat1);\n    return cat1;\n}\n\nIConcatenationLayer* inceptionE(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname) {\n    IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 320, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch1x1.\");\n\n    IActivationLayer* relu2 = basicConv2d(network, weightMap, input, 384, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch3x3_1.\");\n    IActivationLayer* relu2a = basicConv2d(network, weightMap, *relu2->getOutput(0), 384, DimsHW{1, 3}, 1, DimsHW{0, 1}, lname + \"branch3x3_2a.\");\n    IActivationLayer* relu2b = basicConv2d(network, weightMap, *relu2->getOutput(0), 384, DimsHW{3, 1}, 1, DimsHW{1, 0}, lname + \"branch3x3_2b.\");\n    ITensor* inputTensors[] = {relu2a->getOutput(0), relu2b->getOutput(0)};\n    IConcatenationLayer* cat1 = network->addConcatenation(inputTensors, 2);\n    assert(cat1);\n\n    IActivationLayer* relu3 = basicConv2d(network, weightMap, input, 448, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch3x3dbl_1.\");\n    relu3 = basicConv2d(network, weightMap, *relu3->getOutput(0), 384, DimsHW{3, 3}, 1, DimsHW{1, 1}, lname + \"branch3x3dbl_2.\");\n    IActivationLayer* relu3a = basicConv2d(network, weightMap, *relu3->getOutput(0), 384, DimsHW{1, 3}, 1, DimsHW{0, 1}, lname + \"branch3x3dbl_3a.\");\n    IActivationLayer* relu3b = basicConv2d(network, weightMap, *relu3->getOutput(0), 384, DimsHW{3, 1}, 1, DimsHW{1, 0}, lname + \"branch3x3dbl_3b.\");\n    ITensor* inputTensors1[] = {relu3a->getOutput(0), relu3b->getOutput(0)};\n    IConcatenationLayer* cat2 = network->addConcatenation(inputTensors1, 2);\n    assert(cat2);\n\n    IPoolingLayer* pool1 = network->addPoolingNd(input, PoolingType::kAVERAGE, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{1, 1});\n    pool1->setPaddingNd(DimsHW{1, 1});\n    pool1->setAverageCountExcludesPadding(false);\n    
IActivationLayer* relu4 = basicConv2d(network, weightMap, *pool1->getOutput(0), 192, DimsHW{1, 1}, 1, DimsHW{0, 0}, lname + \"branch_pool.\");\n\n    ITensor* inputTensors2[] = {relu1->getOutput(0), cat1->getOutput(0), cat2->getOutput(0), relu4->getOutput(0)};\n    IConcatenationLayer* cat3 = network->addConcatenation(inputTensors2, 4);\n    assert(cat3);\n    return cat3;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)\n{\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../inception.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    float shval[3] = {(0.485 - 0.5) / 0.5, (0.456 - 0.5) / 0.5, (0.406 - 0.5) / 0.5};\n    float scval[3] = {0.229 / 0.5, 0.224 / 0.5, 0.225 / 0.5};\n    float pval[3] = {1.0, 1.0, 1.0};\n    Weights shift{DataType::kFLOAT, shval, 3};\n    Weights scale{DataType::kFLOAT, scval, 3};\n    Weights power{DataType::kFLOAT, pval, 3};\n    IScaleLayer* scale1 = network->addScale(*data, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale1);\n\n    IActivationLayer* relu1 = basicConv2d(network, weightMap, *scale1->getOutput(0), 32, DimsHW{3, 3}, 2, DimsHW{0, 0}, \"Conv2d_1a_3x3.\");\n    relu1 = basicConv2d(network, weightMap, *relu1->getOutput(0), 32, DimsHW{3, 3}, 1, DimsHW{0, 0}, \"Conv2d_2a_3x3.\");\n    relu1 = basicConv2d(network, weightMap, *relu1->getOutput(0), 64, DimsHW{3, 3}, 1, DimsHW{1, 1}, \"Conv2d_2b_3x3.\");\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    relu1 = basicConv2d(network, weightMap, 
*pool1->getOutput(0), 80, DimsHW{1, 1}, 1, DimsHW{0, 0}, \"Conv2d_3b_1x1.\");\n    relu1 = basicConv2d(network, weightMap, *relu1->getOutput(0), 192, DimsHW{3, 3}, 1, DimsHW{0, 0}, \"Conv2d_4a_3x3.\");\n    pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    auto cat1 = inceptionA(network, weightMap, *pool1->getOutput(0), \"Mixed_5b.\", 32);\n    cat1 = inceptionA(network, weightMap, *cat1->getOutput(0), \"Mixed_5c.\", 64);\n    cat1 = inceptionA(network, weightMap, *cat1->getOutput(0), \"Mixed_5d.\", 64);\n    cat1 = inceptionB(network, weightMap, *cat1->getOutput(0), \"Mixed_6a.\");\n    cat1 = inceptionC(network, weightMap, *cat1->getOutput(0), \"Mixed_6b.\", 128);\n    cat1 = inceptionC(network, weightMap, *cat1->getOutput(0), \"Mixed_6c.\", 160);\n    cat1 = inceptionC(network, weightMap, *cat1->getOutput(0), \"Mixed_6d.\", 160);\n    cat1 = inceptionC(network, weightMap, *cat1->getOutput(0), \"Mixed_6e.\", 192);\n    cat1 = inceptionD(network, weightMap, *cat1->getOutput(0), \"Mixed_7a.\");\n    cat1 = inceptionE(network, weightMap, *cat1->getOutput(0), \"Mixed_7b.\");\n    cat1 = inceptionE(network, weightMap, *cat1->getOutput(0), \"Mixed_7c.\");\n\n    IPoolingLayer* pool2 = network->addPoolingNd(*cat1->getOutput(0), PoolingType::kAVERAGE, DimsHW{8, 8});\n    assert(pool2);\n\n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 1000, weightMap[\"fc.weight\"], weightMap[\"fc.bias\"]);\n    assert(fc1);\n\n    fc1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*fc1->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    
// Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)\n{\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize)\n{\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    
context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    CHECK(cudaStreamSynchronize(stream));\n\n    // Release stream and buffers\n    CHECK(cudaStreamDestroy(stream));\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./inception -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./inception -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"inception.engine\", std::ios::binary);\n        if (!p)\n        {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        // serialization succeeded, so report success to the shell\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"inception.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        } else {\n            // without this check a null buffer would be passed to the runtime below\n            std::cerr << \"could not read plan file inception.engine\" << std::endl;\n            return -1;\n        }\n    } else {\n        return -1;\n    }\n\n\n    // Fill the input with dummy data (all ones); no real image is loaded here\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n\n 
   IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 100; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print histogram of the output distribution\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\n    {\n        std::cout << prob[i] << \", \";\n        if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "inception/inceptionv3/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "inception/inceptionv4/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(InceptionV4)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -pthread -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nfile(GLOB SOURCE_FILES \"*.h\" \"*.cpp\")\n\nadd_executable(inceptionv4 ${SOURCE_FILES})\ntarget_link_libraries(inceptionv4 nvinfer)\ntarget_link_libraries(inceptionv4 cudart)\ntarget_link_libraries(inceptionv4 ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "inception/inceptionv4/README.md",
    "content": "# Inception v4\n\nInception v4 model architecture from \"Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning\" <https://arxiv.org/abs/1602.07261v2>.\n\nFor the details, you can refer to [rwightman/pytorch-image-models](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/inception_v4.py)\n\nFollowing tricks are used in this inception:\n\n- For pooling layer with padding, we need pay attention to see if padding is included or excluded while calculating average number. Pytorch includes padding while doing avgPool by default, but Tensorrt doesn't. So for pooling layer with padding, we need `setAverageCountExcludesPadding(false)` in tensorrt.\n- Batchnorm layer, implemented by scale layer.\n\n```\n// 1. generate inception.wts from [BlueMirrors/torchtrtz](https://github.com/BlueMirrors/torchtrtz/blob/main/generate_weights.py)\n\n// 2. put inception.wts into tensorrtx/inceptionV4\n\n// 3. build and run\n\ncd tensorrtx/inception/inceptionV4\n\nmkdir build\n\ncd build\n\ncmake ..\n\nmake\n\nsudo ./inceptionV4 -s   // serialize model to plan file i.e. 'inceptionV4.engine'\n\nsudo ./inceptionV4 -d   // deserialize plan file and run inference\n\n// 4. see if the output is same as rwightman/pytorch-image-models/inceptionv4\n```\n\n\n"
  },
  {
    "path": "inception/inceptionv4/inception_v4.cpp",
    "content": "# include \"inception_v4.h\"\n\n\nnamespace trtx {\n    InceptionV4::InceptionV4(const InceptionV4Params &params)\n    : mParams(params)\n    , mContext(nullptr)\n    , mEngine(nullptr)\n    {\n    }\n\n    /**\n     * Builds the tensorrt engine and serializes it.\n    **/\n    bool InceptionV4::serializeEngine()\n    {\n        // load weights\n        weightMap = loadWeights(mParams.weightsFile);\n\n        // create builder\n        IBuilder* builder = createInferBuilder(gLogger);\n        assert(builder);\n\n        // create builder config\n        IBuilderConfig* config = builder -> createBuilderConfig();\n        assert(config);\n\n        // create engine\n        bool created = buildEngine(builder, config);\n        if(!created)\n        {\n            std::cerr << \"Engine creation failed. Check logs.\" << std::endl;\n            return false;\n        }\n\n        // serilaize engine\n        assert(mEngine != nullptr);\n        IHostMemory* modelStream{nullptr};\n        modelStream = mEngine -> serialize();\n        assert(modelStream != nullptr);\n\n        // destroy\n        config -> destroy();\n        builder -> destroy();\n\n        // write serialized engine to file\n        std::ofstream trtFile(mParams.trtEngineFile, std::ios::binary);\n        if(!trtFile){\n            std::cerr << \"Unable to open engine file.\" << std::endl;\n            return false;\n        }\n\n        trtFile.write(reinterpret_cast<const char*>(modelStream -> data()), modelStream -> size());\n        std::cout << \"Engine serialized and saved.\" << std::endl;\n\n        // clean\n        modelStream -> destroy();\n\n        return true;\n    }\n\n    bool InceptionV4::buildEngine(IBuilder *builder, IBuilderConfig *config) {\n        INetworkDefinition* network = builder->createNetworkV2(0U);\n\n        // Create input tensor of shape { 1, 1, 32, 32 } with name INPUT_BLOB_NAME\n        ITensor* data = network->addInput(mParams.inputTensorName, dt, 
Dims3{3, mParams.inputH, mParams.inputW});\n        assert(data);\n\n        Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n        float shval[3] = {(0.485 - 0.5) / 0.5, (0.456 - 0.5) / 0.5, (0.406 - 0.5) / 0.5};\n        float scval[3] = {0.229 / 0.5, 0.224 / 0.5, 0.225 / 0.5};\n        float pval[3] = {1.0, 1.0, 1.0};\n        Weights shift{DataType::kFLOAT, shval, 3};\n        Weights scale{DataType::kFLOAT, scval, 3};\n        Weights power{DataType::kFLOAT, pval, 3};\n        IScaleLayer* scale1 = network->addScale(*data, ScaleMode::kCHANNEL, shift, scale, power);\n        assert(scale1);\n\n        IActivationLayer* relu0 = basicConv2d(network, weightMap, *scale1 -> getOutput(0), 32, DimsHW{ 3, 3 }, 2, DimsHW{ 0, 0 }, \"features.0\");\n        relu0 = basicConv2d(network, weightMap, *relu0 -> getOutput(0), 32, DimsHW{ 3, 3 }, 1, DimsHW{ 0, 0 }, \"features.1\");\n        relu0 = basicConv2d(network, weightMap, *relu0 -> getOutput(0), 64, DimsHW{ 3, 3 }, 1, DimsHW{ 1, 1 }, \"features.2\");\n\n        auto cat0 = mixed_3a(network, weightMap, *relu0 -> getOutput(0), \"features.3\");\n        cat0 = mixed_4a(network, weightMap, *cat0 -> getOutput(0), \"features.4\");\n        cat0 = mixed_5a(network, weightMap, *cat0 -> getOutput(0), \"features.5\");\n        cat0 = inceptionA(network, weightMap, *cat0 -> getOutput(0), \"features.6\");\n        cat0 = inceptionA(network, weightMap, *cat0 -> getOutput(0), \"features.7\");\n        cat0 = inceptionA(network, weightMap, *cat0 -> getOutput(0), \"features.8\");\n        cat0 = inceptionA(network, weightMap, *cat0 -> getOutput(0), \"features.9\");\n        cat0 = reductionA(network, weightMap, *cat0 -> getOutput(0), \"features.10\");\n\n        cat0 = inceptionB(network, weightMap, *cat0 -> getOutput(0), \"features.11\");\n        cat0 = inceptionB(network, weightMap, *cat0 -> getOutput(0), \"features.12\");\n        cat0 = inceptionB(network, weightMap, *cat0 -> getOutput(0), \"features.13\");\n        cat0 = 
inceptionB(network, weightMap, *cat0 -> getOutput(0), \"features.14\");\n        cat0 = inceptionB(network, weightMap, *cat0 -> getOutput(0), \"features.15\");\n        cat0 = inceptionB(network, weightMap, *cat0 -> getOutput(0), \"features.16\");\n        cat0 = inceptionB(network, weightMap, *cat0 -> getOutput(0), \"features.17\");\n        cat0 = reductionB(network, weightMap, *cat0 -> getOutput(0), \"features.18\");\n        \n        cat0 = inceptionC(network, weightMap, *cat0 -> getOutput(0), \"features.19\");\n        cat0 = inceptionC(network, weightMap, *cat0 -> getOutput(0), \"features.20\");\n        cat0 = inceptionC(network, weightMap, *cat0 -> getOutput(0), \"features.21\");\n\n        IPoolingLayer* pool2 = network->addPoolingNd(*cat0->getOutput(0), PoolingType::kAVERAGE, DimsHW{8, 8});\n        assert(pool2);\n\n        IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 1000, weightMap[\"last_linear.weight\"], weightMap[\"last_linear.bias\"]);\n        assert(fc1);\n\n        fc1->getOutput(0)->setName(mParams.outputTensorName);\n        std::cout << \"set name out\" << std::endl;\n        network->markOutput(*fc1->getOutput(0));\n\n        // Build engine\n        builder->setMaxBatchSize(mParams.batchSize);\n        config->setMaxWorkspaceSize(1 << 28);\n        if (mParams.fp16)\n            config->setFlag(BuilderFlag::kFP16);\n        mEngine = builder->buildEngineWithConfig(*network, *config);\n        std::cout << \"build out\" << std::endl;\n\n        // Don't need the network any more\n        network->destroy();\n\n        // Release host memory\n        for (auto& mem : weightMap)\n        {\n            free((void*) (mem.second.values));\n        }\n\n        if (mEngine == nullptr) return false;\n        return true;\n    }\n\n    bool InceptionV4::deserializeCudaEngine() {\n        if (mContext != nullptr && mEngine != nullptr)\n        {\n            return true;\n        }\n\n        if (mEngine == 
nullptr)\n        {\n            char* trtModelStream{nullptr};\n            size_t size{0};\n\n            // open file\n            std::ifstream f(mParams.trtEngineFile, std::ios::binary);\n\n            if (f.good())\n            {\n                // get size\n                f.seekg(0, f.end);\n                size = f.tellg();\n                f.seekg(0, f.beg);\n\n                trtModelStream = new char[size];\n\n                // read data as a block\n                f.read(trtModelStream, size);\n                f.close();\n            }\n\n            if (trtModelStream == nullptr)\n            {\n                return false;\n            }\n\n            // deserialize\n            IRuntime* runtime = createInferRuntime(gLogger);\n            assert(runtime);\n\n            mEngine = runtime -> deserializeCudaEngine(trtModelStream, size, 0);\n            assert(mEngine != nullptr);\n\n            // clean up\n            runtime -> destroy();\n            delete[] trtModelStream;\n        }\n\n        std::cout << \"deserialized engine successfully.\" << std::endl;\n\n        // create execution context\n        mContext = mEngine -> createExecutionContext();\n        assert(mContext != nullptr);\n\n        return true;\n    }\n\n    void InceptionV4::doInference(float* input, float* output, int batchSize) {\n        // Pointers to input and output device buffers to pass to engine.\n        // Engine requires exactly IEngine::getNbBindings() number of buffers.\n        assert(mEngine -> getNbBindings() == 2);\n        void* buffers[2];\n\n        // In order to bind the buffers, we need to know the names of the input and output tensors.\n        // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n        const int inputIndex = mEngine->getBindingIndex(mParams.inputTensorName);\n        const int outputIndex = mEngine->getBindingIndex(mParams.outputTensorName);\n\n        // Create GPU buffers on device\n        
CUDA_CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * mParams.inputH * mParams.inputW * sizeof(float)));\n        CUDA_CHECK(cudaMalloc(&buffers[outputIndex], batchSize * 1000 * sizeof(float)));\n\n        // Create stream\n        cudaStream_t stream;\n        CUDA_CHECK(cudaStreamCreate(&stream));\n\n        // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n        CUDA_CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * mParams.inputH * mParams.inputW * sizeof(float), cudaMemcpyHostToDevice, stream));\n        mContext->enqueue(batchSize, buffers, stream, nullptr);\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * 1000 * sizeof(float), cudaMemcpyDeviceToHost, stream));\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n\n        // Release stream and buffers\n        CUDA_CHECK(cudaStreamDestroy(stream));\n        CUDA_CHECK(cudaFree(buffers[inputIndex]));\n        CUDA_CHECK(cudaFree(buffers[outputIndex]));\n    }\n\n    /**\n     * Cleans up any state created in the InceptionV4 class.\n     * Pointers are reset so that a repeated call is harmless.\n    **/\n    bool InceptionV4::cleanUp()\n    {\n        if (mContext != nullptr)\n        {\n            mContext -> destroy();\n            mContext = nullptr;\n        }\n\n        if (mEngine != nullptr)\n        {\n            mEngine -> destroy();\n            mEngine = nullptr;\n        }\n\n        return true;\n    }\n}\n\n"
  },
  {
    "path": "inception/inceptionv4/inception_v4.h",
    "content": "#ifndef TRTX_INCEPTION_NETWORK_H\n#define TRTX_INCEPTION_NETWORK_H\n\n\n#include <memory>\n#include <vector>\n#include <chrono>\n#include <opencv2/opencv.hpp>\n\n#include \"logging.h\"\n#include \"utils.h\"\n#include \"layers_api.h\"\n\n\nstatic Logger gLogger;\nusing namespace trtxlayers;\n\nnamespace trtx {\n    struct InceptionV4Params\n    {\n        /* data */\n        int32_t batchSize{1};              // Number of inputs in a batch\n        bool int8{false};                  // Allow running the network in Int8 mode.\n        bool fp16{false};                  // Allow running the network in FP16 mode.\n        const char* inputTensorName = \"data\";\n        const char* outputTensorName = \"prob\";\n\n        int inputW;                // The input width of the network.\n        int inputH;                // The input height of the network.\n        int outputSize;           // The output size of the network.\n        std::string weightsFile;   // Weights file name.\n        std::string trtEngineFile; // TensorRT engine file name.\n    };\n    \n    class InceptionV4 {\n    public:\n        InceptionV4(const InceptionV4Params &enginecfg);\n        ~InceptionV4() {};\n\n        bool serializeEngine();                  // create & serialize network engine\n        bool deserializeCudaEngine();\n\n        // Runs the TensorRT network inference engine on a sample.\n        void doInference(float* input, float* output, int batchSize);\n        bool cleanUp();\n    private:\n        bool buildEngine(IBuilder *builder, IBuilderConfig *config);\n    private:\n        InceptionV4Params mParams;\n        ICudaEngine* mEngine;  // The TensorRT engine used to run the network.\n        std::map<std::string, Weights> weightMap; // The weight value map.\n        IExecutionContext* mContext; // The TensorRT execution context to run inference.\n        std::string inception;\n        DataType dt{DataType::kFLOAT};\n    };\n}\n\n#endif"
  },
  {
    "path": "inception/inceptionv4/layers_api.cpp",
    "content": "#include \"layers_api.h\"\n\nnamespace trtxlayers {\n    IScaleLayer* addBatchNorm2d(\n        INetworkDefinition *network, \n        std::map<std::string, Weights>& weightMap, \n        ITensor& input, \n        std::string lname, \n        float eps\n    )\n    {\n        float *gamma = (float*)weightMap[lname + \".weight\"].values;\n        float *beta = (float*)weightMap[lname + \".bias\"].values;\n        float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n        float *var = (float*)weightMap[lname + \".running_var\"].values;\n        int len = weightMap[lname + \".running_var\"].count;\n        std::cout << \"len \" << len << std::endl;\n\n        float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n        for (int i = 0; i < len; i++) {\n            scval[i] = gamma[i] / sqrt(var[i] + eps);\n        }\n        Weights scale{DataType::kFLOAT, scval, len};\n\n        float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n        for (int i = 0; i < len; i++) {\n            shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n        }\n        Weights shift{DataType::kFLOAT, shval, len};\n\n        float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n        for (int i = 0; i < len; i++) {\n            pval[i] = 1.0;\n        }\n        Weights power{DataType::kFLOAT, pval, len};\n\n        weightMap[lname + \".scale\"] = scale;\n        weightMap[lname + \".shift\"] = shift;\n        weightMap[lname + \".power\"] = power;\n        IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n        assert(scale_1);\n        return scale_1;\n    }\n\n    IActivationLayer* basicConv2d(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        int outch, \n        DimsHW ksize, \n        int s, \n        DimsHW p, \n        std::string lname\n    )\n    {\n        // 
empty wts for bias\n        Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n        // add conv -> bn -> relu\n        IConvolutionLayer* conv = network -> addConvolutionNd(input, outch, ksize, weightMap[lname + \".conv.weight\"], emptywts);\n        assert(conv);\n        conv -> setStrideNd(DimsHW{s, s});\n        conv -> setPaddingNd(p);\n\n        IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv -> getOutput(0), lname + \".bn\", 1e-3);\n        \n        IActivationLayer* relu = network -> addActivation(*bn -> getOutput(0), ActivationType::kRELU);\n        assert(relu); \n        return relu;\n    }\n\n    IConcatenationLayer* mixed_3a(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    )\n    {\n        // branch 0\n        IPoolingLayer* pool = network -> addPoolingNd(input, PoolingType::kMAX, DimsHW{3, 3});\n        assert(pool);\n        pool -> setStrideNd(DimsHW{2, 2});\n\n        // branch 1\n        IActivationLayer* relu = basicConv2d(network, weightMap, input, 96, DimsHW{ 3, 3 }, 2, DimsHW{ 0, 0 }, lname + \".conv\");\n        \n        // concatenate two branches\n        ITensor* inputTensors[] = { pool -> getOutput(0), relu -> getOutput(0) };\n        IConcatenationLayer* cat = network -> addConcatenation(inputTensors, 2);\n        assert(cat);\n        return cat;\n    }\n\n    IConcatenationLayer* mixed_4a(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    )\n    {\n        // branch 0\n        IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 64, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch0.0\");\n        relu1 = basicConv2d(network, weightMap, *relu1 -> getOutput(0), 96, DimsHW{ 3, 3 }, 1, DimsHW{ 0, 0 }, lname + \".branch0.1\");\n\n        // branch 1\n        IActivationLayer* relu2 = basicConv2d(network, 
weightMap, input, 64, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch1.0\");\n        relu2 = basicConv2d(network, weightMap, *relu2 -> getOutput(0), 64, DimsHW{ 1, 7 }, 1, DimsHW{ 0, 3 }, lname + \".branch1.1\");\n        relu2 = basicConv2d(network, weightMap, *relu2 -> getOutput(0), 64, DimsHW{ 7, 1 }, 1, DimsHW{ 3, 0 }, lname + \".branch1.2\");\n        relu2 = basicConv2d(network, weightMap, *relu2 -> getOutput(0), 96, DimsHW{ 3, 3 }, 1, DimsHW{ 0, 0 }, lname + \".branch1.3\");\n\n        // concatenate two branches\n        ITensor* inputTensors[] = { relu1 -> getOutput(0), relu2 -> getOutput(0) };\n        IConcatenationLayer* cat = network -> addConcatenation(inputTensors, 2);\n        assert(cat);\n        return cat;\n    }\n\n    IConcatenationLayer* mixed_5a(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    )\n    {\n        std::cout<<\"mixed_5a\"<<std::endl;\n        //branch 0\n        IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 192, DimsHW{ 3, 3 }, 2, DimsHW{ 0, 0 }, lname + \".conv\");\n\n        //branch 1\n        IPoolingLayer* pool1 = network -> addPoolingNd(input, PoolingType::kMAX, DimsHW{ 3, 3 });\n        assert(pool1);\n        pool1 -> setStrideNd(DimsHW{ 2, 2 });\n\n        // concatenate branches\n        ITensor* inputTensors[] = { relu1 -> getOutput(0), pool1 -> getOutput(0)};\n        IConcatenationLayer* cat = network -> addConcatenation(inputTensors, 2);\n        assert(cat);\n        std::cout<<\"mixed_5a done\"<<std::endl;\n        return cat;\n    }\n\n    IConcatenationLayer* inceptionA(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    )\n    {\n        // branch 0\n        IActivationLayer* relu0 = basicConv2d(network, weightMap, input, 96, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch0\");\n\n   
     // branch 1\n        IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 64, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname +\".branch1.0\");\n        relu1 = basicConv2d(network, weightMap, *relu1 -> getOutput(0), 96, DimsHW{ 3, 3 }, 1, DimsHW{ 1, 1 }, lname+\".branch1.1\");\n        \n        // branch 2\n        IActivationLayer* relu2 = basicConv2d(network, weightMap, input, 64, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname+\".branch2.0\");\n        relu2 = basicConv2d(network, weightMap, *relu2 -> getOutput(0), 96, DimsHW{ 3, 3 }, 1, DimsHW{ 1, 1 }, lname+\".branch2.1\");\n        relu2 = basicConv2d(network, weightMap, *relu2 -> getOutput(0), 96, DimsHW{ 3, 3 }, 1, DimsHW{ 1, 1 }, lname+\".branch2.2\");\n\n        // branch 3\n        IPoolingLayer* pool1 = network->addPoolingNd(input, PoolingType::kAVERAGE, DimsHW{3, 3});\n        assert(pool1);\n        pool1->setStrideNd(DimsHW{1, 1});\n        pool1->setPaddingNd(DimsHW{1, 1});\n        pool1->setAverageCountExcludesPadding(false);\n        IActivationLayer* relu3 = basicConv2d(network, weightMap, *pool1 -> getOutput(0), 96, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname+\".branch3.1\");\n\n        // concatenate all branches outputs\n        ITensor* inputTensors[] = { relu0 -> getOutput(0), relu1 -> getOutput(0), relu2 -> getOutput(0), relu3 -> getOutput(0)};\n        IConcatenationLayer* cat = network -> addConcatenation(inputTensors, 4);\n        assert(cat);\n        return cat;\n\n    }\n\n    IConcatenationLayer* reductionA(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    )\n    {\n        // features 10 branch 0\n        IActivationLayer* relu0 = basicConv2d(network, weightMap, input, 384, DimsHW{ 3, 3 }, 2, DimsHW{ 0, 0 }, lname + \".branch0\");\n\n        // branch 1\n        IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 192, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + 
\".branch1.0\");\n        relu1 = basicConv2d(network, weightMap, *relu1 -> getOutput(0), 224, DimsHW{ 3, 3 }, 1, DimsHW{ 1, 1 }, lname + \".branch1.1\");\n        relu1 = basicConv2d(network, weightMap, *relu1 -> getOutput(0), 256, DimsHW{ 3, 3 }, 2, DimsHW{ 0, 0 }, lname + \".branch1.2\");\n\n        // branch 2\n        IPoolingLayer* pool1 = network -> addPoolingNd(input, PoolingType::kMAX, DimsHW{ 3, 3 });\n        assert(pool1);\n        pool1 -> setStrideNd(DimsHW{ 2, 2 });\n\n        // concatenate\n        ITensor* inputTensors[] = { relu0 -> getOutput(0), relu1 -> getOutput(0), pool1 -> getOutput(0) };\n        IConcatenationLayer* cat = network -> addConcatenation(inputTensors, 3);\n        assert(cat);\n        return cat;\n    }\n\n    IConcatenationLayer* inceptionB(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    )\n    {\n        // features 11 branch 0\n        IActivationLayer* relu0 = basicConv2d(network, weightMap, input, 384, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch0\");\n\n        // branch 1\n        IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 192, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch1.0\");\n        relu1 = basicConv2d(network, weightMap, *relu1 -> getOutput(0), 224, DimsHW{ 1, 7 }, 1, DimsHW{ 0, 3 }, lname + \".branch1.1\");\n        relu1 = basicConv2d(network, weightMap, *relu1 -> getOutput(0), 256, DimsHW{ 7, 1 }, 1, DimsHW{ 3, 0 }, lname + \".branch1.2\");\n        \n        // branch 2\n        IActivationLayer* relu2 = basicConv2d(network, weightMap, input, 192, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch2.0\");\n        relu2 = basicConv2d(network, weightMap, *relu2 -> getOutput(0), 192, DimsHW{ 7, 1 }, 1, DimsHW{ 3, 0 }, lname + \".branch2.1\");\n        relu2 = basicConv2d(network, weightMap, *relu2 -> getOutput(0), 224, DimsHW{ 1, 7 }, 1, DimsHW{ 0, 3 }, lname + 
\".branch2.2\");\n        relu2 = basicConv2d(network, weightMap, *relu2 -> getOutput(0), 224, DimsHW{ 7, 1 }, 1, DimsHW{ 3, 0 }, lname + \".branch2.3\");\n        relu2 = basicConv2d(network, weightMap, *relu2 -> getOutput(0), 256, DimsHW{ 1, 7 }, 1, DimsHW{ 0, 3 }, lname + \".branch2.4\");\n\n        // branch 3\n        IPoolingLayer* pool0 = network -> addPoolingNd(input, PoolingType::kAVERAGE, DimsHW{ 3, 3 });\n        assert(pool0);\n        pool0 -> setStrideNd(DimsHW{ 1, 1 });\n        pool0 -> setPaddingNd(DimsHW{ 1, 1 });\n        pool0 -> setAverageCountExcludesPadding(false);\n        IActivationLayer* relu3 = basicConv2d(network, weightMap, *pool0 -> getOutput(0), 128, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch3.1\");\n\n        // concatenate branches\n        ITensor* inputTensors[] = { relu0 -> getOutput(0), relu1 -> getOutput(0), relu2 -> getOutput(0), relu3 -> getOutput(0) };\n        IConcatenationLayer* cat = network -> addConcatenation(inputTensors, 4);\n        assert(cat);\n\n        return cat;\n    }\n\n    IConcatenationLayer* reductionB(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    )\n    {\n        // features 18 branch 0\n        IActivationLayer* relu0 = basicConv2d(network, weightMap, input, 192, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch0.0\");\n        relu0 = basicConv2d(network, weightMap, *relu0 -> getOutput(0), 192, DimsHW{ 3, 3 }, 2, DimsHW{ 0, 0 }, lname + \".branch0.1\");\n\n        // branch 1\n        IActivationLayer* relu1 = basicConv2d(network, weightMap, input, 256, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch1.0\");\n        relu1 = basicConv2d(network, weightMap, *relu1 -> getOutput(0), 256, DimsHW{ 1, 7 }, 1, DimsHW{ 0, 3 }, lname + \".branch1.1\");\n        relu1 = basicConv2d(network, weightMap, *relu1 -> getOutput(0), 320, DimsHW{ 7, 1 }, 1, DimsHW{ 3, 0 }, lname + \".branch1.2\");\n 
       relu1 = basicConv2d(network, weightMap, *relu1 -> getOutput(0), 320, DimsHW{ 3, 3 }, 2, DimsHW{ 0, 0 }, lname + \".branch1.3\");\n\n        // branch 2\n        IPoolingLayer* pool1 = network -> addPoolingNd(input, PoolingType::kMAX, DimsHW{ 3, 3 });\n        assert(pool1);\n        pool1 -> setStrideNd(DimsHW{ 2, 2 });\n\n        // concatenate\n        ITensor* inputTensors[] = { relu0 -> getOutput(0), relu1 -> getOutput(0), pool1 -> getOutput(0) };\n        IConcatenationLayer* cat = network -> addConcatenation(inputTensors, 3);\n        assert(cat);\n\n        return cat;\n    }\n\n    IConcatenationLayer* inceptionC(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    )\n    {\n\n        // features 19 branch 0\n        IActivationLayer* relu0 = basicConv2d(network, weightMap, input, 256, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch0\");\n\n        // branch 1\n        IActivationLayer* relu1_0 = basicConv2d(network, weightMap, input, 384, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch1_0\");\n        IActivationLayer* relu1_1a = basicConv2d(network, weightMap, *relu1_0 -> getOutput(0), 256, DimsHW{ 1, 3 }, 1, DimsHW{ 0, 1 }, lname + \".branch1_1a\");\n        IActivationLayer* relu1_1b = basicConv2d(network, weightMap, *relu1_0 -> getOutput(0), 256, DimsHW{ 3, 1 }, 1, DimsHW{ 1, 0 }, lname + \".branch1_1b\");\n        ITensor* inputTensors1[] = { relu1_1a -> getOutput(0), relu1_1b -> getOutput(0) };\n        IConcatenationLayer* cat1 = network -> addConcatenation(inputTensors1, 2);\n        assert(cat1);\n\n        // branch 2\n        IActivationLayer* relu2_0 = basicConv2d(network, weightMap, input, 384, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch2_0\");\n        IActivationLayer* relu2_1 = basicConv2d(network, weightMap, *relu2_0 -> getOutput(0), 448, DimsHW{ 3, 1 }, 1, DimsHW{ 1, 0 }, lname + \".branch2_1\");\n        
IActivationLayer* relu2_2 = basicConv2d(network, weightMap, *relu2_1 -> getOutput(0), 512, DimsHW{ 1, 3 }, 1, DimsHW{ 0, 1 }, lname + \".branch2_2\");\n        IActivationLayer* relu2_3a = basicConv2d(network, weightMap, *relu2_2 -> getOutput(0), 256, DimsHW{ 1, 3 }, 1, DimsHW{ 0, 1 }, lname + \".branch2_3a\");\n        IActivationLayer* relu2_3b = basicConv2d(network, weightMap, *relu2_2 -> getOutput(0), 256, DimsHW{ 3, 1 }, 1, DimsHW{ 1, 0 }, lname + \".branch2_3b\");\n        ITensor* inputTensors2[] = { relu2_3a -> getOutput(0), relu2_3b -> getOutput(0) };\n        IConcatenationLayer* cat2 = network -> addConcatenation(inputTensors2, 2);\n        assert(cat2);\n\n        // branch 3\n        IPoolingLayer* pool3 = network -> addPoolingNd(input, PoolingType::kAVERAGE, DimsHW{ 3, 3 });\n        assert(pool3);\n        pool3 -> setStrideNd(DimsHW{ 1, 1 });\n        pool3 -> setPaddingNd(DimsHW{ 1, 1 });\n        pool3 -> setAverageCountExcludesPadding(false);\n        IActivationLayer* relu3 = basicConv2d(network, weightMap, *pool3 -> getOutput(0), 256, DimsHW{ 1, 1 }, 1, DimsHW{ 0, 0 }, lname + \".branch3.1\");\n\n        // concatenate\n        ITensor* inputTensors[] = { relu0 -> getOutput(0), cat1 -> getOutput(0), cat2 -> getOutput(0), relu3 -> getOutput(0) };\n        IConcatenationLayer* cat = network -> addConcatenation(inputTensors, 4);\n        assert(cat);\n        return cat;\n    }\n}"
  },
  {
    "path": "inception/inceptionv4/layers_api.h",
    "content": "#ifndef TRTX_LAYERS_API_H\n#define TRTX_LAYERS_API_H\n\n#include <map>\n#include <math.h>\n#include <assert.h>\n#include <iostream>\n\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n\nusing namespace nvinfer1;\n\nnamespace trtxlayers {\n\n    // Declare your layers here\n    IScaleLayer* addBatchNorm2d(\n        INetworkDefinition *network, \n        std::map<std::string, Weights>& weightMap, \n        ITensor& input, \n        std::string lname, \n        float eps\n    );\n\n    IActivationLayer* basicConv2d(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        int outch, \n        DimsHW ksize, \n        int s, \n        DimsHW p, \n        std::string lname\n    );\n\n    IConcatenationLayer* mixed_3a(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    );\n\n    IConcatenationLayer* mixed_4a(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    );\n\n    IConcatenationLayer* mixed_5a(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    );\n\n    IConcatenationLayer* inceptionA(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    );\n\n    IConcatenationLayer* reductionA(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    );\n    \n    IConcatenationLayer* inceptionB(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    );\n\n    IConcatenationLayer* reductionB(\n        
INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    );\n\n    IConcatenationLayer* inceptionC(\n        INetworkDefinition *network,\n        std::map<std::string, Weights>& weightMap, \n        ITensor& input,  \n        std::string lname\n    );\n}\n\n#endif  // TRTX_LAYERS_API_H"
  },
  {
    "path": "inception/inceptionv4/logging.h",
    "content": "/*\n * Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n        , mPrefix(other.mPrefix)\n        , mShouldLog(other.mShouldLog)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer 
consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n            {\n                ss << \" \";\n            }\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//!         
(\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H"
  },
  {
    "path": "inception/inceptionv4/main.cpp",
    "content": "#include \"inception_v4.h\"\n\n\n/**\n * Initializes Inception class params in the \n * InceptionV4Params structure.\n**/\ntrtx::InceptionV4Params initializeParams()\n{\n    trtx::InceptionV4Params params;\n\n    params.batchSize = 1;\n    params.fp16 = false;\n\n    params.inputH = 299;\n    params.inputW = 299;\n    params.outputSize = 1000;\n\n    // change weights file name here\n    params.weightsFile = \"../inceptionV4.wts\";\n\n    // change engine file name here\n    params.trtEngineFile = \"inceptionV4.engine\";\n    return params;\n}\n\n\nint main(int argc, char** argv){\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./inception -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./inception -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    trtx::InceptionV4Params params = initializeParams();\n    trtx::InceptionV4 inceptionV4(params);\n\n    if (std::string(argv[1]) == \"-s\") {\n        // check if engine exists already\n        std::ifstream f(params.trtEngineFile, std::ios::binary);\n\n        // if engine does not exists build, serialize and save\n        if(!f.good())\n        {\n            std::cout << \"Building network ...\" << std::endl;\n            f.close();\n            inceptionV4.serializeEngine();\n        }\n\n        return 1;\n    } \n    else if(std::string(argv[1]) == \"-d\")\n    {\n        // deserialize\n        inceptionV4.deserializeCudaEngine();\n    }\n\n    // create data\n    float data[3 * params.inputH * params.inputW];\n    for(int i=0; i<3*params.inputH*params.inputW; i++)\n    {\n        data[i] = 1.0;\n    }\n    \n    // run inference\n    float prob[params.outputSize];\n    for(int i=0; i<100; i++)\n    {\n        auto start = std::chrono::system_clock::now();\n        inceptionV4.doInference(data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // cleanup\n    bool cleaned = inceptionV4.cleanUp();\n    \n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < params.outputSize; i++)\n    {\n        std::cout << prob[i] << \", \";\n        if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    }\n    std::cout << std::endl;\n\n    return 0;\n}"
  },
  {
    "path": "inception/inceptionv4/utils.cpp",
    "content": "# include \"utils.h\"\n\n\n// Load weights from files.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n} "
  },
  {
    "path": "inception/inceptionv4/utils.h",
    "content": "# ifndef TRTX_UTILS_H\n# define TRTX_UTILS_H\n\n#include <map>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"assert.h\"\n#include <fstream>\n#include <iostream>\n#include <memory>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)\\\n    {\\\n        cudaError_t error_code = callstr;\\\n        if (error_code != cudaSuccess) {\\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__;\\\n            assert(0);\\\n        }\\\n    }\n#endif  // CUDA_CHECK\n\nusing namespace nvinfer1;\n\nstd::map<std::string, Weights> loadWeights(const std::string input);\n\n#endif // TRTX_UTILS_H"
  },
  {
    "path": "lenet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nproject(\n  lenet\n  VERSION 0.1\n  LANGUAGES C CXX CUDA)\n\nif(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)\n  set(CMAKE_CUDA_ARCHITECTURES\n      75\n      80\n      86\n      89\n      90\n      100\n      120)\nendif()\n\nset(CMAKE_CXX_STANDARD 17)\nset(CMAKE_CXX_STANDARD_REQUIRED ON)\nset(CMAKE_CUDA_STANDARD 17)\nset(CMAKE_CUDA_STANDARD_REQUIRED ON)\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\nset(CMAKE_INCLUDE_CURRENT_DIR TRUE)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use static cudaruntime library\" OFF)\n\nfind_package(Threads REQUIRED)\nfind_package(CUDAToolkit REQUIRED)\nfind_package(OpenCV REQUIRED)\n\nif(NOT TARGET TensorRT::TensorRT)\n  include(FindTensorRT.cmake)\nelse()\n  message(\"TensorRT has been found, skipping for ${PROJECT_NAME}\")\nendif()\n\nadd_executable(${PROJECT_NAME} lenet.cpp)\n\ntarget_include_directories(${PROJECT_NAME} PRIVATE ${CMAKE_CURRENT_LIST_DIR}\n                                                   ${OpenCV_INCLUDE_DIRS})\n\ntarget_link_libraries(${PROJECT_NAME} PUBLIC Threads::Threads CUDA::cudart\n                                             TensorRT::TensorRT ${OpenCV_LIBS})\n\nif(WIN32)\n  set_target_properties(\n    ${PROJECT_NAME} PROPERTIES MSVC_RUNTIME_LIBRARY\n                               \"MultiThreaded$<$<CONFIG:Debug>:Debug>\")\nendif()\n"
  },
  {
    "path": "lenet/FindTensorRT.cmake",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nfunction(_guess_path var_name required_files)\n  set(_result \"\")\n\n  foreach(path_entry IN LISTS ARGN)\n    if(NOT EXISTS \"${path_entry}\")\n      message(DEBUG \"skip non-existing path '${path_entry}'\")\n      continue()\n    endif()\n\n    set(_ok TRUE)\n    foreach(required_file IN LISTS required_files)\n      if(NOT EXISTS \"${path_entry}/${required_file}\")\n        set(_ok FALSE)\n        message(DEBUG \"'${path_entry}' missing '${required_file}'\")\n        break()\n      endif()\n    endforeach()\n\n    if(_ok)\n      list(APPEND _result \"${path_entry}\")\n      message(DEBUG \"accept '${path_entry}'\")\n    else()\n      message(DEBUG \"reject '${path_entry}'\")\n    endif()\n  endforeach()\n\n  if(_result STREQUAL \"\")\n    message(\n      FATAL_ERROR\n        \"_guess_path(${var_name}) failed: no valid path found. required_files='${required_files}' candidates='${ARGN}'\"\n    )\n  endif()\n\n  set(${var_name}\n      \"${_result}\"\n      PARENT_SCOPE)\nendfunction()\n\n# add library\nadd_library(TensorRT IMPORTED INTERFACE)\nadd_library(TensorRT::TensorRT ALIAS TensorRT)\n\nset(TRT_VERSION\n    CACHE\n      STRING\n      \"TensorRT version, e.g. 
\\\"8.6.1.6\\\" or \\\"8.6.1.6+cuda12.0.1.011\\\", \\\"8.6.1.6.Windows10.x86_64.cuda-12.0\\\" etc\"\n)\n\nif(NOT TRT_VERSION STREQUAL \"\" AND NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  message(\n    WARNING\n      \"TRT_VERSION defined by cmake and environment variable both, using the later one\"\n  )\nendif()\n\nif(NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  set(TRT_VERSION $ENV{TRT_VERSION})\nendif()\n\nstring(REGEX MATCH \"([0-9]+)\" _match ${TRT_VERSION})\nset(TRT_MAJOR_VERSION \"${_match}\")\nunset(_match)\n\nif(WIN32)\n  set(TensorRT_DIR \"C:/Program Files/TensorRT-${TRT_VERSION}\")\n  if(NOT EXISTS \"${TensorRT_DIR}\")\n    message(\n      FATAL_ERROR\n        \"TensorRT_DIR=${TensorRT_DIR} does not exist!\"\n    )\n  endif()\n\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 10)\n    set(_modules nvinfer_10 nvinfer_plugin_10 nvinfer_vc_plugin_10\n                 nvinfer_dispatch_10 nvinfer_lean_10)\n    message(DEBUG \"Using ${_modules}\")\n  else()\n    set(_modules nvinfer nvinfer_plugin nvinfer_vc_plugin nvinfer_dispatch\n                 nvinfer_lean)\n  endif()\n\n  set(TensorRT_LIBRARY_DIR \"${TensorRT_DIR}/lib\")\n  set(TensorRT_INCLUDE_DIR \"${TensorRT_DIR}/include\")\nelseif(UNIX)\n  string(TOLOWER \"${CMAKE_SYSTEM_PROCESSOR}\" _trt_arch)\n  set(_trt_include_candidates)\n  if(_trt_arch MATCHES \"^(aarch64|arm64|arch64)$\")\n    set(_trt_include_candidates \"/usr/include/aarch64-linux-gnu\" \"/usr/include\"\n                                \"/usr/local/cuda/targets/aarch64-linux/include\")\n    set(_trt_library_candidates\n        \"/usr/local/tensorrt/targets/aarch64-linux-gnu/lib\"\n        \"/usr/lib/aarch64-linux-gnu\" \"/usr/lib/aarch64-linux-gnu/tegra\"\n        \"/usr/lib\")\n  elseif(_trt_arch MATCHES \"^(x86_64|amd64)$\")\n    set(_trt_include_candidates\n        \"/usr/local/tensorrt/targets/x86_64-linux-gnu/include\"\n        \"/usr/include/x86_64-linux-gnu\" \"/usr/include\")\n    set(_trt_library_candidates\n        
\"/usr/local/tensorrt/targets/x86_64-linux-gnu/lib\"\n        \"/usr/lib/x86_64-linux-gnu\" \"/usr/lib\")\n  else()\n    message(FATAL_ERROR \"Unknown architecture\")\n  endif()\n\n  set(_modules nvinfer nvinfer_plugin)\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 8)\n    list(APPEND _modules nvinfer_vc_plugin nvinfer_dispatch nvinfer_lean)\n  endif()\n\n  _guess_path(TensorRT_LIBRARY_DIR \"libnvinfer.so;libnvinfer_plugin.so\"\n              ${_trt_library_candidates})\n  message(STATUS \"TensorRT libraries: ${TensorRT_LIBRARY_DIR}\")\n  _guess_path(TensorRT_INCLUDE_DIR \"NvInfer.h\" ${_trt_include_candidates})\n  message(STATUS \"TensorRT includes: ${TensorRT_INCLUDE_DIR}\")\nendif()\n\nforeach(lib IN LISTS _modules)\n  find_library(\n    TensorRT_${lib}_LIBRARY\n    NAMES ${lib}\n    HINTS ${TensorRT_LIBRARY_DIR})\n  list(APPEND TensorRT_LIBRARIES ${TensorRT_${lib}_LIBRARY})\nendforeach()\n\ntarget_link_libraries(TensorRT INTERFACE ${TensorRT_LIBRARIES})\n\nmessage(STATUS \"Found TensorRT libs: ${TensorRT_LIBRARIES}\")\n\nset_target_properties(\n  TensorRT\n  PROPERTIES C_STANDARD 17\n             CXX_STANDARD 17\n             POSITION_INDEPENDENT_CODE ON\n             SKIP_BUILD_RPATH TRUE\n             BUILD_WITH_INSTALL_RPATH TRUE\n             INSTALL_RPATH \"$ORIGIN\"\n             INTERFACE_INCLUDE_DIRECTORIES \"${TensorRT_INCLUDE_DIR}\")\n\nunset(TRT_MAJOR_VERSION)\nunset(_modules)\nunset(_trt_include_candidates)\nunset(_trt_library_candidates)\nunset(_trt_arch)\n"
  },
  {
    "path": "lenet/README.md",
    "content": "# lenet5\n\nlenet5 is one of the simplest net in this repo. You can learn the basic procedures of building CNN from TensorRT API. This demo includes 2 major steps:\n\n1. Build engine\n   - define network\n   - set input/output\n   - serialize model to `.engine` file\n2. Do inference\n   - load and deserialize model from `.engine` file\n   - run inference\n\n## Usage\n\n1. download pt model from `https://github.com/SunnyHaze/LeNet5-MNIST-Pytorch/blob/main/model.pt`\n\n2. run `gen_wts.py` to generate `.wts` file\n\n```bash\npython3 gen_wts.py\n```\n\noutput looks like:\n\n```bash\nlenet out shape: torch.Size([1, 10])\nlenet out: [tensor([0.0725, 0.0730, 0.1056, 0.1201, 0.1059, 0.0741, 0.1328, 0.0953, 0.1230,\n        0.0975])]\ninference result: 6\n```\n\n3. build C++ code\n\n```bash\ncd tensorrtx/lenet\ncmake -S . -B build\ncmake --build build\n```\n\n4. serialize wts model to engine file\n\n```bash\n./build/lenet -s\n```\n\n5. run inference\n\n```bash\n./build/lenet -d\n```\n\noutput looks like:\n\n```bash\n...\nExecution time: 32us\n0.09727, 0.09732, 0.1005, 0.102, 0.1006, 0.09743, 0.1033, 0.09951, 0.1023, 0.09973,\n====\nExecution time: 33us\n0.09727, 0.09732, 0.1005, 0.102, 0.1006, 0.09743, 0.1033, 0.09951, 0.1023, 0.09973,\n====\nprediction result:\nTop: 0 idx: 6, logits: 0.1033, label: 6\nTop: 1 idx: 8, logits: 0.1023, label: 8\nTop: 2 idx: 3, logits: 0.102, label: 3\n```\n\n## Tripy (New TensorRT Python Programming Model)\n\n1. Generate `lenet5.wts`\n\n2. Copy `lenet5.wts` into [tensorrtx/lenet](./)\n\n3. Install Tripy:\n\n   ```bash\n   python3 -m pip install nvtripy -f https://nvidia.github.io/TensorRT-Incubator/packages.html\n   ```\n\n4. Change directories:\n\n   ```bash\n   cd tensorrtx/lenet\n   ```\n\n5. Compile and save the model:\n\n   ```bash\n   python3 lenet_tripy.py -s\n   ```\n\n6. Load and run the model:\n\n   ```bash\n   python3 lenet_tripy.py -d\n   ```\n"
  },
  {
    "path": "lenet/gen_wts.py",
    "content": "import struct\nfrom collections import OrderedDict\n\nimport cv2\nimport numpy as np\nimport torch\nimport torch.nn as nn\n\n\nclass LeNet(nn.Module):\n    def __init__(self):\n        super(LeNet, self).__init__()\n        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0)\n        self.relu1 = nn.ReLU()\n        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)\n\n        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0)\n        self.relu2 = nn.ReLU()\n        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)\n\n        self.fc1 = nn.Linear(400, 120)\n        self.relu3 = nn.ReLU()\n        self.fc2 = nn.Linear(120, 84)\n        self.relu4 = nn.ReLU()\n        self.fc3 = nn.Linear(84, 10)\n\n    def forward(self, x):\n        y = self.conv1(x)\n        y = self.relu1(y)\n        y = self.pool1(y)\n\n        y = self.conv2(y)\n        y = self.relu2(y)\n        y = self.pool2(y)\n\n        y = y.view(y.shape[0], -1)\n\n        y = self.fc1(y)\n        y = self.relu3(y)\n\n        y = self.fc2(y)\n        y = self.relu4(y)\n\n        y = self.fc3(y)\n        return y\n\n\ndef reformat_state_dict(state: OrderedDict) -> OrderedDict:\n    mapping: dict[str, str] = {\n        \"layer1.0.weight\": \"conv1.weight\",\n        \"layer1.0.bias\": \"conv1.bias\",\n        \"layer1.3.weight\": \"conv2.weight\",\n        \"layer1.3.bias\": \"conv2.bias\",\n        \"layer2.0.weight\": \"fc1.weight\",\n        \"layer2.0.bias\": \"fc1.bias\",\n        \"layer2.2.weight\": \"fc2.weight\",\n        \"layer2.2.bias\": \"fc2.bias\",\n        \"layer2.4.weight\": \"fc3.weight\",\n        \"layer2.4.bias\": \"fc3.bias\",\n    }\n    for i, j in mapping.items():\n        state.setdefault(j, state.pop(i))\n    return state\n\n\ndef main():\n    model = LeNet()\n    model.eval()\n    with torch.inference_mode():\n        img = cv2.imread(\"../assets/6.pgm\", cv2.IMREAD_GRAYSCALE)\n        img = cv2.resize(img, (32, 32), 
interpolation=cv2.INTER_LINEAR)\n        img = (((img / 255.0) - 0.1307) / 0.3081).astype(np.float32)\n        state = torch.load(\"../models/model.pt\", weights_only=False)\n        state = reformat_state_dict(state[\"state_dict\"])\n        model.load_state_dict(state)\n        input = torch.from_numpy(img)[None, None, ...]\n        out = model(input)\n        print(f\"lenet output shape: {out.shape}\")\n        print(f\"lenet output: {out}\")\n        print(f\"inference result for MNIST data: {int(torch.argmax(out, 1))}\")\n\n    # save to wts\n    print(\"Writing into lenet.wts\")\n    with open(\"../models/lenet.wts\", \"w\") as f:\n        f.write(\"{}\\n\".format(len(model.state_dict().keys())))\n        for k, v in model.state_dict().items():\n            vr = v.reshape(-1).cpu().numpy()\n            f.write(\"{} {} \".format(k, len(vr)))\n            for vv in vr:\n                f.write(\" \")\n                f.write(struct.pack(\">f\", float(vv)).hex())\n            f.write(\"\\n\")\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "lenet/lenet.cpp",
    "content": "#include <NvInfer.h>\n#include <cassert>\n#include <chrono>\n#include <cmath>\n#include <exception>\n#include <filesystem>\n#include <map>\n#include <opencv2/opencv.hpp>\n#include <vector>\n#include \"logging.h\"\n#include \"utils.h\"\n\nusing M = nvinfer1::MatrixOperation;\nusing E = nvinfer1::ElementWiseOperation;\n\n// parameters we know about the lenet-5\nconstexpr static const int64_t INPUT_H = 32;\nconstexpr static const int64_t INPUT_W = 32;\nconstexpr static const std::array<const char*, 2> NAMES = {\"data\", \"prob\"};\nconstexpr static const std::array<const int64_t, 2> SIZES = {1ll * INPUT_H * INPUT_W, 10};\nconstexpr static const char* WTS_PATH = \"../models/lenet.wts\";\nconstexpr static const char* ENGINE_PATH = \"../models/lenet.engine\";\n\nstatic Logger gLogger;\n\n/**\n * @brief Creat the engine using only the API and not any parser.\n *\n * @param N max batch size\n * @param runtime runtime\n * @param builder builder\n * @param config config\n * @param dt data type\n * @return ICudaEngine*\n */\nICudaEngine* createLenetEngine(int32_t N, IRuntime* runtime, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n#if TRT_VERSION >= 11200\n    auto flag = 1U << static_cast<int>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);\n#elif TRT_VERSION >= 10000\n    auto flag = 0U;\n#else\n    auto flag = 1U << static_cast<int>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);\n#endif\n    auto* network = builder->createNetworkV2(flag);\n\n    // Create input tensor of shape { 1, 1, 32, 32 } with name INPUT_NAME\n    ITensor* data = network->addInput(NAMES[0], dt, Dims4{N, 1, INPUT_H, INPUT_W});\n    assert(data);\n\n    // Add convolution layer with 6 outputs and a 5x5 filter.\n    std::filesystem::path wts_path{WTS_PATH};\n    wts_path = std::filesystem::absolute(wts_path);\n    std::map<std::string, Weights> weightMap = loadWeights(wts_path.string());\n    auto* conv1 = network->addConvolutionNd(*data, 6, DimsHW{5, 5}, 
weightMap[\"conv1.weight\"], weightMap[\"conv1.bias\"]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{1, 1});\n    conv1->setName(\"conv1\");\n\n    // Add activation layer using the ReLU algorithm.\n    IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    relu1->setName(\"relu1\");\n\n    // Add max pooling layer with stride of 2x2 and kernel size of 2x2.\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    pool1->setName(\"pool1\");\n\n    // Add second convolution layer with 16 outputs and a 5x5 filter.\n    auto* conv2 = network->addConvolutionNd(*pool1->getOutput(0), 16, DimsHW{5, 5}, weightMap[\"conv2.weight\"],\n                                            weightMap[\"conv2.bias\"]);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{1, 1});\n    conv2->setName(\"conv2\");\n\n    // Add activation layer using the ReLU algorithm.\n    IActivationLayer* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    // Add second max pooling layer with stride of 2x2 and kernel size of 2x2.\n    IPoolingLayer* pool2 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{2, 2});\n    pool2->setName(\"pool2\");\n\n    // Add fully connected layer\n    auto* flatten = network->addShuffle(*pool2->getOutput(0));\n    flatten->setReshapeDimensions(Dims2{-1, 400});\n    auto* tensor_fc1w = network->addConstant(Dims2{120, 400}, weightMap[\"fc1.weight\"])->getOutput(0);\n    auto* fc1w = network->addMatrixMultiply(*tensor_fc1w, M::kNONE, *flatten->getOutput(0), M::kTRANSPOSE);\n    assert(tensor_fc1w && fc1w);\n    auto tensor_fc1b = network->addConstant(Dims2{120, 1}, weightMap[\"fc1.bias\"])->getOutput(0);\n    auto* fc1b = network->addElementWise(*fc1w->getOutput(0), 
*tensor_fc1b, E::kSUM);\n    fc1b->setName(\"fc1b\");\n    assert(tensor_fc1b && fc1b);\n\n    // Add activation layer using the ReLU algorithm.\n    IActivationLayer* relu3 = network->addActivation(*fc1b->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    auto* flatten_relu3 = network->addShuffle(*relu3->getOutput(0));\n    flatten_relu3->setReshapeDimensions(Dims2{-1, 120});\n\n    auto* fc2w = network->addConstant(Dims2{84, 120}, weightMap[\"fc2.weight\"])->getOutput(0);\n    auto* fc2b = network->addConstant(Dims2{84, 1}, weightMap[\"fc2.bias\"])->getOutput(0);\n    auto* fc3w = network->addConstant(Dims2{10, 84}, weightMap[\"fc3.weight\"])->getOutput(0);\n    auto* fc3b = network->addConstant(Dims2{10, 1}, weightMap[\"fc3.bias\"])->getOutput(0);\n    assert(fc2w && fc2b && fc3w && fc3b);\n\n    // fully connected layer with relu\n    auto* fc2_0 = network->addMatrixMultiply(*fc2w, M::kNONE, *flatten_relu3->getOutput(0), M::kTRANSPOSE);\n    assert(fc2_0);\n    fc2_0->setName(\"fc2\");\n    auto* fc2_1 = network->addElementWise(*fc2_0->getOutput(0), *fc2b, E::kSUM);\n    assert(fc2_1);\n    IActivationLayer* relu4 = network->addActivation(*fc2_1->getOutput(0), ActivationType::kRELU);\n    assert(relu4);\n    auto* shuffle = network->addShuffle(*relu4->getOutput(0));\n    shuffle->setReshapeDimensions(Dims2{-1, 84});\n    auto* fc3_0 = network->addMatrixMultiply(*fc3w, M::kNONE, *shuffle->getOutput(0), M::kTRANSPOSE);\n    assert(fc3_0);\n    auto* fc3_1 = network->addElementWise(*fc3_0->getOutput(0), *fc3b, E::kSUM);\n    assert(fc3_1);\n    // clang-format on\n\n    // Add softmax layer to determine the probability.\n    ISoftMaxLayer* prob = network->addSoftMax(*fc3_1->getOutput(0));\n    assert(prob);\n    prob->getOutput(0)->setName(NAMES[1]);\n    network->markOutput(*prob->getOutput(0));\n\n#if TRT_VERSION >= 8400\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, WORKSPACE_SIZE);\n#else\n    
config->setMaxWorkspaceSize(WORKSPACE_SIZE);\n    builder->setMaxBatchSize(N);\n#endif\n\n    // Build engine\n#if TRT_VERSION >= 8000\n    IHostMemory* serialized_mem = builder->buildSerializedNetwork(*network, *config);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(serialized_mem->data(), serialized_mem->size());\n    delete network;\n#else\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    network->destroy();\n#endif\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\n/**\n * @brief create a model using the API directly and serialize it to a stream\n *\n * @param N max batch size\n * @param runtime runtime\n * @param modelStream\n */\nvoid APIToModel(int32_t N, IRuntime* runtime, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createLenetEngine(N, runtime, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n#if TRT_VERSION >= 8000\n    delete engine;\n    delete config;\n    delete builder;\n#else\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n#endif\n}\n\nstd::vector<std::vector<float>> doInference(IExecutionContext& context, void* input, int64_t batchSize) {\n    const auto& engine = context.getEngine();\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n    std::vector<void*> buffers;\n\n#if TRT_VERSION >= 8000\n    const int32_t nIO = engine.getNbIOTensors();\n#else\n    const int32_t nIO = engine.getNbBindings();\n#endif\n\n    buffers.resize(nIO);\n    for (auto i = 0; i < nIO; ++i) {\n        std::size_t size = 0;\n#if TRT_VERSION >= 8000\n        auto* 
tensor_name = engine.getIOTensorName(i);\n        auto s = getSize(engine.getTensorDataType(tensor_name));\n        size = s * batchSize * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n        context.setTensorAddress(tensor_name, buffers[i]);\n#else\n        const int32_t idx = engine.getBindingIndex(NAMES[i]);\n        auto s = getSize(engine.getBindingDataType(idx));\n        assert(idx == i);\n        size = s * batchSize * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n#endif\n    }\n\n    // Keep the enqueue call outside assert() so inference still runs when NDEBUG is defined.\n#if TRT_VERSION >= 8000\n    const bool enqueued = context.enqueueV3(stream);\n#else\n    const bool enqueued = context.enqueueV2(buffers.data(), stream, nullptr);\n#endif\n    assert(enqueued);\n    (void)enqueued;  // silence unused-variable warnings when NDEBUG is defined\n\n    std::vector<std::vector<float>> prob;\n    for (int i = 1; i < nIO; ++i) {\n        std::vector<float> tmp(batchSize * SIZES[i], std::nanf(\"\"));\n        std::size_t size = batchSize * SIZES[i] * sizeof(float);\n        CHECK(cudaMemcpyAsync(tmp.data(), buffers[i], size, cudaMemcpyDeviceToHost, stream));\n        prob.emplace_back(tmp);\n    }\n    CHECK(cudaStreamSynchronize(stream));\n\n    for (auto& buffer : buffers) {\n        CHECK(cudaFree(buffer));\n    }\n    CHECK(cudaStreamDestroy(stream));\n    return prob;\n}\n\nint main(int argc, char** argv) {\n    try {\n        if (argc != 2) {\n            std::cerr << \"arguments not right!\\n\";\n            std::cerr << \"./lenet -s   // serialize model to plan file\\n\";\n            std::cerr << \"./lenet -d   // deserialize plan file and run inference\\n\";\n            return -1;\n        }\n\n        IRuntime* runtime = createInferRuntime(gLogger);\n        assert(runtime != nullptr);\n\n        char* trtModelStream{nullptr};\n        std::streamsize size{0};\n\n        if (std::string(argv[1]) 
== \"-s\") {\n            IHostMemory* modelStream{nullptr};\n            APIToModel(1, runtime, &modelStream);\n            assert(modelStream != nullptr);\n\n            std::ofstream p(ENGINE_PATH, std::ios::binary | std::ios::trunc);\n            if (!p) {\n                std::cerr << \"could not open plan output file\\n\";\n                return -1;\n            }\n            if (modelStream->size() > static_cast<std::size_t>(std::numeric_limits<std::streamsize>::max())) {\n                std::cerr << \"this model is too large to serialize\\n\";\n                return -1;\n            }\n            const auto* data_ptr = reinterpret_cast<const char*>(modelStream->data());\n            auto data_size = static_cast<std::streamsize>(modelStream->size());\n            p.write(data_ptr, data_size);\n\n#if TRT_VERSION >= 8000\n            delete modelStream;\n#else\n            modelStream->destroy();\n#endif\n            std::cout << \"serialized weights to lenet5.engine\\n\";\n            return 0;\n        } else if (std::string(argv[1]) == \"-d\") {\n            std::ifstream file(ENGINE_PATH, std::ios::binary);\n            if (file.good()) {\n                file.seekg(0, file.end);\n                size = file.tellg();\n                file.seekg(0, file.beg);\n                trtModelStream = new char[size];\n                assert(trtModelStream);\n                file.read(trtModelStream, size);\n                file.close();\n            }\n        } else {\n            return -1;\n        }\n\n        // prepare input/output data\n        auto img = cv::imread(\"../assets/6.pgm\", cv::IMREAD_GRAYSCALE);\n        cv::resize(img, img, cv::Size(32, 32), 0, 0, cv::INTER_LINEAR);\n        assert(img.channels() == 1);\n        img.convertTo(img, CV_32FC1, 0.00392156f, -0.1307f);\n        img = img / cv::Scalar(0.3081);\n        assert(img.total() * img.elemSize() == SIZES[0] * sizeof(float));\n\n#if TRT_VERSION >= 8000\n        ICudaEngine* engine = 
runtime->deserializeCudaEngine(trtModelStream, size);\n#else\n        ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n#endif\n        assert(engine != nullptr);\n        IExecutionContext* context = engine->createExecutionContext();\n        assert(context != nullptr);\n\n        // Run inference\n        for (int32_t i = 0; i < 100; ++i) {\n            auto _start = std::chrono::system_clock::now();\n            auto prob = doInference(*context, img.data, 1);\n            auto _end = std::chrono::system_clock::now();\n            auto _time = std::chrono::duration_cast<std::chrono::microseconds>(_end - _start).count();\n            std::cout << \"Execution time: \" << _time << \"us\\n\";\n\n            for (const auto& vector : prob) {\n                int idx = 0;\n                for (auto v : vector) {\n                    std::cout << std::setprecision(4) << v << \", \" << std::flush;\n                    if (++idx > 9) {\n                        std::cout << \"\\n====\\n\";\n                        break;\n                    }\n                }\n            }\n\n            if (i == 99) {\n                std::cout << \"prediction result:\\n\";\n                int _top = 0;\n                for (auto& [idx, logits] : topk(prob[0], 3)) {\n                    std::cout << \"Top: \" << _top++ << \" idx: \" << idx << \", logits: \" << logits << \", label: \" << idx\n                              << \"\\n\";\n                }\n            }\n        }\n\n#if TRT_VERSION >= 8000\n        delete context;\n        delete engine;\n        delete runtime;\n#else\n        context->destroy();\n        engine->destroy();\n        runtime->destroy();\n#endif\n\n        return 0;\n    } catch (const std::exception& err) {\n        std::cerr << \"fatal error: \" << err.what() << '\\n';\n        return -1;\n    } catch (...) {\n        std::cerr << \"fatal error: unknown exception\\n\";\n        return -1;\n    }\n}\n"
  },
  {
    "path": "lenet/lenet.py",
    "content": "import argparse\nimport os\nimport struct\nimport sys\n\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nINPUT_H = 32\nINPUT_W = 32\nOUTPUT_SIZE = 10\nINPUT_BLOB_NAME = \"data\"\nOUTPUT_BLOB_NAME = \"prob\"\n\nweight_path = \"./lenet5.wts\"\nengine_path = \"./lenet5.engine\"\n\ngLogger = trt.Logger(trt.Logger.INFO)\n\n\ndef load_weights(file):\n    print(f\"Loading weights: {file}\")\n\n    assert os.path.exists(file), 'Unable to load weight file.'\n\n    weight_map = {}\n    with open(file, \"r\") as f:\n        lines = [line.strip() for line in f]\n    count = int(lines[0])\n    assert count == len(lines) - 1\n    for i in range(1, count + 1):\n        splits = lines[i].split(\" \")\n        name = splits[0]\n        cur_count = int(splits[1])\n        assert cur_count + 2 == len(splits)\n        values = []\n        for j in range(2, len(splits)):\n            # hex string to bytes to float\n            values.append(struct.unpack(\">f\", bytes.fromhex(splits[j])))\n        weight_map[name] = np.array(values, dtype=np.float32)\n\n    return weight_map\n\n\ndef createLenetEngine(maxBatchSize, builder, config, dt):\n    weight_map = load_weights(weight_path)\n    network = builder.create_network()\n\n    data = network.add_input(INPUT_BLOB_NAME, dt, (1, INPUT_H, INPUT_W))\n    assert data\n\n    conv1 = network.add_convolution(input=data,\n                                    num_output_maps=6,\n                                    kernel_shape=(5, 5),\n                                    kernel=weight_map[\"conv1.weight\"],\n                                    bias=weight_map[\"conv1.bias\"])\n    assert conv1\n    conv1.stride = (1, 1)\n\n    relu1 = network.add_activation(conv1.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu1\n\n    pool1 = network.add_pooling(input=relu1.get_output(0),\n                                
window_size=trt.DimsHW(2, 2),\n                                type=trt.PoolingType.AVERAGE)\n    assert pool1\n    pool1.stride = (2, 2)\n\n    conv2 = network.add_convolution(pool1.get_output(0), 16, trt.DimsHW(5, 5),\n                                    weight_map[\"conv2.weight\"],\n                                    weight_map[\"conv2.bias\"])\n    assert conv2\n    conv2.stride = (1, 1)\n\n    relu2 = network.add_activation(conv2.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu2\n\n    pool2 = network.add_pooling(input=relu2.get_output(0),\n                                window_size=trt.DimsHW(2, 2),\n                                type=trt.PoolingType.AVERAGE)\n    assert pool2\n    pool2.stride = (2, 2)\n\n    fc1 = network.add_fully_connected(input=pool2.get_output(0),\n                                      num_outputs=120,\n                                      kernel=weight_map['fc1.weight'],\n                                      bias=weight_map['fc1.bias'])\n    assert fc1\n\n    relu3 = network.add_activation(fc1.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu3\n\n    fc2 = network.add_fully_connected(input=relu3.get_output(0),\n                                      num_outputs=84,\n                                      kernel=weight_map['fc2.weight'],\n                                      bias=weight_map['fc2.bias'])\n    assert fc2\n\n    relu4 = network.add_activation(fc2.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu4\n\n    fc3 = network.add_fully_connected(input=relu4.get_output(0),\n                                      num_outputs=OUTPUT_SIZE,\n                                      kernel=weight_map['fc3.weight'],\n                                      bias=weight_map['fc3.bias'])\n    assert fc3\n\n    prob = network.add_softmax(fc3.get_output(0))\n    assert prob\n\n    
prob.get_output(0).name = OUTPUT_BLOB_NAME\n    network.mark_output(prob.get_output(0))\n\n    # Build engine\n    builder.max_batch_size = maxBatchSize\n    config.max_workspace_size = 1 << 20\n    engine = builder.build_engine(network, config)\n\n    del network\n    del weight_map\n\n    return engine\n\n\ndef APIToModel(maxBatchSize):\n    builder = trt.Builder(gLogger)\n    config = builder.create_builder_config()\n    engine = createLenetEngine(maxBatchSize, builder, config, trt.float32)\n    assert engine\n    with open(engine_path, \"wb\") as f:\n        f.write(engine.serialize())\n\n    del engine\n    del builder\n\n\ndef doInference(context, host_in, host_out, batchSize):\n    engine = context.engine\n    assert engine.num_bindings == 2\n\n    device_in = cuda.mem_alloc(host_in.nbytes)\n    device_out = cuda.mem_alloc(host_out.nbytes)\n    bindings = [int(device_in), int(device_out)]\n    stream = cuda.Stream()\n\n    cuda.memcpy_htod_async(device_in, host_in, stream)\n    context.execute_async(bindings=bindings, stream_handle=stream.handle)\n    cuda.memcpy_dtoh_async(host_out, device_out, stream)\n    stream.synchronize()\n\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"-s\", action='store_true')\n    parser.add_argument(\"-d\", action='store_true')\n    args = parser.parse_args()\n\n    if not (args.s ^ args.d):\n        print(\"arguments not right!\")\n        print(\"python lenet.py -s   # serialize model to plan file\")\n        print(\"python lenet.py -d   # deserialize plan file and run inference\")\n        sys.exit()\n\n    if args.s:\n        APIToModel(1)\n    else:\n        runtime = trt.Runtime(gLogger)\n        assert runtime\n\n        with open(engine_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        assert engine\n\n        context = engine.create_execution_context()\n        assert context\n\n        data = np.ones((INPUT_H * INPUT_W), 
dtype=np.float32)\n        host_in = cuda.pagelocked_empty(INPUT_H * INPUT_W, dtype=np.float32)\n        np.copyto(host_in, data.ravel())\n        host_out = cuda.pagelocked_empty(OUTPUT_SIZE, dtype=np.float32)\n\n        doInference(context, host_in, host_out, 1)\n\n        print(f'Output: {host_out}')\n"
  },
  {
    "path": "lenet/lenet_tripy.py",
    "content": "import argparse\nimport os\nimport struct\n\nimport nvtripy as tp\n\nINPUT_SHAPE = (1, 1, 32, 32)\nWEIGHT_PATH = \"lenet5.wts\"\nCOMPILED_MODEL_PATH = \"lenet5.tpymodel\"\n\n\ndef load_weights(file):\n    if not os.path.exists(file):\n        raise FileNotFoundError(f\"Weight file: {file} does not exist.\")\n\n    with open(file, \"r\") as f:\n        lines = [line.strip() for line in f]\n\n    count = int(lines[0])\n    assert count == len(lines) - 1, \"Mismatch in weight count.\"\n\n    return {\n        splits[0]: tp.Tensor([struct.unpack(\">f\", bytes.fromhex(hex_val))[0] for hex_val in splits[2:]])\n        for splits in (line.split(\" \") for line in lines[1:])\n    }\n\n\nclass Lenet5(tp.Module):\n    def __init__(self):\n        super().__init__()\n\n        self.conv1 = tp.Conv(1, 6, kernel_dims=(5, 5))\n        self.conv2 = tp.Conv(6, 16, kernel_dims=(5, 5))\n        self.fc1 = tp.Linear(16 * 5 * 5, 120)\n        self.fc2 = tp.Linear(120, 84)\n        self.fc3 = tp.Linear(84, 10)\n\n    def forward(self, x):\n        x = tp.relu(self.conv1(x))\n        x = tp.avgpool(x, kernel_dims=(2, 2), stride=(2, 2))\n        x = tp.relu(self.conv2(x))\n        x = tp.avgpool(x, kernel_dims=(2, 2), stride=(2, 2))\n\n        x = tp.flatten(x, 1)\n\n        x = tp.relu(self.fc1(x))\n        x = tp.relu(self.fc2(x))\n        x = tp.softmax(self.fc3(x), dim=1)\n        return x\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n    group = parser.add_mutually_exclusive_group(required=True)\n    group.add_argument(\"-s\", action=\"store_true\", help=\"Save the model\")\n    group.add_argument(\"-d\", action=\"store_true\", help=\"Load a saved model\")\n    args = parser.parse_args()\n\n    if args.s:\n        model = Lenet5()\n\n        weights = load_weights(WEIGHT_PATH)\n        # The weights in the weights file are flattened, so we need to reshape\n        # them to the right shape before we can load them:\n        for name, tensor in 
model.state_dict().items():\n            weights[name] = tp.reshape(weights[name], tensor.shape)\n\n        model.load_state_dict(weights)\n\n        compiled_model = tp.compile(model, args=[tp.InputInfo(INPUT_SHAPE, dtype=tp.float32)])\n\n        compiled_model.save(COMPILED_MODEL_PATH)\n    else:\n        compiled_model = tp.Executable.load(COMPILED_MODEL_PATH)\n\n        data = tp.ones(INPUT_SHAPE, dtype=tp.float32).eval()\n\n        output = compiled_model(data)\n\n        print(f\"Output: {output}\")\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "lenet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <cstdint>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include <utility>\n#include \"NvInferRuntime.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(std::move(prefix)), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) noexcept\n        : mOutput(other.mOutput), mPrefix(std::move(other.mPrefix)), mShouldLog(other.mShouldLog) {}\n\n    ~LogStreamConsumerBuffer() override {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing 
the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    int sync() override {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mBuffer(stream, std::move(prefix), shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other) noexcept\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   private:\n    struct TestInfo;\n\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult : std::uint8_t {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << '\\n';\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! 
This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, TestInfo info)\n            : mStarted(started), mName(std::move(info.name)), mCmdline(std::move(info.cmdline)) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom{false, TestInfo{name, cmdline}};\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! 
\\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    [[nodiscard]] Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    struct TestInfo {\n        std::string name;\n        std::string cmdline;\n    };\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << '\\n';\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kVERBOSE};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINFO};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kWARNING};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kERROR};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINTERNAL_ERROR};\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "lenet/macros.h",
    "content": "#pragma once\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#define TRT_VERSION \\\n    ((NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + (NV_TENSORRT_PATCH * 10) + NV_TENSORRT_BUILD)\n\n#if TRT_VERSION < 7220\n#error \"TensorRT >= 7.2.2 is required for this demo.\"\n#endif\n\n#if TRT_VERSION >= 8000\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n"
  },
  {
    "path": "lenet/utils.h",
    "content": "#pragma once\n#include <cuda_runtime_api.h>\n#include <algorithm>\n#include <cassert>\n#include <cstddef>\n#include <cstdint>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\nenum : std::uint32_t { WORKSPACE_SIZE = 16 << 20 };\n\n#define CHECK(status)                                     \\\n    do {                                                  \\\n        auto ret = (status);                              \\\n        if (ret != cudaSuccess) {                         \\\n            std::cerr << \"Cuda failure: \" << ret << \"\\n\"; \\\n            std::abort();                                 \\\n        }                                                 \\\n    } while (0)\n\nstatic void checkTrtEnv(int device = 0) {\n#if TRT_VERSION < 8000\n    CHECK(cudaGetDevice(&device));\n    cudaDeviceProp prop{};\n    CHECK(cudaGetDeviceProperties(&prop, device));\n    const int sm = prop.major * 10 + prop.minor;\n    if (sm > 86) {\n        std::cerr << \"TensorRT < 8 does not support SM > 86 on this GPU.\";\n        std::abort();\n    }\n#endif\n}\n\n/**\n * @brief TensorRT weight files have a simple space delimited format:\n * [type] [size] <data x size in hex>\n * \n * @param file input weight file path\n * @return std::map<std::string, nvinfer1::Weights> \n */\nstatic auto loadWeights(const std::string& file) {\n    std::cout << \"Loading weights: \" << file << \"\\n\";\n    std::map<std::string, nvinfer1::Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n\n        // Read name and 
type of blob\n        std::string name;\n        input >> name >> std::dec >> wt.count;\n\n        // Load blob\n        auto* val = new uint32_t[wt.count];\n        input >> std::hex;\n        for (auto x = 0ll; x < wt.count; ++x) {\n            input >> val[x];\n        }\n        wt.values = val;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nstatic std::vector<std::pair<int, float>> topk(const std::vector<float>& v, int64_t k) {\n    if (k <= 0)\n        return {};\n    auto s = std::min<std::ptrdiff_t>(k, static_cast<std::ptrdiff_t>(v.size()));\n\n    std::vector<int> idx(v.size());\n    std::iota(idx.begin(), idx.end(), 0);\n\n    std::partial_sort(idx.begin(), std::next(idx.begin(), s), idx.end(), [&](int a, int b) { return v[a] > v[b]; });\n\n    // Only the first s entries of idx are sorted, and s never exceeds v.size(),\n    // so iterate up to s (not k) to avoid reading past the end of idx.\n    std::vector<std::pair<int, float>> out;\n    out.reserve(static_cast<std::size_t>(s));\n    for (std::ptrdiff_t i = 0; i < s; ++i)\n        out.emplace_back(idx[i], v[idx[i]]);\n    return out;\n}\n\nstatic size_t getSize(DataType dt) {\n    switch (dt) {\n#if TRT_VERSION >= 8510\n        case DataType::kUINT8:\n#endif\n        case DataType::kINT8:\n            return sizeof(int8_t);\n        case DataType::kFLOAT:\n            return sizeof(float);\n        case DataType::kHALF:\n            return sizeof(int16_t);\n        case DataType::kINT32:\n            return sizeof(int32_t);\n        default: {\n            std::cerr << \"Unsupported data type\\n\";\n            std::abort();\n        }\n    }\n}\n"
  },
  {
    "path": "lprnet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nproject(\n  lprnet\n  VERSION 0.1\n  LANGUAGES C CXX CUDA)\n\nif(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)\n  set(CMAKE_CUDA_ARCHITECTURES\n      60\n      70\n      72\n      75\n      80\n      86\n      89)\nendif()\n\nset(CMAKE_CXX_STANDARD 17)\nset(CMAKE_CXX_STANDARD_REQUIRED ON)\nset(CMAKE_CUDA_STANDARD 17)\nset(CMAKE_CUDA_STANDARD_REQUIRED ON)\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\nset(CMAKE_INCLUDE_CURRENT_DIR TRUE)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use static cudaruntime library\" OFF)\n\nfind_package(Threads REQUIRED)\nfind_package(CUDAToolkit REQUIRED)\nfind_package(OpenCV)\n\nif(NOT TARGET TensorRT::TensorRT)\n  include(FindTensorRT.cmake)\nelse()\n  message(\"TensorRT has been found, skipping for ${PROJECT_NAME}\")\nendif()\n\nadd_executable(${PROJECT_NAME} ${PROJECT_NAME}.cpp)\ntarget_include_directories(${PROJECT_NAME} PRIVATE ${OpenCV_INCLUDE_DIRS})\ntarget_link_libraries(${PROJECT_NAME} PUBLIC Threads::Threads CUDA::cudart\n                                             TensorRT::TensorRT ${OpenCV_LIBS})\n\nif(WIN32)\n  set_target_properties(\n    ${PROJECT_NAME} PROPERTIES MSVC_RUNTIME_LIBRARY\n                               \"MultiThreaded$<$<CONFIG:Debug>:Debug>\")\nendif()\n\ntarget_compile_options(${PROJECT_NAME} PRIVATE $<$<CXX_COMPILER_ID:MSVC>:/utf-8>)\n"
  },
  {
    "path": "lprnet/FindTensorRT.cmake",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nfunction(_guess_path var_name required_files)\n  set(_result \"\")\n\n  foreach(path_entry IN LISTS ARGN)\n    if(NOT EXISTS \"${path_entry}\")\n      message(DEBUG \"skip non-existing path '${path_entry}'\")\n      continue()\n    endif()\n\n    set(_ok TRUE)\n    foreach(required_file IN LISTS required_files)\n      if(NOT EXISTS \"${path_entry}/${required_file}\")\n        set(_ok FALSE)\n        message(DEBUG \"'${path_entry}' missing '${required_file}'\")\n        break()\n      endif()\n    endforeach()\n\n    if(_ok)\n      list(APPEND _result \"${path_entry}\")\n      message(DEBUG \"accept '${path_entry}'\")\n    else()\n      message(DEBUG \"reject '${path_entry}'\")\n    endif()\n  endforeach()\n\n  if(_result STREQUAL \"\")\n    message(\n      FATAL_ERROR\n        \"_guess_path(${var_name}) failed: no valid path found. required_files='${required_files}' candidates='${ARGN}'\"\n    )\n  endif()\n\n  set(${var_name}\n      \"${_result}\"\n      PARENT_SCOPE)\nendfunction()\n\n# add library\nadd_library(TensorRT IMPORTED INTERFACE)\nadd_library(TensorRT::TensorRT ALIAS TensorRT)\n\nset(TRT_VERSION\n    CACHE\n      STRING\n      \"TensorRT version, e.g. 
\\\"8.6.1.6\\\" or \\\"8.6.1.6+cuda12.0.1.011\\\", \\\"8.6.1.6.Windows10.x86_64.cuda-12.0\\\" etc\"\n)\n\nif(NOT TRT_VERSION STREQUAL \"\" AND NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  message(\n    WARNING\n      \"TRT_VERSION defined by cmake and environment variable both, using the later one\"\n  )\nendif()\n\nif(NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  set(TRT_VERSION $ENV{TRT_VERSION})\nendif()\n\nstring(REGEX MATCH \"([0-9]+)\" _match ${TRT_VERSION})\nset(TRT_MAJOR_VERSION \"${_match}\")\nunset(_match)\n\nif(WIN32)\n  set(TensorRT_DIR \"C:/Program Files/TensorRT-${TRT_VERSION}\")\n  if(NOT EXISTS \"${TensorRT_DIR}\")\n    message(\n      FATAL_ERROR\n        \"TensorRT_DIR=${TensorRT_DIR} does not exist!\"\n    )\n  endif()\n\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 10)\n    set(_modules nvinfer_10 nvinfer_plugin_10 nvinfer_vc_plugin_10\n                 nvinfer_dispatch_10 nvinfer_lean_10)\n    message(DEBUG \"Using ${_modules}\")\n  else()\n    set(_modules nvinfer nvinfer_plugin nvinfer_vc_plugin nvinfer_dispatch\n                 nvinfer_lean)\n  endif()\n\n  set(TensorRT_LIBRARY_DIR \"${TensorRT_DIR}/lib\")\n  set(TensorRT_INCLUDE_DIR \"${TensorRT_DIR}/include\")\nelseif(UNIX)\n  string(TOLOWER \"${CMAKE_SYSTEM_PROCESSOR}\" _trt_arch)\n  set(_trt_include_candidates)\n  if(_trt_arch MATCHES \"^(aarch64|arm64|arch64)$\")\n    set(_trt_include_candidates \"/usr/include/aarch64-linux-gnu\" \"/usr/include\"\n                                \"/usr/local/cuda/targets/aarch64-linux/include\")\n    set(_trt_library_candidates\n        \"/usr/local/tensorrt/targets/aarch64-linux-gnu/lib\"\n        \"/usr/lib/aarch64-linux-gnu\" \"/usr/lib/aarch64-linux-gnu/tegra\"\n        \"/usr/lib\")\n  elseif(_trt_arch MATCHES \"^(x86_64|amd64)$\")\n    set(_trt_include_candidates\n        \"/usr/local/tensorrt/targets/x86_64-linux-gnu/include\"\n        \"/usr/include/x86_64-linux-gnu\" \"/usr/include\")\n    set(_trt_library_candidates\n        
\"/usr/local/tensorrt/targets/x86_64-linux-gnu/lib\"\n        \"/usr/lib/x86_64-linux-gnu\" \"/usr/lib\")\n  else()\n    message(FATAL_ERROR \"Unknown architecture\")\n  endif()\n\n  set(_modules nvinfer nvinfer_plugin)\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 8)\n    list(APPEND _modules nvinfer_vc_plugin nvinfer_dispatch nvinfer_lean)\n  endif()\n\n  _guess_path(TensorRT_LIBRARY_DIR \"libnvinfer.so;libnvinfer_plugin.so\"\n              ${_trt_library_candidates})\n  message(STATUS \"TensorRT libraries: ${TensorRT_LIBRARY_DIR}\")\n  _guess_path(TensorRT_INCLUDE_DIR \"NvInfer.h\" ${_trt_include_candidates})\n  message(STATUS \"TensorRT includes: ${TensorRT_INCLUDE_DIR}\")\nendif()\n\nforeach(lib IN LISTS _modules)\n  find_library(\n    TensorRT_${lib}_LIBRARY\n    NAMES ${lib}\n    HINTS ${TensorRT_LIBRARY_DIR})\n  list(APPEND TensorRT_LIBRARIES ${TensorRT_${lib}_LIBRARY})\nendforeach()\n\ntarget_link_libraries(TensorRT INTERFACE ${TensorRT_LIBRARIES})\n\nmessage(STATUS \"Found TensorRT libs: ${TensorRT_LIBRARIES}\")\n\nset_target_properties(\n  TensorRT\n  PROPERTIES C_STANDARD 17\n             CXX_STANDARD 17\n             POSITION_INDEPENDENT_CODE ON\n             SKIP_BUILD_RPATH TRUE\n             BUILD_WITH_INSTALL_RPATH TRUE\n             INSTALL_RPATH \"$ORIGIN\"\n             INTERFACE_INCLUDE_DIRECTORIES \"${TensorRT_INCLUDE_DIR}\")\n\nunset(TRT_MAJOR_VERSION)\nunset(_modules)\nunset(_trt_include_candidates)\nunset(_trt_library_candidates)\nunset(_trt_arch)\n"
  },
  {
    "path": "lprnet/README.md",
    "content": "# LPRNet\n\nThe Pytorch implementation is [xuexingyu24/License_Plate_Detection_Pytorch](https://github.com/xuexingyu24/License_Plate_Detection_Pytorch).\n\n## Usage\n\n1. download model from [HERE](https://github.com/xuexingyu24/License_Plate_Detection_Pytorch/blob/master/LPRNet/weights/Final_LPRNet_model.pth) and put it into `models` folder\n\n2. use `genwts.py` to generate wts file\n\n```bash\npython3 genwts.py\n```\n\n3. build C++ code\n\n```bash\npushd tensorrtx/lprnet\ncmake -S . -B build -G Ninja --fresh\ncmake --build build\n```\n\n4. serialize wts model to engine file\n\n```bash\n./build/LPRnet -s\n```\n\nnow you may see `LPRNet.engine` under `models`\n\n5. run inference\n\nsample code use the image under assets by default:\n\n![sample](../assets/car_plate.jpg)\n\n```bash\n./build/LPRnet -d\n```\n\noutput looks like:\n\n```bash\n...\nExecution time: 205us\n-65.58, -28.74, -52.1, -70.79, -53.36, -57.58, -70.97, -60.66, -48.18, -57.38, -54.07, -58.56, -49.04, -52.39, -51.94, -53.4, -49.04, -45.89, -49.42, -7.863, -42.12,\n====\nExecution time: 202us\n-65.58, -28.74, -52.1, -70.79, -53.36, -57.58, -70.97, -60.66, -48.18, -57.38, -54.07, -58.56, -49.04, -52.39, -51.94, -53.4, -49.04, -45.89, -49.42, -7.863, -42.12,\n====\nresult: 沪BKB770\n```\n\n## Note\n\nif you are running this demo on windows, you may need to check the code page, e.g., for Windows PowerShell, run:\n\n```ps1\nchcp\n```\n\nif the output is not **65001**, then use\n\n```ps1\nchcp 65001\n```\n\nto set the code page to utf-8, so you can get the correct literal result.\n"
  },
  {
    "path": "lprnet/gen_wts.py",
    "content": "\"\"\"\nmodel codes are borrowed from:\n`https://github.com/xuexingyu24/License_Plate_Detection_Pytorch/blob/master/LPRNet/model/LPRNET.py`\n\ncheck `.pth` model here:\n`https://github.com/xuexingyu24/License_Plate_Detection_Pytorch/blob/master/LPRNet/weights/Final_LPRNet_model.pth`\n\n\"\"\"\n\nimport struct\n\nimport cv2\nimport numpy as np\nimport torch\nimport torch.nn as nn\n\nCHARS = \"京沪津渝冀晋蒙辽吉黑苏浙皖闽赣鲁豫鄂湘粤桂琼川贵云藏陕甘青宁新0123456789ABCDEFGHJKLMNPQRSTUVWXYZIO-\"\n\n\ndef preprocess(path):\n    image = cv2.imread(path, cv2.IMREAD_COLOR)\n    image = cv2.resize(image, (94, 24), interpolation=cv2.INTER_CUBIC)\n    image = image.astype(np.float32)\n    image = image / 255.0 - 0.5  # still HxWx3, BGR\n    image = image.transpose(2, 0, 1)[None, ...]\n    image = torch.from_numpy(image)\n    return image\n\n\nclass small_basic_block(nn.Module):\n    def __init__(self, ch_in, ch_out):\n        super(small_basic_block, self).__init__()\n        self.block = nn.Sequential(\n            nn.Conv2d(ch_in, ch_out // 4, kernel_size=1),\n            nn.ReLU(),\n            nn.Conv2d(ch_out // 4, ch_out // 4, kernel_size=(3, 1), padding=(1, 0)),\n            nn.ReLU(),\n            nn.Conv2d(ch_out // 4, ch_out // 4, kernel_size=(1, 3), padding=(0, 1)),\n            nn.ReLU(),\n            nn.Conv2d(ch_out // 4, ch_out, kernel_size=1),\n        )\n\n    def forward(self, x):\n        return self.block(x)\n\n\nclass LPRNet(nn.Module):\n    def __init__(self, class_num, dropout_rate):\n        super(LPRNet, self).__init__()\n        self.class_num = class_num\n        self.backbone = nn.Sequential(\n            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1),  # 0\n            nn.BatchNorm2d(num_features=64),\n            nn.ReLU(),  # 2\n            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 1, 1)),\n            small_basic_block(ch_in=64, ch_out=128),  # 4\n            nn.BatchNorm2d(num_features=128),\n            nn.ReLU(),  # 6\n            
nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(2, 1, 2)),\n            small_basic_block(ch_in=64, ch_out=256),  # 8\n            nn.BatchNorm2d(num_features=256),\n            nn.ReLU(),  # 10\n            small_basic_block(ch_in=256, ch_out=256),  # 11\n            nn.BatchNorm2d(num_features=256),  # 12\n            nn.ReLU(),  # 13\n            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(4, 1, 2)),  # 14\n            nn.Dropout(dropout_rate),\n            nn.Conv2d(in_channels=64, out_channels=256, kernel_size=(1, 4), stride=1),  # 16\n            nn.BatchNorm2d(num_features=256),\n            nn.ReLU(),  # 18\n            nn.Dropout(dropout_rate),\n            nn.Conv2d(in_channels=256, out_channels=class_num, kernel_size=(13, 1), stride=1),  # 20\n            nn.BatchNorm2d(num_features=class_num),\n            nn.ReLU(),  # 22\n        )\n        self.container = nn.Sequential(\n            nn.Conv2d(\n                in_channels=256 + class_num + 128 + 64, out_channels=self.class_num, kernel_size=(1, 1), stride=(1, 1)\n            )\n        )\n\n    def forward(self, x):\n        keep_features = list()\n        for i, layer in enumerate(self.backbone.children()):\n            x = layer(x)\n            if i in [2, 6, 13, 22]:  # [2, 4, 8, 11, 22]\n                print(self.backbone[i])\n                keep_features.append(x)\n\n        global_context = list()\n        for i, f in enumerate(keep_features):\n            if i in [0, 1]:\n                f = nn.AvgPool2d(kernel_size=5, stride=5)(f)\n            if i in [2]:\n                f = nn.AvgPool2d(kernel_size=(4, 10), stride=(4, 2))(f)\n            f_pow = torch.pow(f, 2)\n            f_mean = torch.mean(f_pow)\n            f = torch.div(f, f_mean)\n            global_context.append(f)\n\n        x = torch.cat(global_context, 1)\n        x = self.container(x)\n        logits = torch.mean(x, dim=2)\n\n        return logits\n\n\nif __name__ == \"__main__\":\n    model_path = 
\"../models/Final_LPRNet_model.pth\"\n\n    model = LPRNet(class_num=len(CHARS), dropout_rate=0)\n    print(\"loading pretrained model from %s\" % model_path)\n    device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")\n    model.load_state_dict(torch.load(model_path, map_location=device))\n\n    img = preprocess(\"../assets/car_plate.jpg\")\n    model.eval()\n    print(model)\n    with torch.inference_mode():\n        preds = model(img)\n        res = \"\".join(CHARS[i] for i in torch.argmax(preds[0], dim=0).tolist())\n        res = res.replace(\"-\", \"\")\n\n    with open(\"../models/LPRNet.wts\", \"w\") as f:\n        f.write(\"{}\\n\".format(len(model.state_dict().keys())))\n        for k, v in model.state_dict().items():\n            print(\"key: \", k)\n            print(\"value: \", v.shape)\n            vr = v.reshape(-1).cpu().numpy()\n            f.write(\"{} {}\".format(k, len(vr)))\n            for vv in vr:\n                f.write(\" \")\n                f.write(struct.pack(\">f\", float(vv)).hex())\n            f.write(\"\\n\")\n\n    print(f\"inference result: {res}\")\n"
  },
  {
    "path": "lprnet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <cstdint>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include <utility>\n#include \"NvInferRuntime.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(std::move(prefix)), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) noexcept\n        : mOutput(other.mOutput), mPrefix(std::move(other.mPrefix)), mShouldLog(other.mShouldLog) {}\n\n    ~LogStreamConsumerBuffer() override {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing 
the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    int sync() override {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mBuffer(stream, std::move(prefix), shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other) noexcept\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   private:\n    struct TestInfo;\n\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult : std::uint8_t {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << '\\n';\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! 
This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, TestInfo info)\n            : mStarted(started), mName(std::move(info.name)), mCmdline(std::move(info.cmdline)) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom{false, TestInfo{name, cmdline}};\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! 
\\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    [[nodiscard]] Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    struct TestInfo {\n        std::string name;\n        std::string cmdline;\n    };\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << '\\n';\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kVERBOSE};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINFO};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kWARNING};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kERROR};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//!        (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINTERNAL_ERROR};\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "lprnet/lprnet.cpp",
    "content": "#include <NvInfer.h>\n#include <algorithm>\n#include <array>\n#include <chrono>\n#include <cstdint>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"logging.h\"\n#include \"utils.h\"\n#ifdef _WIN32\n#define NOMINMAX\n#include <Windows.h>\n#endif\n\nusing namespace nvinfer1;\n\nusing WeightMap = std::map<std::string, Weights>;\nusing NDCF = nvinfer1::NetworkDefinitionCreationFlag;\n\nstatic Logger gLogger;\n\nstatic constexpr const std::size_t WORKSPACE_SIZE = 16 << 20;\nstatic constexpr const int32_t DEVICE = 0;\nstatic constexpr const int32_t BATCH_SIZE = 1;\nstatic constexpr const char* WTS_PATH = \"../models/LPRNet.wts\";\nstatic constexpr const char* ENGINE_PATH = \"../models/LPRNet.engine\";\n// stuff we know about the network and the input/output blobs\nstatic constexpr const int32_t INPUT_H = 24;\nstatic constexpr const int32_t INPUT_W = 94;\nstatic constexpr const std::array<const char*, 2> NAMES = {\"data\", \"prob\"};\nstatic constexpr const std::array<int32_t, 2> SIZES = {3 * INPUT_H * INPUT_W, 18 * 68};\nstatic constexpr const bool TRT_PREPROCESS = TRT_VERSION >= 8510 ? 
true : false;\nstatic constexpr const std::array<const float, 3> mean = {0.5f, 0.5f, 0.5f};\nstatic constexpr const std::array<const float, 3> stdv = {1.f, 1.f, 1.f};\n\nconst std::array<const std::string, 68> alphabet = {\n        \"京\", \"沪\", \"津\", \"渝\", \"冀\", \"晋\", \"蒙\", \"辽\", \"吉\", \"黑\", \"苏\", \"浙\", \"皖\", \"闽\", \"赣\", \"鲁\", \"豫\",\n        \"鄂\", \"湘\", \"粤\", \"桂\", \"琼\", \"川\", \"贵\", \"云\", \"藏\", \"陕\", \"甘\", \"青\", \"宁\", \"新\", \"0\",  \"1\",  \"2\",\n        \"3\",  \"4\",  \"5\",  \"6\",  \"7\",  \"8\",  \"9\",  \"A\",  \"B\",  \"C\",  \"D\",  \"E\",  \"F\",  \"G\",  \"H\",  \"J\",  \"K\",\n        \"L\",  \"M\",  \"N\",  \"P\",  \"Q\",  \"R\",  \"S\",  \"T\",  \"U\",  \"V\",  \"W\",  \"X\",  \"Y\",  \"Z\",  \"I\",  \"O\",  \"-\"};\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition* network, WeightMap& weightMap, ITensor& input, const std::string& lname,\n                            float eps = 1e-5) {\n    const float* gamma = reinterpret_cast<const float*>(weightMap[lname + \".weight\"].values);\n    const float* beta = reinterpret_cast<const float*>(weightMap[lname + \".bias\"].values);\n    const float* mean = reinterpret_cast<const float*>(weightMap[lname + \".running_mean\"].values);\n    const float* var = reinterpret_cast<const float*>(weightMap[lname + \".running_var\"].values);\n    int64_t len = weightMap[lname + \".running_var\"].count;\n\n    auto* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n\n    auto* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    auto* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0f;\n  
  }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    scale_1->setName(lname.c_str());\n    return scale_1;\n}\n\nIConvolutionLayer* smallBasicBlock(INetworkDefinition* network, WeightMap& w, ITensor& input, int ch_out,\n                                   const std::string& lname) {\n    int o = ch_out / 4, i = 0;\n    ITensor* cur_input = &input;\n    IConvolutionLayer* ret{nullptr};\n    struct ConvParams {\n        DimsHW k_dim, p_dim;\n        int ch_out;\n        std::string w_name, b_name;\n    };\n    const std::array<ConvParams, 4> conv_params = {{\n            {DimsHW{1, 1}, DimsHW{0, 0}, o, lname + \".block.0.weight\", lname + \".block.0.bias\"},\n            {DimsHW{3, 1}, DimsHW{1, 0}, o, lname + \".block.2.weight\", lname + \".block.2.bias\"},\n            {DimsHW{1, 3}, DimsHW{0, 1}, o, lname + \".block.4.weight\", lname + \".block.4.bias\"},\n            {DimsHW{1, 1}, DimsHW{0, 0}, ch_out, lname + \".block.6.weight\", lname + \".block.6.bias\"},\n    }};\n    for (const auto& param : conv_params) {\n        ret = network->addConvolutionNd(*cur_input, param.ch_out, param.k_dim, w[param.w_name], w[param.b_name]);\n        assert(ret);\n        ret->setPaddingNd(param.p_dim);\n        ret->setName((lname + \".block.\" + std::to_string(i++)).c_str());\n        if (i != 4) {\n            auto* relu = network->addActivation(*ret->getOutput(0), ActivationType::kRELU);\n            assert(relu);\n            cur_input = relu->getOutput(0);\n        } else {\n            cur_input = ret->getOutput(0);\n        }\n    }\n    return ret;\n}\n\nICudaEngine* createEngine(int32_t N, IRuntime* runtime, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    const int nc = 68;\n    WeightMap w = 
loadWeights(WTS_PATH);\n\n#if TRT_VERSION >= 11200\n    auto flag = 1U << static_cast<int>(NDCF::kSTRONGLY_TYPED);\n#elif TRT_VERSION >= 10000\n    auto flag = 0U;\n#else\n    auto flag = 1U << static_cast<int>(NDCF::kEXPLICIT_BATCH);\n#endif\n    auto* network = builder->createNetworkV2(flag);\n\n    ITensor* data{nullptr};\n    if constexpr (TRT_PREPROCESS) {\n        // for simplicity, resize image on cpu side\n        dt = DataType::kUINT8;\n        auto* input = network->addInput(NAMES[0], dt, Dims4{N, INPUT_H, INPUT_W, 3});\n        auto* trans = addTransformLayer(network, *input, false, mean, stdv);\n        data = trans->getOutput(0);\n    } else {\n        data = network->addInput(NAMES[0], dt, Dims4{N, 3, INPUT_H, INPUT_W});\n    }\n    assert(data);\n\n    // CBR (Conv-BatchNorm-ReLU)\n    auto* c0 = network->addConvolutionNd(*data, 64, DimsHW{3, 3}, w[\"backbone.0.weight\"], w[\"backbone.0.bias\"]);\n    auto* bn0 = addBatchNorm2d(network, w, *c0->getOutput(0), \"backbone.1\");\n    auto* relu0 = network->addActivation(*bn0->getOutput(0), ActivationType::kRELU);\n\n    auto* f0 = network->addPoolingNd(*relu0->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    f0->setStrideNd(DimsHW{1, 1});\n    assert(c0 && bn0 && relu0);\n\n    auto* sm0 = smallBasicBlock(network, w, *f0->getOutput(0), 128, \"backbone.4\");\n    auto* bn1 = addBatchNorm2d(network, w, *sm0->getOutput(0), \"backbone.5\");\n    auto* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(sm0 && bn1 && relu1);\n\n    // need to unsqueeze to 5D tensor for 3D pooling\n    auto* to5d0 = network->addShuffle(*relu1->getOutput(0));\n    to5d0->setReshapeDimensions({5, {BATCH_SIZE, 1, 128, 20, 90}});\n    auto* f1 = network->addPoolingNd(*to5d0->getOutput(0), PoolingType::kMAX, Dims3{1, 3, 3});\n    f1->setStrideNd(Dims3{2, 1, 2});\n    f1->setName(\"MaxPool3d_1\");\n    auto* to5d1 = network->addShuffle(*f1->getOutput(0));\n    
to5d1->setReshapeDimensions(Dims4{BATCH_SIZE, 64, 18, 44});\n\n    auto* sm1 = smallBasicBlock(network, w, *to5d1->getOutput(0), 256, \"backbone.8\");\n    auto* bn2 = addBatchNorm2d(network, w, *sm1->getOutput(0), \"backbone.9\");\n    auto* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    auto* sm2 = smallBasicBlock(network, w, *relu2->getOutput(0), 256, \"backbone.11\");\n    auto* bn3 = addBatchNorm2d(network, w, *sm2->getOutput(0), \"backbone.12\");\n    auto* relu3 = network->addActivation(*bn3->getOutput(0), ActivationType::kRELU);\n\n    // need to unsqueeze to 5D tensor for 3D pooling\n    auto* to5d2 = network->addShuffle(*relu3->getOutput(0));\n    to5d2->setReshapeDimensions({5, {BATCH_SIZE, 1, 256, 18, 44}});\n    auto* f2 = network->addPoolingNd(*to5d2->getOutput(0), PoolingType::kMAX, Dims3{1, 3, 3});\n    f2->setStrideNd(Dims3{4, 1, 2});\n    f2->setName(\"MaxPool3d_2\");\n    auto* to5d3 = network->addShuffle(*f2->getOutput(0));\n    to5d3->setReshapeDimensions(Dims4{BATCH_SIZE, 64, 16, 21});\n\n    // CBR (Conv-BatchNorm-ReLU)\n    c0 = network->addConvolutionNd(*to5d3->getOutput(0), 256, DimsHW{1, 4}, w[\"backbone.16.weight\"],\n                                   w[\"backbone.16.bias\"]);\n    auto* bn4 = addBatchNorm2d(network, w, *c0->getOutput(0), \"backbone.17\");\n    auto* relu5 = network->addActivation(*bn4->getOutput(0), ActivationType::kRELU);\n\n    // CBR (Conv-BatchNorm-ReLU)\n    c0 = network->addConvolutionNd(*relu5->getOutput(0), nc, DimsHW{13, 1}, w[\"backbone.20.weight\"],\n                                   w[\"backbone.20.bias\"]);\n    auto* bn5 = addBatchNorm2d(network, w, *c0->getOutput(0), \"backbone.21\");\n    auto* backbone = network->addActivation(*bn5->getOutput(0), ActivationType::kRELU);\n\n    auto makeGlobalContext = [&](ITensor* feat, bool pool5, bool pool4x10) -> ITensor* {\n        static int j = 0;\n        ITensor* t = feat;\n        if (pool5) {\n            auto* pool = 
network->addPoolingNd(*t, PoolingType::kAVERAGE, DimsHW{5, 5});\n            assert(pool);\n            pool->setStrideNd(DimsHW{5, 5});\n            auto _name = \"global5.\" + std::to_string(j);\n            pool->setName(_name.c_str());\n            t = pool->getOutput(0);\n        }\n        if (pool4x10) {\n            auto* pool = network->addPoolingNd(*t, PoolingType::kAVERAGE, DimsHW{4, 10});\n            assert(pool);\n            pool->setStrideNd(DimsHW{4, 2});\n            auto _name = \"global4x10.\" + std::to_string(j);\n            pool->setName(_name.c_str());\n            t = pool->getOutput(0);\n        }\n\n        // pow\n        Dims dims = t->getDimensions();\n        int64_t size = dims.d[0] * dims.d[1] * dims.d[2] * dims.d[3];\n        void* data = malloc(sizeof(float) * size);\n        for (int i = 0; i < size; ++i) {\n            reinterpret_cast<float*>(data)[i] = 2.0f;\n        }\n        auto name = \"pow.\" + std::to_string(j);\n        w[name] = {DataType::kFLOAT, data, size};\n        auto* pow_const = network->addConstant(dims, w[name]);\n        auto* pow = network->addElementWise(*t, *pow_const->getOutput(0), ElementWiseOperation::kPOW);\n        assert(pow);\n        pow->setName(name.c_str());\n\n        // mean\n        int32_t mask = (1 << dims.nbDims) - 1;\n        auto* mean = network->addReduce(*pow->getOutput(0), ReduceOperation::kAVG, mask, true);\n        auto _mean_name = \"mean.\" + std::to_string(j);\n        mean->setName(_mean_name.c_str());\n\n        // div\n        auto* div = network->addElementWise(*t, *mean->getOutput(0), ElementWiseOperation::kDIV);\n        auto _div_name = \"div.\" + std::to_string(j);\n        div->setName(_div_name.c_str());\n        ++j;\n        return div->getOutput(0);\n    };\n\n    auto* gc0 = makeGlobalContext(relu0->getOutput(0), true, false);\n    auto* gc1 = makeGlobalContext(relu1->getOutput(0), true, false);\n    auto* gc2 = makeGlobalContext(relu3->getOutput(0), false, 
true);\n    auto* gc3 = makeGlobalContext(backbone->getOutput(0), false, false);\n    const std::array<ITensor*, 4> gcs = {gc0, gc1, gc2, gc3};\n    auto* cat = network->addConcatenation(gcs.data(), 4);\n    assert(cat);\n    cat->setAxis(1);\n\n    auto* c = network->addConvolutionNd(*cat->getOutput(0), nc, DimsHW{1, 1}, w[\"container.0.weight\"],\n                                        w[\"container.0.bias\"]);\n    auto* logits = network->addReduce(*c->getOutput(0), ReduceOperation::kAVG, 0x04, false);\n    logits->getOutput(0)->setName(NAMES[1]);\n\n    network->markOutput(*logits->getOutput(0));\n\n#if TRT_VERSION >= 8000\n    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, WORKSPACE_SIZE);\n    IHostMemory* mem = builder->buildSerializedNetwork(*network, *config);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(mem->data(), mem->size());\n    delete network;\n#else\n    builder->setMaxBatchSize(N);\n    config->setMaxWorkspaceSize(WORKSPACE_SIZE);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    network->destroy();\n#endif\n\n    std::cout << \"build finished\\n\";\n    // Release host memory\n    for (auto& mem : w) {\n        free((void*)mem.second.values);\n    }\n\n    return engine;\n}\n\nvoid APIToModel(int32_t N, IRuntime* runtime, IHostMemory** modelStream) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    ICudaEngine* engine = createEngine(N, runtime, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    (*modelStream) = engine->serialize();\n\n#if TRT_VERSION >= 8000\n    delete engine;\n    delete config;\n    delete builder;\n#else\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n#endif\n}\n\nauto doInference(IExecutionContext& context, void* input, int64_t batchSize) -> std::vector<std::vector<float>> {\n    const auto& engine = context.getEngine();\n    cudaStream_t stream;\n 
   CHECK(cudaStreamCreate(&stream));\n    std::vector<void*> buffers;\n\n#if TRT_VERSION >= 8000\n    const int32_t nIO = engine.getNbIOTensors();\n#else\n    const int32_t nIO = engine.getNbBindings();\n#endif\n\n    buffers.resize(nIO);\n    for (auto i = 0; i < nIO; ++i) {\n        std::size_t size = 0;\n#if TRT_VERSION >= 8000\n        auto* tensor_name = engine.getIOTensorName(i);\n        auto s = getSize(engine.getTensorDataType(tensor_name));\n        size = s * batchSize * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n        context.setTensorAddress(tensor_name, buffers[i]);\n#else\n        const int32_t idx = engine.getBindingIndex(NAMES[i]);\n        auto s = getSize(engine.getBindingDataType(idx));\n        assert(idx == i);\n        size = s * batchSize * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n#endif\n    }\n\n    // Keep the enqueue call outside assert() so inference still runs when NDEBUG is defined\n#if TRT_VERSION >= 8000\n    const bool enqueued = context.enqueueV3(stream);\n#else\n    const bool enqueued = context.enqueueV2(buffers.data(), stream, nullptr);\n#endif\n    assert(enqueued);\n    (void)enqueued;\n\n    std::vector<std::vector<float>> prob;\n    for (int i = 1; i < nIO; ++i) {\n        std::vector<float> tmp(batchSize * SIZES[i], std::nanf(\"\"));\n        std::size_t size = batchSize * SIZES[i] * sizeof(float);\n        CHECK(cudaMemcpyAsync(tmp.data(), buffers[i], size, cudaMemcpyDeviceToHost, stream));\n        prob.emplace_back(tmp);\n    }\n    CHECK(cudaStreamSynchronize(stream));\n\n    for (auto& buffer : buffers) {\n        CHECK(cudaFree(buffer));\n    }\n    CHECK(cudaStreamDestroy(stream));\n    return prob;\n}\n\nint main(int argc, char** argv) {\n#ifdef _WIN32\n    SetConsoleOutputCP(CP_UTF8);\n#endif\n    CHECK(cudaSetDevice(DEVICE));\n    checkTrtEnv(DEVICE);\n    if (argc != 2) {\n        std::cerr << 
\"arguments not right!\\n\";\n        std::cerr << \"./LPRnet -s  // serialize model to plan file\\n\";\n        std::cerr << \"./LPRnet -d  // deserialize plan file and run inference\\n\";\n        return -1;\n    }\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n\n    char* trtModelStream{nullptr};\n    std::streamsize size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(BATCH_SIZE, runtime, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(ENGINE_PATH, std::ios::binary | std::ios::trunc);\n        if (!p) {\n            std::cerr << \"could not open plan output file\\n\";\n            return -1;\n        }\n        if (modelStream->size() > static_cast<std::size_t>(std::numeric_limits<std::streamsize>::max())) {\n            std::cerr << \"this model is too large to serialize\\n\";\n            return -1;\n        }\n        const auto* data_ptr = reinterpret_cast<const char*>(modelStream->data());\n        auto data_size = static_cast<std::streamsize>(modelStream->size());\n        p.write(data_ptr, data_size);\n#if TRT_VERSION >= 8000\n        delete modelStream;\n#else\n        modelStream->destroy();\n#endif\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(ENGINE_PATH, std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        return 1;\n    }\n\n    void* input = nullptr;\n    std::vector<float> data;\n    cv::Mat img = cv::imread(\"../assets/car_plate.jpg\");\n    if constexpr (TRT_PREPROCESS) {\n        // for simplicity, resize image on cpu side\n        cv::resize(img, img, 
cv::Size(INPUT_W, INPUT_H), 0, 0, cv::INTER_CUBIC);\n        input = static_cast<void*>(img.data);\n    } else {\n        data = preprocess_img(img, false, mean, stdv, BATCH_SIZE, INPUT_H, INPUT_W);\n        input = data.data();\n    }\n\n#if TRT_VERSION >= 8000\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n#else\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n#endif\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n\n    for (int32_t i = 0; i < 100; ++i) {\n        auto _start = std::chrono::system_clock::now();\n        auto prob = doInference(*context, input, 1);\n        auto _end = std::chrono::system_clock::now();\n        auto _time = std::chrono::duration_cast<std::chrono::microseconds>(_end - _start).count();\n        std::cout << \"Execution time: \" << _time << \"us\\n\";\n\n        for (const auto& vector : prob) {\n            int idx = 0;\n            for (auto v : vector) {\n                std::cout << std::setprecision(4) << v << \", \" << std::flush;\n                if (++idx > 20) {\n                    std::cout << \"\\n====\\n\";\n                    break;\n                }\n            }\n        }\n\n        if (i == 99) {\n            int prev = 67;\n            std::string str;\n            for (int t = 0; t < 18; ++t) {\n                std::array<float, 68> scores{};\n                for (int c = 0; c < 68; ++c) {\n                    scores[c] = prob[0][t + 18 * c];\n                }\n                int best =\n                        static_cast<int>(std::distance(scores.begin(), std::max_element(scores.begin(), scores.end())));\n                if (best != prev && best != 67)\n                    str += alphabet[best];\n                prev = best;\n            }\n            std::cout << \"result: \" << str << \"\\n\";\n        }\n    }\n\n    delete[] 
trtModelStream;\n#if TRT_VERSION >= 8000\n    delete context;\n    delete engine;\n    delete runtime;\n#else\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n#endif\n\n    return 0;\n}\n"
  },
  {
    "path": "lprnet/macros.h",
    "content": "#pragma once\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#define TRT_VERSION \\\n    ((NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + (NV_TENSORRT_PATCH * 10) + NV_TENSORRT_BUILD)\n\n#if TRT_VERSION < 7220\n#error \"TensorRT >= 7.2.2 is required for this demo.\"\n#endif\n\n#if TRT_VERSION >= 8000\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n"
  },
  {
    "path": "lprnet/utils.h",
    "content": "#pragma once\n#include <cuda_runtime_api.h>\n#include <algorithm>\n#include <cassert>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\n#define CHECK(status)                                     \\\n    do {                                                  \\\n        auto ret = (status);                              \\\n        if (ret != cudaSuccess) {                         \\\n            std::cerr << \"Cuda failure: \" << ret << \"\\n\"; \\\n            std::abort();                                 \\\n        }                                                 \\\n    } while (0)\n\nstatic inline void checkTrtEnv(int device = 0) {\n#if TRT_VERSION < 8000\n    CHECK(cudaGetDevice(&device));\n    cudaDeviceProp prop{};\n    CHECK(cudaGetDeviceProperties(&prop, device));\n    const int sm = prop.major * 10 + prop.minor;\n    if (sm > 86) {\n        std::cerr << \"TensorRT < 8 does not support SM > 86 on this GPU.\";\n        std::abort();\n    }\n#endif\n}\n\n/**\n * @brief TensorRT weight files have a simple space delimited format:\n * [type] [size] <data x size in hex>\n * \n * @param file input weight file path\n * @return std::map<std::string, nvinfer1::Weights> \n */\nstatic inline auto loadWeights(const std::string& file) {\n    std::cout << \"Loading weights: \" << file << \"\\n\";\n    std::map<std::string, nvinfer1::Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> 
wt.count;\n\n        // Load blob; allocated with malloc() because createEngine() releases weight buffers with free()\n        auto* val = static_cast<uint32_t*>(malloc(sizeof(uint32_t) * wt.count));\n        assert(val && \"Unable to allocate weight buffer.\");\n        input >> std::hex;\n        for (auto x = 0ll; x < wt.count; ++x) {\n            input >> val[x];\n        }\n        wt.values = val;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\n/**\n * @brief Preprocess an image the way torchvision does for ImageNet; only 3-channel images are supported\n * \n * @param img opencv image with BGR layout\n * @param bgr2rgb whether to convert BGR to RGB\n * @param mean per-channel mean to subtract\n * @param std per-channel standard deviation to divide by\n * @param n batch size\n * @param h resize height\n * @param w resize width\n * @return std::vector<float> contiguous, flattened NCHW image data in float32 type\n */\nstatic inline std::vector<float> preprocess_img(cv::Mat& img, bool bgr2rgb, const std::array<const float, 3>& mean,\n                                                const std::array<const float, 3>& std, int n, int h, int w) {\n    const auto c = img.channels();\n    const auto size = c * h * w;\n    if (c != 3) {\n        std::cerr << \"this demo only supports 3 channel input image.\\n\";\n        std::abort();\n    }\n    if (bgr2rgb) {\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    }\n    cv::resize(img, img, cv::Size(w, h), 0, 0, cv::INTER_LINEAR);\n    img.convertTo(img, CV_32FC3, 1.f / 255);\n    img = (img - cv::Scalar(mean[0], mean[1], mean[2])) / cv::Scalar(std[0], std[1], std[2]);\n    std::vector<float> chw(static_cast<std::size_t>(n) * c * h * w, 0.f);\n\n    // fill every batch entry with the same input image\n    for (int i = 0; i < n; ++i) {\n        for (int y = 0; y < h; ++y) {\n            for (int x = 0; x < w; ++x) {\n                const cv::Vec3f v = img.at<cv::Vec3f>(y, x);\n                chw[i * size + 0 * h * w + y * w + x] = v[0];\n                chw[i * size + 1 * h * w + y * w + x] = v[1];\n                chw[i * size + 2 * h * w + y * w + x] = v[2];\n            }\n        }\n    }\n    return chw;\n}\n\nstatic inline 
std::vector<std::pair<int, float>> topk(const std::vector<float>& v, int64_t k) {\n    if (k <= 0)\n        return {};\n    auto s = std::min<std::ptrdiff_t>(k, static_cast<std::ptrdiff_t>(v.size()));\n\n    std::vector<int> idx(v.size());\n    std::iota(idx.begin(), idx.end(), 0);\n\n    std::partial_sort(idx.begin(), std::next(idx.begin(), s), idx.end(), [&](int a, int b) { return v[a] > v[b]; });\n\n    // iterate up to s (k clamped to v.size()) so that k > v.size() cannot read past the end of idx\n    std::vector<std::pair<int, float>> out;\n    out.reserve(static_cast<std::size_t>(s));\n    for (std::ptrdiff_t i = 0; i < s; ++i)\n        out.emplace_back(idx[i], v[idx[i]]);\n    return out;\n}\n\nstatic inline std::map<int, std::string> loadImagenetLabelMap(const std::string& path) {\n    std::map<int, std::string> labels;\n    std::ifstream in(path);\n    if (!in.is_open()) {\n        return labels;\n    }\n    std::string line;\n    while (std::getline(in, line)) {\n        auto colon = line.find(':');\n        if (colon == std::string::npos) {\n            continue;\n        }\n        auto first_quote = line.find('\\'', colon);\n        if (first_quote == std::string::npos) {\n            continue;\n        }\n        auto second_quote = line.find('\\'', first_quote + 1);\n        if (second_quote == std::string::npos) {\n            continue;\n        }\n        int idx = std::stoi(line.substr(0, colon));\n        labels[idx] = line.substr(first_quote + 1, second_quote - first_quote - 1);\n    }\n    return labels;\n}\n\nstatic inline ILayer* addTransformLayer(INetworkDefinition* network, ITensor& input, bool bgr2rgb,\n                                        const std::array<const float, 3>& mean, const std::array<const float, 3>& std) {\n    struct ScaleParams {\n        std::array<float, 3> shift;\n        std::array<float, 3> scale;\n    };\n    static std::vector<std::unique_ptr<ScaleParams>> gScaleParams;\n    auto params = std::make_unique<ScaleParams>();\n    params->shift = {-mean[0] / std[0], -mean[1] / std[1], -mean[2] / std[2]};\n    params->scale = {1.f / (std[0] * 255.f), 1.f / 
(std[1] * 255.f), 1.f / (std[2] * 255.f)};\n\n    static const Weights empty{DataType::kFLOAT, nullptr, 0ll};\n    const Weights shift{DataType::kFLOAT, params->shift.data(), 3ll};\n    const Weights scale{DataType::kFLOAT, params->scale.data(), 3ll};\n\n    gScaleParams.emplace_back(std::move(params));\n\n    ITensor* in = &input;\n    if (input.getType() != DataType::kFLOAT) {\n#if TRT_VERSION >= 8000\n        auto* cast = network->addCast(input, DataType::kFLOAT);\n        assert(cast);\n        cast->setName(\"Cast to FP32\");\n        in = cast->getOutput(0);\n#else\n        auto* identity = network->addIdentity(input);\n        assert(identity);\n        identity->setName(\"Convert to FP32\");\n        identity->setOutputType(0, DataType::kFLOAT);\n        in = identity->getOutput(0);\n#endif\n    }\n    // Convert from NHWC to NCHW\n    auto* perm = network->addShuffle(*in);\n    assert(perm);\n    perm->setName(\"NHWC -> NCHW\");\n    perm->setFirstTranspose(Permutation{0, 3, 1, 2});\n\n    // Convert from BGR to RGB (optional)\n    ITensor* data{nullptr};\n    if (bgr2rgb) {\n        auto add_slice = [&](int c, const char* name) -> ITensor* {\n            auto dims = perm->getOutput(0)->getDimensions();\n            Dims4 start = {0, c, 0, 0}, stride = {1, 1, 1, 1};\n            Dims4 size = {dims.d[0], 1, dims.d[2], dims.d[3]};\n            auto* _slice = network->addSlice(*perm->getOutput(0), start, size, stride);\n            _slice->setName(name);\n            assert(_slice && _slice->getNbOutputs() == 1);\n            return _slice->getOutput(0);\n        };\n        std::array<ITensor*, 3> channels = {add_slice(2, \"R\"), add_slice(1, \"G\"), add_slice(0, \"B\")};\n        auto* cat = network->addConcatenation(channels.data(), 3);\n        assert(cat);\n        cat->setName(\"RGB\");\n        cat->setAxis(1);\n        data = cat->getOutput(0);\n    } else {\n        data = perm->getOutput(0);\n    }\n\n    // Normalize\n    auto* trans = 
network->addScale(*data, ScaleMode::kCHANNEL, shift, scale, empty);\n    assert(trans);\n    trans->setName(\"mean & std\");\n#if TRT_VERSION >= 8000\n    trans->setChannelAxis(1);\n#endif\n    return trans;\n}\n\nstatic inline size_t getSize(DataType dt) {\n    switch (dt) {\n#if TRT_VERSION >= 8510\n        case DataType::kUINT8:\n#endif\n        case DataType::kINT8:\n            return sizeof(int8_t);\n        case DataType::kFLOAT:\n            return sizeof(float);\n        case DataType::kHALF:\n            return sizeof(int16_t);\n        case DataType::kINT32:\n            return sizeof(int32_t);\n        default: {\n            std::cerr << \"Unsupported data type\\n\";\n            std::abort();\n        }\n    }\n}\n"
  },
  {
    "path": "mlp/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nproject(\n  mlp\n  VERSION 0.1\n  LANGUAGES C CXX CUDA)\n\nif(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)\n  set(CMAKE_CUDA_ARCHITECTURES\n      60\n      70\n      72\n      75\n      80\n      86\n      89)\nendif()\n\nset(CMAKE_CXX_STANDARD 17)\nset(CMAKE_CXX_STANDARD_REQUIRED ON)\nset(CMAKE_CUDA_STANDARD 17)\nset(CMAKE_CUDA_STANDARD_REQUIRED ON)\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\nset(CMAKE_INCLUDE_CURRENT_DIR TRUE)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use static cudaruntime library\" OFF)\n\nfind_package(Threads REQUIRED)\nfind_package(CUDAToolkit REQUIRED)\n\nif(NOT TARGET TensorRT::TensorRT)\n  include(FindTensorRT.cmake)\nelse()\n  message(\"TensorRT has been found, skipping for ${PROJECT_NAME}\")\nendif()\n\nadd_executable(${PROJECT_NAME} mlp.cpp)\n\ntarget_include_directories(${PROJECT_NAME} PUBLIC ${CMAKE_CURRENT_LIST_DIR})\n\ntarget_link_libraries(${PROJECT_NAME} PUBLIC Threads::Threads CUDA::cudart\n                                             TensorRT::TensorRT)\n"
  },
  {
    "path": "mlp/FindTensorRT.cmake",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nfunction(_guess_path var_name required_files)\n  set(_result \"\")\n\n  foreach(path_entry IN LISTS ARGN)\n    if(NOT EXISTS \"${path_entry}\")\n      message(DEBUG \"skip non-existing path '${path_entry}'\")\n      continue()\n    endif()\n\n    set(_ok TRUE)\n    foreach(required_file IN LISTS required_files)\n      if(NOT EXISTS \"${path_entry}/${required_file}\")\n        set(_ok FALSE)\n        message(DEBUG \"'${path_entry}' missing '${required_file}'\")\n        break()\n      endif()\n    endforeach()\n\n    if(_ok)\n      list(APPEND _result \"${path_entry}\")\n      message(DEBUG \"accept '${path_entry}'\")\n    else()\n      message(DEBUG \"reject '${path_entry}'\")\n    endif()\n  endforeach()\n\n  if(_result STREQUAL \"\")\n    message(\n      FATAL_ERROR\n        \"_guess_path(${var_name}) failed: no valid path found. required_files='${required_files}' candidates='${ARGN}'\"\n    )\n  endif()\n\n  set(${var_name}\n      \"${_result}\"\n      PARENT_SCOPE)\nendfunction()\n\n# add library\nadd_library(TensorRT IMPORTED INTERFACE)\nadd_library(TensorRT::TensorRT ALIAS TensorRT)\n\nset(TRT_VERSION\n    CACHE\n      STRING\n      \"TensorRT version, e.g. 
\\\"8.6.1.6\\\" or \\\"8.6.1.6+cuda12.0.1.011\\\", \\\"8.6.1.6.Windows10.x86_64.cuda-12.0\\\" etc\"\n)\n\nif(NOT TRT_VERSION STREQUAL \"\" AND NOT \"$ENV{TRT_VERSION}\" STREQUAL \"\")\n  message(\n    WARNING\n      \"TRT_VERSION is defined both as a CMake variable and as an environment variable; using the latter\"\n  )\nendif()\n\nif(NOT \"$ENV{TRT_VERSION}\" STREQUAL \"\")\n  set(TRT_VERSION $ENV{TRT_VERSION})\nendif()\n\nstring(REGEX MATCH \"([0-9]+)\" _match ${TRT_VERSION})\nset(TRT_MAJOR_VERSION \"${_match}\")\nunset(_match)\n\nif(WIN32)\n  set(TensorRT_DIR \"C:/Program Files/TensorRT-${TRT_VERSION}\")\n  if(NOT EXISTS \"${TensorRT_DIR}\")\n    message(\n      FATAL_ERROR\n        \"TensorRT_DIR=${TensorRT_DIR} does not exist!\"\n    )\n  endif()\n\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 10)\n    set(_modules nvinfer_10 nvinfer_plugin_10 nvinfer_vc_plugin_10\n                 nvinfer_dispatch_10 nvinfer_lean_10)\n    message(DEBUG \"Using ${_modules}\")\n  else()\n    set(_modules nvinfer nvinfer_plugin nvinfer_vc_plugin nvinfer_dispatch\n                 nvinfer_lean)\n  endif()\n\n  set(TensorRT_LIBRARY_DIR \"${TensorRT_DIR}/lib\")\n  set(TensorRT_INCLUDE_DIR \"${TensorRT_DIR}/include\")\nelseif(UNIX)\n  string(TOLOWER \"${CMAKE_SYSTEM_PROCESSOR}\" _trt_arch)\n  set(_trt_include_candidates)\n  if(_trt_arch MATCHES \"^(aarch64|arm64|arch64)$\")\n    set(_trt_include_candidates \"/usr/include/aarch64-linux-gnu\" \"/usr/include\"\n                                \"/usr/local/cuda/targets/aarch64-linux/include\")\n    set(_trt_library_candidates\n        \"/usr/local/tensorrt/targets/aarch64-linux-gnu/lib\"\n        \"/usr/lib/aarch64-linux-gnu\" \"/usr/lib/aarch64-linux-gnu/tegra\"\n        \"/usr/lib\")\n  elseif(_trt_arch MATCHES \"^(x86_64|amd64)$\")\n    set(_trt_include_candidates\n        \"/usr/local/tensorrt/targets/x86_64-linux-gnu/include\"\n        \"/usr/include/x86_64-linux-gnu\" \"/usr/include\")\n    set(_trt_library_candidates\n        
\"/usr/local/tensorrt/targets/x86_64-linux-gnu/lib\"\n        \"/usr/lib/x86_64-linux-gnu\" \"/usr/lib\")\n  else()\n    message(FATAL_ERROR \"Unknown architecture\")\n  endif()\n\n  set(_modules nvinfer nvinfer_plugin)\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 8)\n    list(APPEND _modules nvinfer_vc_plugin nvinfer_dispatch nvinfer_lean)\n  endif()\n\n  _guess_path(TensorRT_LIBRARY_DIR \"libnvinfer.so;libnvinfer_plugin.so\"\n              ${_trt_library_candidates})\n  message(STATUS \"TensorRT libraries: ${TensorRT_LIBRARY_DIR}\")\n  _guess_path(TensorRT_INCLUDE_DIR \"NvInfer.h\" ${_trt_include_candidates})\n  message(STATUS \"TensorRT includes: ${TensorRT_INCLUDE_DIR}\")\nendif()\n\nforeach(lib IN LISTS _modules)\n  find_library(\n    TensorRT_${lib}_LIBRARY\n    NAMES ${lib}\n    HINTS ${TensorRT_LIBRARY_DIR})\n  list(APPEND TensorRT_LIBRARIES ${TensorRT_${lib}_LIBRARY})\nendforeach()\n\ntarget_link_libraries(TensorRT INTERFACE ${TensorRT_LIBRARIES})\n\nmessage(STATUS \"Found TensorRT libs: ${TensorRT_LIBRARIES}\")\n\nset_target_properties(\n  TensorRT\n  PROPERTIES C_STANDARD 17\n             CXX_STANDARD 17\n             POSITION_INDEPENDENT_CODE ON\n             SKIP_BUILD_RPATH TRUE\n             BUILD_WITH_INSTALL_RPATH TRUE\n             INSTALL_RPATH \"$ORIGIN\"\n             INTERFACE_INCLUDE_DIRECTORIES \"${TensorRT_INCLUDE_DIR}\")\n\nunset(TRT_MAJOR_VERSION)\nunset(_modules)\nunset(_trt_include_candidates)\nunset(_trt_library_candidates)\nunset(_trt_arch)\n"
  },
  {
    "path": "mlp/README.md",
    "content": "# mlp\n\nMLP is the most basic net in this tensorrtx project for starters. You can learn the basic procedures of building TensorRT app from the provided APIs. The process of building a TensorRT engine explained in the chart below.\n\n![TensorRT Image](https://user-images.githubusercontent.com/33795294/148565279-795b12da-5243-4e7e-881b-263eb7658683.jpg)\n\nThis demo creates a single-layer MLP with `TensorRT >= 7.2.x` version support.\n\n## Helper Files\n\n`logging.h` : A logger file for using NVIDIA TensorRT API (mostly same for all models)\n\n`mlp.wts` : Converted weight file, can be generated from [pytorchx/mlp](https://github.com/wang-xinyu/pytorchx/tree/master/mlp), for mlp, it looks like:\n\n```bash\n2\nlinear.weight 1 3fff7e32\nlinear.bias 1 3c138a5a\n```\n\n(you can create `mlp.wts` and copy this content into it directly)\n\n## TensorRT C++ API\n\nsee [HERE](../README.md#how-to-run)\n\n## TensorRT Python API\n\n1. Generate mlp.wts (from `pytorchx` or create on your own)\n\n2. Put mlp.wts into tensorrtx/mlp (if using the generated weights)\n\n3. Run\n   ```bash\n   cd tensorrtx/mlp\n   python mlp.py -s   # serialize model to plan file, i.e. 'mlp.engine'\n   python mlp.py -d   # deserialize plan file and run inference\n   ```\n"
  },
  {
    "path": "mlp/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <cstdint>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include <utility>\n#include \"NvInferRuntime.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(std::move(prefix)), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) noexcept\n        : mOutput(other.mOutput), mPrefix(std::move(other.mPrefix)), mShouldLog(other.mShouldLog) {}\n\n    ~LogStreamConsumerBuffer() override {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing 
the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    int sync() override {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mBuffer(stream, std::move(prefix), shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other) noexcept\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   private:\n    struct TestInfo;\n\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult : std::uint8_t {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << '\\n';\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! 
This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, TestInfo info)\n            : mStarted(started), mName(std::move(info.name)), mCmdline(std::move(info.cmdline)) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom{false, TestInfo{name, cmdline}};\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! 
\\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    [[nodiscard]] Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    struct TestInfo {\n        std::string name;\n        std::string cmdline;\n    };\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << '\\n';\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kVERBOSE};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINFO};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kWARNING};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kERROR};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINTERNAL_ERROR};\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "mlp/macros.h",
    "content": "#pragma once\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#define TRT_VERSION \\\n    ((NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + (NV_TENSORRT_PATCH * 10) + NV_TENSORRT_BUILD)\n\n#if TRT_VERSION < 7220\n#error \"TensorRT >= 7.2.2 is required for this demo.\"\n#endif\n\n#if TRT_VERSION >= 8000\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n"
  },
  {
    "path": "mlp/mlp.cpp",
    "content": "#include <array>\n#include <chrono>\n#include <iostream>\n#include <numeric>\n#include <vector>\n#include \"logging.h\"\n#include \"utils.h\"\n\nusing namespace nvinfer1;\n\nconstexpr static const int64_t INPUT_SIZE = 1;\nconstexpr static const int64_t OUTPUT_SIZE = 1;\nconstexpr static const char* INPUT_NAME = \"data\";\nconstexpr static const char* OUTPUT_NAME = \"out\";\nconstexpr static const char* WTS_PATH = \"../models/mlp.wts\";\nconstexpr static const char* ENGINE_PATH = \"../models/mlp.engine\";\n\n// Logger from TRT API\nstatic Logger gLogger;\n\n/**\n * Create a single-layer \"MLP\" using the TRT Builder and Configurations\n *\n * @param N: max batch size for built TRT model\n * @param builder: to build engine and networks\n * @param config: configuration related to Hardware\n * @param dt: datatype for model layers\n * @return engine: TRT model\n */\nICudaEngine* createMLPEngine(int32_t N, IRuntime* runtime, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    std::cout << \"[INFO]: Creating MLP using TensorRT...\\n\";\n\n    // Load Weights from relevant file\n    std::map<std::string, Weights> weightMap = loadWeights(WTS_PATH);\n\n    // Create an empty network\n#if TRT_VERSION >= 10000\n    auto* network = builder->createNetworkV2(0);\n#else\n    auto* network = builder->createNetworkV2(1u << static_cast<int>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n#endif\n\n    // Create an input with proper name\n    ITensor* data = network->addInput(INPUT_NAME, dt, Dims4{N, 1, 1, 1});\n    assert(data);\n\n    // all tensors\n    auto* fc1w = network->addConstant(Dims4{1, 1, 1, 1}, weightMap[\"linear.weight\"])->getOutput(0);\n    auto* fc1b = network->addConstant(Dims4{1, 1, 1, 1}, weightMap[\"linear.bias\"])->getOutput(0);\n    assert(fc1w && fc1b);\n    // fc layer\n    auto* fc1_0 = network->addMatrixMultiply(*data, MatrixOperation::kNONE, *fc1w, MatrixOperation::kTRANSPOSE);\n    auto* fc1_1 = 
network->addElementWise(*fc1_0->getOutput(0), *fc1b, ElementWiseOperation::kSUM);\n    assert(fc1_0 && fc1_1);\n    fc1_0->setName(\"fc1_0\");\n\n    // set output with name\n    auto* output = fc1_1->getOutput(0);\n    output->setName(OUTPUT_NAME);\n\n    // mark the output\n    network->markOutput(*output);\n\n#if TRT_VERSION >= 8000\n    IHostMemory* serialized_mem = builder->buildSerializedNetwork(*network, *config);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(serialized_mem->data(), serialized_mem->size());\n    delete network;\n#else\n    builder->setMaxBatchSize(N);\n    config->setMaxWorkspaceSize(WORKSPACE_SIZE);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    network->destroy();\n#endif\n    assert(engine != nullptr);\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(int32_t maxBatchSize, IRuntime* runtime, IHostMemory** modelStream) {\n    /**\n     * Create engine using TensorRT APIs\n     *\n     * @param maxBatchSize: for the deployed model configs\n     * @param modelStream: shared memory to store serialized model\n     */\n\n    // Create builder with the logger\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Build an engine\n    ICudaEngine* engine = createMLPEngine(maxBatchSize, runtime, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // serialize the engine into binary stream\n    (*modelStream) = engine->serialize();\n\n#if TRT_VERSION >= 8000\n    delete engine;\n    delete config;\n    delete builder;\n#else\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n#endif\n}\n\nvoid doInference(IExecutionContext& ctx, void* input, float* output, int64_t batchSize = 1) {\n    /**\n     * Perform inference using the CUDA ctx\n     *\n     * @param ctx: context created by 
engine\n     * @param input: input from the host\n     * @param output: output to save on host\n     * @param batchSize: batch size for TRT model\n     */\n    // Get engine from the ctx\n    const ICudaEngine& engine = ctx.getEngine();\n\n#if TRT_VERSION >= 8000\n    int32_t nIO = engine.getNbIOTensors();\n    const int inputIndex = 0;\n    const int outputIndex = engine.getNbIOTensors() - 1;\n#else\n    int32_t nIO = engine.getNbBindings();\n    const int inputIndex = engine.getBindingIndex(INPUT_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_NAME);\n#endif\n    assert(nIO == 2);  // mlp contains 1 input and 1 output\n\n    // create cuda stream for async cuda operations\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // create GPU buffers on cuda device and copy input data from host\n    std::vector<void*> buffers(nIO, nullptr);\n    size_t inputSize = 0;\n    size_t outputSize = batchSize * OUTPUT_SIZE * sizeof(float);\n#if TRT_VERSION >= 8000\n    auto* input_name = engine.getIOTensorName(inputIndex);\n    inputSize = batchSize * INPUT_SIZE * getSize(engine.getTensorDataType(input_name));\n#else\n    inputSize = batchSize * INPUT_SIZE * getSize(engine.getBindingDataType(inputIndex));\n#endif\n    CHECK(cudaMalloc(&buffers[inputIndex], inputSize));\n    CHECK(cudaMalloc(&buffers[outputIndex], outputSize));\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, inputSize, cudaMemcpyHostToDevice, stream));\n\n    // execute inference using ctx provided by engine\n#if TRT_VERSION >= 8000\n    for (int32_t i = 0; i < engine.getNbIOTensors(); i++) {\n        auto const name = engine.getIOTensorName(i);\n        auto dims = ctx.getTensorShape(name);\n        auto total = std::accumulate(dims.d, dims.d + dims.nbDims, 1ll, std::multiplies<>());\n        std::cout << name << \"\\t\" << total << \"\\n\";\n        ctx.setTensorAddress(name, buffers[i]);\n    }\n    // call enqueue outside assert() so it still runs when NDEBUG is defined\n    if (!ctx.enqueueV3(stream)) {\n        std::cerr << \"[ERROR]: enqueueV3 failed\\n\";\n        std::abort();\n    }\n#else\n    
if (!ctx.enqueueV2(buffers.data(), stream, nullptr)) {  // kept outside assert() so it runs under NDEBUG\n        std::cerr << \"[ERROR]: enqueueV2 failed\\n\";\n        std::abort();\n    }\n#endif\n\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], outputSize, cudaMemcpyDeviceToHost, stream));\n    CHECK(cudaStreamSynchronize(stream));\n    for (auto& buffer : buffers) {\n        CHECK(cudaFree(buffer));\n    }\n    CHECK(cudaStreamDestroy(stream));\n}\n\nint main(int argc, char** argv) {\n    checkTrtEnv();\n    if (argc != 2) {\n        std::cerr << \"[ERROR]: Arguments not right!\\n\";\n        std::cerr << \"./mlp -s   // serialize model to plan file\\n\";\n        std::cerr << \"./mlp -d   // deserialize plan file and run inference\\n\";\n        return 1;\n    }\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    char* trtModelStream{nullptr};\n    std::streamsize size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, runtime, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(ENGINE_PATH, std::ios::binary | std::ios::trunc);\n        if (!p.good()) {\n            std::cerr << \"could not open plan output file\\n\";\n            return 1;\n        }\n        if (modelStream->size() > static_cast<std::size_t>(std::numeric_limits<std::streamsize>::max())) {\n            std::cerr << \"this model is too large to serialize\\n\";\n            return -1;\n        }\n        const auto* data_ptr = reinterpret_cast<const char*>(modelStream->data());\n        auto data_size = static_cast<std::streamsize>(modelStream->size());\n        p.write(data_ptr, data_size);\n\n#if TRT_VERSION >= 8000\n        delete modelStream;\n#else\n        modelStream->destroy();\n#endif\n        std::cout << \"[INFO]: Successfully created TensorRT engine.\\n\";\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(ENGINE_PATH, std::ios::binary);\n\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = 
file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        } else {\n            // without a readable plan file there is nothing to deserialize\n            std::cerr << \"could not open plan input file\\n\";\n            return 1;\n        }\n    } else {\n        // reject unknown flags instead of deserializing a null buffer\n        std::cerr << \"[ERROR]: Arguments not right!\\n\";\n        return 1;\n    }\n\n#if TRT_VERSION >= 8000\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n#else\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n#endif\n    assert(engine != nullptr);\n    delete[] trtModelStream;\n\n    IExecutionContext* ctx = engine->createExecutionContext();\n    assert(ctx != nullptr);\n\n    std::array<float, 1> output = {-1.f};\n    std::array<float, 1> input = {12.0f};\n\n    for (int i = 0; i < 100; i++) {\n        auto start = std::chrono::high_resolution_clock::now();\n        doInference(*ctx, input.data(), output.data());\n        auto end = std::chrono::high_resolution_clock::now();\n        auto time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();\n        std::cout << \"Execution time: \" << time << \"us\\n\"\n                  << \"output: \" << output[0] << \"\\n\";\n    }\n\n#if TRT_VERSION >= 8000\n    delete ctx;\n    delete engine;\n    delete runtime;\n#else\n    ctx->destroy();\n    engine->destroy();\n    runtime->destroy();\n#endif\n\n    return 0;\n}\n"
  },
  {
    "path": "mlp/mlp.py",
    "content": "import argparse\nimport os\nimport numpy as np\nimport struct\n\n# required for the model creation\nimport tensorrt as trt\n\n# required for the inference using TRT engine\nimport pycuda.driver as cuda\n\n# Sizes of input and output for TensorRT model\nINPUT_SIZE = 1\nOUTPUT_SIZE = 1\n\n# path of .wts (weight file) and .engine (model file)\nWEIGHT_PATH = \"./mlp.wts\"\nENGINE_PATH = \"./mlp.engine\"\n\n# input and output names are must for the TRT model\nINPUT_BLOB_NAME = 'data'\nOUTPUT_BLOB_NAME = 'out'\n\n# A logger provided by NVIDIA-TRT\ngLogger = trt.Logger(trt.Logger.INFO)\n\n\n################################\n# DEPLOYMENT RELATED ###########\n################################\ndef load_weights(file_path):\n    \"\"\"\n    Parse the .wts file and store weights in dict format\n    :param file_path:\n    :return weight_map: dictionary containing weights and their values\n    \"\"\"\n    print(f\"[INFO]: Loading weights: {file_path}\")\n    assert os.path.exists(file_path), '[ERROR]: Unable to load weight file.'\n\n    weight_map = {}\n    with open(file_path, \"r\") as f:\n        lines = [line.strip() for line in f]\n\n    # count for total # of weights\n    count = int(lines[0])\n    assert count == len(lines) - 1\n\n    # Loop through counts and get the exact num of values against weights\n    for i in range(1, count + 1):\n        splits = lines[i].split(\" \")\n        name = splits[0]\n        cur_count = int(splits[1])\n\n        # len of splits must be greater than current weight counts\n        assert cur_count + 2 == len(splits)\n\n        # loop through all weights and unpack from the hexadecimal values\n        values = []\n        for j in range(2, len(splits)):\n            # hex string to bytes to float\n            values.append(struct.unpack(\">f\", bytes.fromhex(splits[j])))\n\n        # store in format of { 'weight.name': [weights_val0, weight_val1, ..] 
}\n        weight_map[name] = np.array(values, dtype=np.float32)\n\n    return weight_map\n\n\ndef create_mlp_engine(max_batch_size, builder, config, dt):\n    \"\"\"\n    Create Multi-Layer Perceptron using the TRT Builder and Configurations\n    :param max_batch_size: batch size for built TRT model\n    :param builder: to build engine and networks\n    :param config: configuration related to Hardware\n    :param dt: datatype for model layers\n    :return engine: TRT model\n    \"\"\"\n    print(\"[INFO]: Creating MLP using TensorRT...\")\n    # load weight maps from the file\n    weight_map = load_weights(WEIGHT_PATH)\n\n    # build an empty network using builder\n    network = builder.create_network()\n\n    # add an input to network using the *input-name\n    data = network.add_input(INPUT_BLOB_NAME, dt, (1, 1, INPUT_SIZE))\n    assert data\n\n    # add the layer with output-size (number of outputs)\n    linear = network.add_fully_connected(input=data,\n                                         num_outputs=OUTPUT_SIZE,\n                                         kernel=weight_map['linear.weight'],\n                                         bias=weight_map['linear.bias'])\n    assert linear\n\n    # set the name for output layer\n    linear.get_output(0).name = OUTPUT_BLOB_NAME\n\n    # mark this layer as final output layer\n    network.mark_output(linear.get_output(0))\n\n    # set the batch size of current builder\n    builder.max_batch_size = max_batch_size\n\n    # create the engine with model and hardware configs\n    engine = builder.build_engine(network, config)\n\n    # free captured memory\n    del network\n    del weight_map\n\n    # return engine\n    return engine\n\n\ndef api_to_model(max_batch_size):\n    \"\"\"\n    Create engine using TensorRT APIs\n    :param max_batch_size: for the deployed model configs\n    :return:\n    \"\"\"\n    # Create Builder with logger provided by TRT\n    builder = trt.Builder(gLogger)\n\n    # Create configurations 
from Engine Builder\n    config = builder.create_builder_config()\n\n    # Create MLP Engine\n    engine = create_mlp_engine(max_batch_size, builder, config, trt.float32)\n    assert engine\n\n    # Write the engine into binary file\n    print(\"[INFO]: Writing engine into binary...\")\n    with open(ENGINE_PATH, \"wb\") as f:\n        # write serialized model in file\n        f.write(engine.serialize())\n\n    # free the memory\n    del engine\n    del builder\n\n\n################################\n# INFERENCE RELATED ############\n################################\ndef perform_inference(input_val):\n    \"\"\"\n    Get inference using the pre-trained model\n    :param input_val: a number as an input\n    :return:\n    \"\"\"\n\n    def do_inference(inf_context, inf_host_in, inf_host_out):\n        \"\"\"\n        Perform inference using the CUDA context\n        :param inf_context: context created by engine\n        :param inf_host_in: input from the host\n        :param inf_host_out: output to save on host\n        :return:\n        \"\"\"\n\n        inference_engine = inf_context.engine\n        # Input and output bindings are required for inference\n        assert inference_engine.num_bindings == 2\n\n        # allocate memory in GPU using CUDA bindings\n        device_in = cuda.mem_alloc(inf_host_in.nbytes)\n        device_out = cuda.mem_alloc(inf_host_out.nbytes)\n\n        # create bindings for input and output\n        bindings = [int(device_in), int(device_out)]\n\n        # create CUDA stream for simultaneous CUDA operations\n        stream = cuda.Stream()\n\n        # copy input from host (CPU) to device (GPU)  in stream\n        cuda.memcpy_htod_async(device_in, inf_host_in, stream)\n\n        # execute inference using context provided by engine\n        inf_context.execute_async(bindings=bindings, stream_handle=stream.handle)\n\n        # copy output back from device (GPU) to host (CPU)\n        cuda.memcpy_dtoh_async(inf_host_out, device_out, 
stream)\n\n        # synchronize the stream to prevent issues\n        #       (block CUDA and wait for CUDA operations to be completed)\n        stream.synchronize()\n\n    # create a runtime (required for deserialization of model) with NVIDIA's logger\n    runtime = trt.Runtime(gLogger)\n    assert runtime\n\n    # read and deserialize engine for inference\n    with open(ENGINE_PATH, \"rb\") as f:\n        engine = runtime.deserialize_cuda_engine(f.read())\n    assert engine\n\n    # create execution context -- required for inference executions\n    context = engine.create_execution_context()\n    assert context\n\n    # create input as array\n    data = np.array([input_val], dtype=np.float32)\n\n    # capture free memory for input in GPU\n    host_in = cuda.pagelocked_empty((INPUT_SIZE), dtype=np.float32)\n\n    # copy input-array from CPU to Flatten array in GPU\n    np.copyto(host_in, data.ravel())\n\n    # capture free memory for output in GPU\n    host_out = cuda.pagelocked_empty(OUTPUT_SIZE, dtype=np.float32)\n\n    # do inference using required parameters\n    do_inference(context, host_in, host_out)\n\n    print(f'\\n[INFO]: Predictions using pre-trained model..\\n\\tInput:\\t{input_val}\\n\\tOutput:\\t{host_out[0]:.4f}')\n\n\ndef get_args():\n    \"\"\"\n    Parse command line arguments\n    :return arguments: parsed arguments\n    \"\"\"\n    arg_parser = argparse.ArgumentParser()\n    arg_parser.add_argument('-s', action='store_true')\n    arg_parser.add_argument('-d', action='store_true')\n    arguments = vars(arg_parser.parse_args())\n    # check for the arguments\n    if not (arguments['s'] ^ arguments['d']):\n        print(\"[ERROR]: Arguments not right!\\n\")\n        print(\"\\tpython mlp.py -s   # serialize model to engine file\")\n        print(\"\\tpython mlp.py -d   # deserialize engine file and run inference\")\n        exit()\n\n    return arguments\n\n\nif __name__ == \"__main__\":\n    args = get_args()\n    if args['s']:\n        
api_to_model(max_batch_size=1)\n        print(\"[INFO]: Successfully created TensorRT engine...\")\n        print(\"\\n\\tRun inference using `python mlp.py -d`\\n\")\n    else:\n        perform_inference(input_val=4.0)\n"
  },
  {
    "path": "mlp/utils.h",
    "content": "#pragma once\n#include <cuda_runtime_api.h>\n#include <cassert>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <stdexcept>\n#include <string>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\nconstexpr const std::size_t WORKSPACE_SIZE = 16 << 20;\n\n#define CHECK(status)                                     \\\n    do {                                                  \\\n        auto ret = (status);                              \\\n        if (ret != cudaSuccess) {                         \\\n            std::cerr << \"Cuda failure: \" << ret << \"\\n\"; \\\n            std::abort();                                 \\\n        }                                                 \\\n    } while (0)\n\nstatic void checkTrtEnv(int device = 0) {\n#if TRT_VERSION < 8000\n    CHECK(cudaGetDevice(&device));\n    cudaDeviceProp prop{};\n    CHECK(cudaGetDeviceProperties(&prop, device));\n    const int sm = prop.major * 10 + prop.minor;\n    if (sm > 86) {\n        std::cerr << \"TensorRT < 8 does not support SM > 86 on this GPU.\";\n        std::abort();\n    }\n#endif\n}\n\n/**\n * @brief TensorRT weight files have a simple space delimited format:\n * [type] [size] <data x size in hex>\n * \n * @param file input weight file path\n * @return std::map<std::string, nvinfer1::Weights> \n */\nstatic auto loadWeights(const std::string& file) {\n    std::cout << \"Loading weights: \" << file << \"\\n\";\n    std::map<std::string, nvinfer1::Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> wt.count;\n\n        // Load 
blob\n        auto* val = new uint32_t[wt.count];\n        input >> std::hex;\n        for (auto x = 0ll; x < wt.count; ++x) {\n            input >> val[x];\n        }\n        wt.values = val;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nstatic size_t getSize(DataType dt) {\n    switch (dt) {\n#if TRT_VERSION >= 8510\n        case DataType::kUINT8:\n#endif\n        case DataType::kINT8:\n            return sizeof(int8_t);\n        case DataType::kFLOAT:\n            return sizeof(float);\n        case DataType::kHALF:\n            return sizeof(int16_t);\n        case DataType::kINT32:\n            return sizeof(int32_t);\n        default: {\n            std::cerr << \"Unsupported data type\\n\";\n            std::abort();\n        }\n    }\n}\n"
  },
  {
    "path": "mnasnet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.14)\n\nproject(\n  mnasnet\n  VERSION 0.1\n  LANGUAGES C CXX CUDA)\n\nif(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)\n  set(CMAKE_CUDA_ARCHITECTURES\n      60\n      70\n      72\n      75\n      80\n      86\n      89)\nendif()\n\nset(CMAKE_CXX_STANDARD 17)\nset(CMAKE_CXX_STANDARD_REQUIRED ON)\nset(CMAKE_CUDA_STANDARD 17)\nset(CMAKE_CUDA_STANDARD_REQUIRED ON)\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\nset(CMAKE_INCLUDE_CURRENT_DIR TRUE)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use static cudaruntime library\" OFF)\n\nfind_package(Threads REQUIRED)\nfind_package(CUDAToolkit REQUIRED)\nfind_package(OpenCV REQUIRED)\n\nif(NOT TARGET TensorRT::TensorRT)\n  include(FindTensorRT.cmake)\nelse()\n  message(\"TensorRT has been found, skipping for ${PROJECT_NAME}\")\nendif()\n\nadd_executable(${PROJECT_NAME} mnasnet.cpp)\n\ntarget_include_directories(${PROJECT_NAME} PUBLIC ${CMAKE_CURRENT_LIST_DIR}\n                                                  ${OpenCV_INCLUDE_DIRS})\n\ntarget_link_libraries(${PROJECT_NAME} PUBLIC Threads::Threads CUDA::cudart\n                                             TensorRT::TensorRT ${OpenCV_LIBS})\n"
  },
  {
    "path": "mnasnet/FindTensorRT.cmake",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nfunction(_guess_path var_name required_files)\n  set(_result \"\")\n\n  foreach(path_entry IN LISTS ARGN)\n    if(NOT EXISTS \"${path_entry}\")\n      message(DEBUG \"skip non-existing path '${path_entry}'\")\n      continue()\n    endif()\n\n    set(_ok TRUE)\n    foreach(required_file IN LISTS required_files)\n      if(NOT EXISTS \"${path_entry}/${required_file}\")\n        set(_ok FALSE)\n        message(DEBUG \"'${path_entry}' missing '${required_file}'\")\n        break()\n      endif()\n    endforeach()\n\n    if(_ok)\n      list(APPEND _result \"${path_entry}\")\n      message(DEBUG \"accept '${path_entry}'\")\n    else()\n      message(DEBUG \"reject '${path_entry}'\")\n    endif()\n  endforeach()\n\n  if(_result STREQUAL \"\")\n    message(\n      FATAL_ERROR\n        \"_guess_path(${var_name}) failed: no valid path found. required_files='${required_files}' candidates='${ARGN}'\"\n    )\n  endif()\n\n  set(${var_name}\n      \"${_result}\"\n      PARENT_SCOPE)\nendfunction()\n\n# add library\nadd_library(TensorRT IMPORTED INTERFACE)\nadd_library(TensorRT::TensorRT ALIAS TensorRT)\n\nset(TRT_VERSION\n    CACHE\n      STRING\n      \"TensorRT version, e.g. 
\\\"8.6.1.6\\\" or \\\"8.6.1.6+cuda12.0.1.011\\\", \\\"8.6.1.6.Windows10.x86_64.cuda-12.0\\\" etc\"\n)\n\nif(NOT TRT_VERSION STREQUAL \"\" AND NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  message(\n    WARNING\n      \"TRT_VERSION is defined both as a CMake variable and as an environment variable; using the environment variable\"\n  )\nendif()\n\nif(NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  set(TRT_VERSION $ENV{TRT_VERSION})\nendif()\n\nstring(REGEX MATCH \"([0-9]+)\" _match ${TRT_VERSION})\nset(TRT_MAJOR_VERSION \"${_match}\")\nunset(_match)\n\nif(WIN32)\n  set(TensorRT_DIR \"C:/Program Files/TensorRT-${TRT_VERSION}\")\n  if(NOT EXISTS \"${TensorRT_DIR}\")\n    message(\n      FATAL_ERROR\n        \"TensorRT_DIR=${TensorRT_DIR} does not exist!\"\n    )\n  endif()\n\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 10)\n    set(_modules nvinfer_10 nvinfer_plugin_10 nvinfer_vc_plugin_10\n                 nvinfer_dispatch_10 nvinfer_lean_10)\n    message(DEBUG \"Using ${_modules}\")\n  else()\n    set(_modules nvinfer nvinfer_plugin nvinfer_vc_plugin nvinfer_dispatch\n                 nvinfer_lean)\n  endif()\n\n  set(TensorRT_LIBRARY_DIR \"${TensorRT_DIR}/lib\")\n  set(TensorRT_INCLUDE_DIR \"${TensorRT_DIR}/include\")\nelseif(UNIX)\n  string(TOLOWER \"${CMAKE_SYSTEM_PROCESSOR}\" _trt_arch)\n  set(_trt_include_candidates)\n  if(_trt_arch MATCHES \"^(aarch64|arm64|arch64)$\")\n    set(_trt_include_candidates \"/usr/include/aarch64-linux-gnu\" \"/usr/include\"\n                                \"/usr/local/cuda/targets/aarch64-linux/include\")\n    set(_trt_library_candidates\n        \"/usr/local/tensorrt/targets/aarch64-linux-gnu/lib\"\n        \"/usr/lib/aarch64-linux-gnu\" \"/usr/lib/aarch64-linux-gnu/tegra\"\n        \"/usr/lib\")\n  elseif(_trt_arch MATCHES \"^(x86_64|amd64)$\")\n    set(_trt_include_candidates\n        \"/usr/local/tensorrt/targets/x86_64-linux-gnu/include\"\n        \"/usr/include/x86_64-linux-gnu\" \"/usr/include\")\n    set(_trt_library_candidates\n        
\"/usr/local/tensorrt/targets/x86_64-linux-gnu/lib\"\n        \"/usr/lib/x86_64-linux-gnu\" \"/usr/lib\")\n  else()\n    message(FATAL_ERROR \"Unknown architecture\")\n  endif()\n\n  set(_modules nvinfer nvinfer_plugin)\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 8)\n    list(APPEND _modules nvinfer_vc_plugin nvinfer_dispatch nvinfer_lean)\n  endif()\n\n  _guess_path(TensorRT_LIBRARY_DIR \"libnvinfer.so;libnvinfer_plugin.so\"\n              ${_trt_library_candidates})\n  message(STATUS \"TensorRT libraries: ${TensorRT_LIBRARY_DIR}\")\n  _guess_path(TensorRT_INCLUDE_DIR \"NvInfer.h\" ${_trt_include_candidates})\n  message(STATUS \"TensorRT includes: ${TensorRT_INCLUDE_DIR}\")\nendif()\n\nforeach(lib IN LISTS _modules)\n  find_library(\n    TensorRT_${lib}_LIBRARY\n    NAMES ${lib}\n    HINTS ${TensorRT_LIBRARY_DIR})\n  list(APPEND TensorRT_LIBRARIES ${TensorRT_${lib}_LIBRARY})\nendforeach()\n\ntarget_link_libraries(TensorRT INTERFACE ${TensorRT_LIBRARIES})\n\nmessage(STATUS \"Found TensorRT libs: ${TensorRT_LIBRARIES}\")\n\nset_target_properties(\n  TensorRT\n  PROPERTIES C_STANDARD 17\n             CXX_STANDARD 17\n             POSITION_INDEPENDENT_CODE ON\n             SKIP_BUILD_RPATH TRUE\n             BUILD_WITH_INSTALL_RPATH TRUE\n             INSTALL_RPATH \"$ORIGIN\"\n             INTERFACE_INCLUDE_DIRECTORIES \"${TensorRT_INCLUDE_DIR}\")\n\nunset(TRT_MAJOR_VERSION)\nunset(_modules)\nunset(_trt_include_candidates)\nunset(_trt_library_candidates)\nunset(_trt_arch)\n"
  },
  {
    "path": "mnasnet/README.md",
    "content": "# mnasnet\n\nMnasNet with a depth multiplier of 0.5, from\n\"MnasNet: Platform-Aware Neural Architecture Search for Mobile\" <https://arxiv.org/pdf/1807.11626.pdf>\n\nFor the PyTorch implementation, refer to [pytorchx/mnasnet](https://github.com/wang-xinyu/pytorchx/tree/master/mnasnet)\n\nNo special tricks are needed for this model; it only uses group convolution and batch normalization.\n\n- The batchnorm layer is implemented with a scale layer.\n\n## Usage\n\n1. Use `gen_wts.py` to generate the wts file\n\n```bash\npython gen_wts.py\n```\n\n2. Build the C++ code\n\n```bash\npushd tensorrtx/mnasnet\ncmake -S . -B build -G Ninja --fresh\ncmake --build build\n```\n\n3. Serialize the wts model to an engine file\n\n```bash\n./build/mnasnet -s\n```\n\n4. Run inference\n\n```bash\n./build/mnasnet -d\n```\n\nThe output looks like:\n\n```bash\n...\n====\nExecution time: 0ms\n-2.024, -1.266, -1.602, -1.465, -0.7756, -0.2096, 0.05945, 1.342, -0.2382, 1.279, 1.251, 0.2579, 1.836, -0.5296, 0.3196, 0.9055, -0.4915, 0.1604, -0.6305, -0.1019, -0.8816,\n====\nprediction result:\nTop: 0 idx: 285, logits: 4.869, label: Egyptian cat\nTop: 1 idx: 281, logits: 4.837, label: tabby, tabby cat\nTop: 2 idx: 282, logits: 4.019, label: tiger cat\n```\n"
  },
  {
    "path": "mnasnet/gen_wts.py",
    "content": "import struct\n\nimport cv2\nimport numpy as np\nimport torch\nfrom torchvision.models import mnasnet0_5\n\n\nMODELS = [(\"mnasnet0_5\", mnasnet0_5(pretrained=True))]\n\n\ndef read_imagenet_labels() -> dict[int, str]:\n    \"\"\"\n    read ImageNet 1000 labels\n\n    Returns:\n        dict[int, str]: labels dict\n    \"\"\"\n    clsid2label = {}\n    with open(\"../assets/imagenet1000_clsidx_to_labels.txt\", \"r\") as f:\n        for i in f.readlines():\n            k, v = i.split(\": \")\n            clsid2label.setdefault(int(k), v[1:-3])\n    return clsid2label\n\n\ndef preprocess(img: np.array) -> torch.Tensor:\n    \"\"\"\n    a preprocess method align with ImageNet dataset\n\n    Args:\n        img (np.array): input image\n\n    Returns:\n        torch.Tensor: preprocessed image in `NCHW` layout\n    \"\"\"\n    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0\n    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)\n    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)\n    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)\n    img = (img - mean) / std\n    img = img.transpose(2, 0, 1)[None, ...]\n    return torch.from_numpy(img)\n\n\ndef main():\n    labels = read_imagenet_labels()\n\n    img = preprocess(cv2.imread(\"../assets/cats.jpg\", cv2.IMREAD_COLOR))\n    for name, model in MODELS:\n        model.eval()\n        with torch.inference_mode():\n            output = model(img)\n        for i, batch in enumerate(torch.topk(output, k=3).indices):\n            for j, idx in enumerate(batch):\n                print(f\"\\tBatch: {i}, Top: {j}, logits: {output[i][idx]:.4f}, label: {labels[int(idx)]}\")\n        print(f\"{'=' * 32}\")\n\n        with open(f\"../models/{name}.wts\", \"w\") as f:\n            f.write(\"{}\\n\".format(len(model.state_dict().keys())))\n            for k, v in model.state_dict().items():\n                print(\"key: \", k)\n                print(\"value: \", 
v.shape)\n                vr = v.reshape(-1).cpu().numpy()\n                f.write(\"{} {}\".format(k, len(vr)))\n                for vv in vr:\n                    f.write(\" \")\n                    f.write(struct.pack(\">f\", float(vv)).hex())\n                f.write(\"\\n\")\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "mnasnet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <cstdint>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include <utility>\n#include \"NvInferRuntime.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(std::move(prefix)), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) noexcept\n        : mOutput(other.mOutput), mPrefix(std::move(other.mPrefix)), mShouldLog(other.mShouldLog) {}\n\n    ~LogStreamConsumerBuffer() override {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing 
the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    int sync() override {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mBuffer(stream, std::move(prefix), shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other) noexcept\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   private:\n    struct TestInfo;\n\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult : std::uint8_t {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << '\\n';\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! 
This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, TestInfo info)\n            : mStarted(started), mName(std::move(info.name)), mCmdline(std::move(info.cmdline)) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom{false, TestInfo{name, cmdline}};\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! 
\\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    [[nodiscard]] Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    struct TestInfo {\n        std::string name;\n        std::string cmdline;\n    };\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << '\\n';\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kVERBOSE};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINFO};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kWARNING};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kERROR};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINTERNAL_ERROR};\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "mnasnet/macros.h",
    "content": "#pragma once\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#define TRT_VERSION \\\n    ((NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + (NV_TENSORRT_PATCH * 10) + NV_TENSORRT_BUILD)\n\n#if TRT_VERSION < 7220\n#error \"TensorRT >= 7.2.2 is required for this demo.\"\n#endif\n\n#if TRT_VERSION >= 8000\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n"
  },
  {
    "path": "mnasnet/mnasnet.cpp",
    "content": "#include <NvInfer.h>\n#include <chrono>\n#include <cmath>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"logging.h\"\n\n#include \"utils.h\"\n\n// stuff we know about mnasnet and the input/output blobs\nstatic constexpr const int INPUT_H = 224;\nstatic constexpr const int INPUT_W = 224;\nstatic constexpr const int OUTPUT_SIZE = 1000;\nstatic constexpr int N = 1;\nstatic constexpr const std::array<const char*, 2> NAMES = {\"data\", \"prob\"};\nstatic constexpr const std::array<const int, 2> SIZES = {3 * INPUT_H * INPUT_W, OUTPUT_SIZE};\nstatic const std::string WTS_PATH = \"../models/mnasnet0_5.wts\";\nstatic const std::string ENGINE_PATH = \"../models/mnasnet0_5.engine\";\nstatic constexpr const char* LABELS_PATH = \"../assets/imagenet1000_clsidx_to_labels.txt\";\nstatic constexpr const bool TRT_PREPROCESS = TRT_VERSION >= 8510 ? true : false;\nstatic constexpr const std::array<const float, 3> mean = {0.485f, 0.456f, 0.406f};\nstatic constexpr const std::array<const float, 3> stdv = {0.229f, 0.224f, 0.225f};\n\nusing namespace nvinfer1;\nusing WeightMap = std::map<std::string, Weights>;\nusing M = nvinfer1::MatrixOperation;\nusing E = nvinfer1::ElementWiseOperation;\nusing NDCF = nvinfer1::NetworkDefinitionCreationFlag;\n\nstatic Logger gLogger;\n\nstruct ConvParams {\n    int o;\n    int k;\n    int s;\n    int p;\n    int d;\n    int g;\n    float eps = 1e-5f;\n};\n\nstruct InvertedResParams {\n    int inch;\n    int o;\n    int k;\n    int s;\n    int exp;\n};\n\nILayer* addBatchNorm2d(INetworkDefinition* network, WeightMap& weightMap, ITensor& input, const std::string& lname,\n                       float eps) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + 
\".running_var\"].values;\n    auto len = weightMap[lname + \".running_var\"].count;\n    std::cout << lname << \" running_var's len: \" << len << \"\\n\";\n\n    auto* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n\n    auto* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    auto* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* CBR(INetworkDefinition* net, WeightMap& map, const std::string& name, ITensor& input, const ConvParams& cp,\n            int start_index = 0, bool has_relu = true) {\n    Weights bias{DataType::kFLOAT, nullptr, 0};\n\n    // conv -> bn -> relu\n    auto conv_name = name + \".\" + std::to_string(start_index++) + \".weight\";\n    if (map.find(conv_name) == map.end()) {\n        std::cerr << \"KeyError: \" << name << \"is not in weight map\";\n        std::abort();\n    }\n    auto* conv = net->addConvolutionNd(input, cp.o, DimsHW{cp.k, cp.k}, map[conv_name], bias);\n    if (conv == nullptr) {\n        std::cerr << \"build conv layer failed in \" << name;\n        std::abort();\n    }\n    conv->setStrideNd(DimsHW{cp.s, cp.s});\n    conv->setPaddingNd(DimsHW{cp.p, cp.p});\n    conv->setDilationNd(DimsHW{cp.d, cp.d});\n    conv->setNbGroups(cp.g);\n    conv->setName(conv_name.c_str());\n\n    
std::string bn_name = name + \".\" + std::to_string(start_index);\n    auto* bn = addBatchNorm2d(net, map, *conv->getOutput(0), bn_name, cp.eps);\n    if (has_relu) {\n        auto* relu = net->addActivation(*bn->getOutput(0), ActivationType::kRELU);\n        if (relu == nullptr) {\n            std::cerr << \"build relu layer failed in \" << name;\n            std::abort();\n        }\n        return relu;\n    } else {\n        return bn;\n    }\n}\n\nILayer* invertedRes(INetworkDefinition* network, WeightMap& w, ITensor& input, const std::string& lname,\n                    const InvertedResParams& irp) {\n    std::cout << \"Building layer: \" << lname << \"\\n\";\n    static const Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    int midch = irp.inch * irp.exp;\n    auto* conv1 = network->addConvolutionNd(input, midch, DimsHW{1, 1}, w[lname + \"layers.0.weight\"], emptywts);\n    assert(conv1);\n    auto* bn1 = addBatchNorm2d(network, w, *conv1->getOutput(0), lname + \"layers.1\", 1e-5f);\n    auto* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    auto* conv2 = network->addConvolutionNd(*relu1->getOutput(0), midch, DimsHW{irp.k, irp.k},\n                                            w[lname + \"layers.3.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{irp.s, irp.s});\n    conv2->setPaddingNd(DimsHW{irp.k / 2, irp.k / 2});\n    conv2->setNbGroups(midch);\n    auto* bn2 = addBatchNorm2d(network, w, *conv2->getOutput(0), lname + \"layers.4\", 1e-5f);\n    auto* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n    auto* conv3 = network->addConvolutionNd(*relu2->getOutput(0), irp.o, DimsHW{1, 1}, w[lname + \"layers.6.weight\"],\n                                            emptywts);\n    assert(conv3);\n    auto* bn3 = addBatchNorm2d(network, w, *conv3->getOutput(0), lname + \"layers.7\", 1e-5f);\n\n    if (irp.inch == irp.o && irp.s == 1) {\n    
    auto* ew1 = network->addElementWise(*bn3->getOutput(0), input, ElementWiseOperation::kSUM);\n        assert(ew1);\n        return ew1;\n    }\n    return bn3;\n}\n\n// Creat the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IRuntime* runtime, IBuilder* builder, IBuilderConfig* config,\n                          DataType dt) {\n    auto weightMap = loadWeights(WTS_PATH);\n\n#if TRT_VERSION >= 11200\n    auto flag = 1U << static_cast<int>(NDCF::kSTRONGLY_TYPED);\n#elif TRT_VERSION >= 10000\n    auto flag = 0U;\n#else\n    auto flag = 1U << static_cast<int>(NDCF::kEXPLICIT_BATCH);\n#endif\n    auto* network = builder->createNetworkV2(flag);\n\n    ITensor* data{nullptr};\n    if constexpr (TRT_PREPROCESS) {\n        dt = DataType::kUINT8;\n        data = network->addInput(NAMES[0], dt, Dims4{N, INPUT_H, INPUT_W, 3});\n        auto* trans = addTransformLayer(network, *data, true, mean, stdv);\n        data = trans->getOutput(0);\n    } else {\n        data = network->addInput(NAMES[0], dt, Dims4{N, 3, INPUT_H, INPUT_W});\n    }\n    assert(data);\n\n    int start_idx = 0;\n    auto* cbr_0 = CBR(network, weightMap, \"layers\", *data, {16, 3, 2, 1, 1, 1}, start_idx, true);\n    start_idx += 3;\n    auto* cbr_1 = CBR(network, weightMap, \"layers\", *cbr_0->getOutput(0), {16, 3, 1, 1, 1, 16}, start_idx, true);\n    start_idx += 3;\n    auto* cbr_2 = CBR(network, weightMap, \"layers\", *cbr_1->getOutput(0), {8, 1, 1, 1, 1, 1}, start_idx, false);\n\n    ILayer* ir1 = invertedRes(network, weightMap, *cbr_2->getOutput(0), \"layers.8.0.\", {8, 16, 3, 2, 3});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.8.1.\", {16, 16, 3, 1, 3});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.8.2.\", {16, 16, 3, 1, 3});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.9.0.\", {16, 24, 5, 2, 3});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), 
\"layers.9.1.\", {24, 24, 5, 1, 3});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.9.2.\", {24, 24, 5, 1, 3});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.10.0.\", {24, 40, 5, 2, 6});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.10.1.\", {40, 40, 5, 1, 6});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.10.2.\", {40, 40, 5, 1, 6});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.11.0.\", {40, 48, 3, 1, 6});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.11.1.\", {48, 48, 3, 1, 6});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.12.0.\", {48, 96, 5, 2, 6});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.12.1.\", {96, 96, 5, 1, 6});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.12.2.\", {96, 96, 5, 1, 6});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.12.3.\", {96, 96, 5, 1, 6});\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"layers.13.0.\", {96, 160, 3, 1, 6});\n\n    auto* cbr_3 = CBR(network, weightMap, \"layers\", *ir1->getOutput(0), {1280, 1, 1, 0, 1, 1}, 14, true);\n\n    auto* avg = network->addReduce(*cbr_3->getOutput(0), ReduceOperation::kAVG, 0xc, false);\n    auto* _fcw = network->addConstant(DimsHW{1000, 1280}, weightMap[\"classifier.1.weight\"]);\n    auto* _fcb = network->addConstant(DimsHW{1, 1000}, weightMap[\"classifier.1.bias\"]);\n    auto* _fc1 = network->addMatrixMultiply(*avg->getOutput(0), M::kNONE, *_fcw->getOutput(0), M::kTRANSPOSE);\n    auto* fc1 = network->addElementWise(*_fc1->getOutput(0), *_fcb->getOutput(0), E::kSUM);\n    assert(fc1);\n\n    fc1->getOutput(0)->setName(NAMES[1]);\n    network->markOutput(*fc1->getOutput(0));\n\n    // Build engine\n#if TRT_VERSION >= 8000\n    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, WORKSPACE_SIZE);\n    auto* 
_serialized = builder->buildSerializedNetwork(*network, *config);\n    auto* engine = runtime->deserializeCudaEngine(_serialized->data(), _serialized->size());\n    delete _serialized;\n    delete network;\n#else\n    builder->setMaxBatchSize(N);\n    config->setMaxWorkspaceSize(WORKSPACE_SIZE);\n    auto* engine = builder->buildEngineWithConfig(*network, *config);\n    network->destroy();\n#endif\n    std::cout << \"build out\\n\";\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IRuntime* runtime, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, runtime, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n#if TRT_VERSION >= 8000\n    delete engine;\n    delete config;\n    delete builder;\n#else\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n#endif\n}\n\nstd::vector<std::vector<float>> do_inference(IExecutionContext& context, void* input, std::size_t batch_size) {\n    const ICudaEngine& engine = context.getEngine();\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n    std::vector<void*> buffers;\n\n#if TRT_VERSION >= 8000\n    const int32_t nIO = engine.getNbIOTensors();\n#else\n    const int32_t nIO = engine.getNbBindings();\n#endif\n\n    buffers.resize(nIO);\n    for (auto i = 0; i < nIO; ++i) {\n        std::size_t size = 0;\n#if TRT_VERSION >= 8000\n        auto* tensor_name = engine.getIOTensorName(i);\n        auto s = getSize(engine.getTensorDataType(tensor_name));\n        size = s * batch_size * 
SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n        context.setTensorAddress(tensor_name, buffers[i]);\n#else\n        const int32_t idx = engine.getBindingIndex(NAMES[i]);\n        auto s = getSize(engine.getBindingDataType(idx));\n        assert(idx == i);\n        size = s * batch_size * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n#endif\n    }\n\n    // Do not wrap enqueue in assert(): with NDEBUG defined the call would be compiled out\n#if TRT_VERSION >= 8000\n    const bool enqueued = context.enqueueV3(stream);\n#else\n    const bool enqueued = context.enqueueV2(buffers.data(), stream, nullptr);\n#endif\n    if (!enqueued) {\n        std::cerr << \"enqueue failed\\n\";\n        std::abort();\n    }\n\n    std::vector<std::vector<float>> prob;\n    for (int i = 1; i < nIO; ++i) {\n        std::vector<float> tmp(batch_size * SIZES[i], std::nanf(\"\"));\n        std::size_t size = batch_size * SIZES[i] * sizeof(float);\n        CHECK(cudaMemcpyAsync(tmp.data(), buffers[i], size, cudaMemcpyDeviceToHost, stream));\n        prob.emplace_back(tmp);\n    }\n    CHECK(cudaStreamSynchronize(stream));\n\n    cudaStreamDestroy(stream);\n    for (auto i = 0; i < nIO; ++i) {\n        CHECK(cudaFree(buffers[i]));\n    }\n    return prob;\n}\n\nint main(int argc, char** argv) {\n    checkTrtEnv();\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\\n\";\n        std::cerr << \"./mnasnet -s   // serialize model to plan file\\n\";\n        std::cerr << \"./mnasnet -d   // deserialize plan file and run inference\\n\";\n        return -1;\n    }\n\n    auto* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n\n    // create a model using the API directly and serialize it to a stream\n    char* trt_model_stream{nullptr};\n    std::streamsize size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(N, runtime, 
&modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(ENGINE_PATH, std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\\n\";\n            return -1;\n        }\n        if (modelStream->size() > static_cast<std::size_t>(std::numeric_limits<std::streamsize>::max())) {\n            std::cerr << \"this model is too large to serialize\\n\";\n            return -1;\n        }\n        const auto* data_ptr = reinterpret_cast<const char*>(modelStream->data());\n        auto data_size = static_cast<std::streamsize>(modelStream->size());\n        p.write(data_ptr, data_size);\n#if TRT_VERSION >= 8000\n        delete modelStream;\n#else\n        modelStream->destroy();\n#endif\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(ENGINE_PATH, std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trt_model_stream = new char[size];\n            assert(trt_model_stream);\n            file.read(trt_model_stream, size);\n            file.close();\n        }\n    } else {\n        return -1;\n    }\n\n#if TRT_VERSION >= 8000\n    auto* engine = runtime->deserializeCudaEngine(trt_model_stream, size);\n#else\n    auto* engine = runtime->deserializeCudaEngine(trt_model_stream, size, nullptr);\n#endif\n    assert(engine != nullptr);\n    auto* context = engine->createExecutionContext();\n    assert(context != nullptr);\n\n    void* input = nullptr;\n    std::vector<float> flat_img;\n    cv::Mat img;\n    if constexpr (TRT_PREPROCESS) {\n        // for simplicity, resize image on cpu side\n        img = cv::imread(\"../assets/cats.jpg\", cv::IMREAD_COLOR);\n        cv::resize(img, img, cv::Size(INPUT_W, INPUT_H), 0, 0, cv::INTER_LINEAR);\n        input = static_cast<void*>(img.data);\n    } else {\n        img = cv::imread(\"../assets/cats.jpg\", 
cv::IMREAD_COLOR);\n        flat_img = preprocess_img(img, true, mean, stdv, N, INPUT_H, INPUT_W);\n        input = flat_img.data();\n    }\n\n    for (int32_t i = 0; i < 100; ++i) {\n        auto _start = std::chrono::system_clock::now();\n        auto prob = do_inference(*context, input, 1);\n        auto _end = std::chrono::system_clock::now();\n        auto _time = std::chrono::duration_cast<std::chrono::milliseconds>(_end - _start).count();\n        std::cout << \"Execution time: \" << _time << \"ms\\n\";\n\n        for (const auto& vector : prob) {\n            int idx = 0;\n            for (auto v : vector) {\n                std::cout << std::setprecision(4) << v << \", \" << std::flush;\n                if (++idx > 20) {\n                    std::cout << \"\\n====\\n\";\n                    break;\n                }\n            }\n        }\n\n        if (i == 99) {\n            std::cout << \"prediction result:\\n\";\n            auto labels = loadImagenetLabelMap(LABELS_PATH);\n            int _top = 0;\n            for (auto& [idx, logits] : topk(prob[0], 3)) {\n                std::cout << \"Top: \" << _top++ << \" idx: \" << idx << \", logits: \" << logits\n                          << \", label: \" << labels[idx] << \"\\n\";\n            }\n        }\n    }\n\n    delete[] trt_model_stream;\n    return 0;\n}\n"
  },
  {
    "path": "mnasnet/utils.h",
    "content": "#pragma once\n#include <cuda_runtime_api.h>\n#include <algorithm>\n#include <cassert>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\nconstexpr const std::size_t WORKSPACE_SIZE = 16 << 20;\n\n#define CHECK(status)                                     \\\n    do {                                                  \\\n        auto ret = (status);                              \\\n        if (ret != cudaSuccess) {                         \\\n            std::cerr << \"Cuda failure: \" << ret << \"\\n\"; \\\n            std::abort();                                 \\\n        }                                                 \\\n    } while (0)\n\nstatic void checkTrtEnv(int device = 0) {\n#if TRT_VERSION < 8000\n    CHECK(cudaGetDevice(&device));\n    cudaDeviceProp prop{};\n    CHECK(cudaGetDeviceProperties(&prop, device));\n    const int sm = prop.major * 10 + prop.minor;\n    if (sm > 86) {\n        std::cerr << \"TensorRT < 8 does not support SM > 86 on this GPU.\";\n        std::abort();\n    }\n#endif\n}\n\n/**\n * @brief TensorRT weight files have a simple space delimited format:\n * [type] [size] <data x size in hex>\n * \n * @param file input weight file path\n * @return std::map<std::string, nvinfer1::Weights> \n */\nstatic std::map<std::string, nvinfer1::Weights> loadWeights(const std::string& file) {\n    std::cout << \"Loading weights: \" << file << \"\\n\";\n    std::map<std::string, nvinfer1::Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n\n        // Read name and 
type of blob\n        std::string name;\n        input >> name >> std::dec >> wt.count;\n\n        // Load blob\n        auto* val = new uint32_t[wt.count];\n        input >> std::hex;\n        for (auto x = 0ll; x < wt.count; ++x) {\n            input >> val[x];\n        }\n        wt.values = val;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\n/**\n * @brief a preprocess function aligning with ImageNet preprocess in torchvision, only support 3-channel image\n * \n * @param img opencv image with BGR layout\n * @param bgr2rgb whether to convert BGR to RGB\n * @param mean subtract mean\n * @param std divide std\n * @param n batch size\n * @param h resize height\n * @param w resize width\n * @return std::vector<float> contiguous flatten image data in float32 type\n */\nstatic std::vector<float> preprocess_img(cv::Mat& img, bool bgr2rgb, const std::array<const float, 3>& mean,\n                                         const std::array<const float, 3>& std, int n, int h, int w) {\n    const auto c = img.channels();\n    const auto size = c * h * w;\n    if (c != 3) {\n        std::cerr << \"this demo only supports 3 channel input image.\\n\";\n        std::abort();\n    }\n    if (bgr2rgb) {\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    }\n    cv::resize(img, img, cv::Size(w, h), 0, 0, cv::INTER_LINEAR);\n    img.convertTo(img, CV_32FC3, 1.f / 255);\n    img = (img - cv::Scalar(mean[0], mean[1], mean[2])) / cv::Scalar(std[0], std[1], std[2]);\n    std::vector<float> chw(static_cast<std::size_t>(n) * c * h * w, 0.f);\n\n    // fill all batch with the same input image\n    for (int i = 0; i < n; ++i) {\n        for (int y = 0; y < h; ++y) {\n            for (int x = 0; x < w; ++x) {\n                const cv::Vec3f v = img.at<cv::Vec3f>(y, x);\n                chw[i * size + 0 * h * w + y * w + x] = v[0];\n                chw[i * size + 1 * h * w + y * w + x] = v[1];\n                chw[i * size + 2 * h * w + y * w + x] = v[2];\n         
   }\n        }\n    }\n    return chw;\n}\n\nstatic auto topk(const std::vector<float>& v, int k) -> std::vector<std::pair<int, float>> {\n    if (k <= 0)\n        return {};\n    auto stride = std::min<std::ptrdiff_t>(k, static_cast<int64_t>(v.size()));\n\n    std::vector<int> idx(v.size());\n    std::iota(idx.begin(), idx.end(), 0);\n\n    // sort only the first `stride` entries; using k here would run past idx.end() when k > v.size()\n    std::partial_sort(idx.begin(), idx.begin() + stride, idx.end(), [&](int a, int b) { return v[a] > v[b]; });\n\n    std::vector<std::pair<int, float>> out;\n    out.reserve(stride);\n    for (auto i = 0; i < stride; ++i)\n        out.emplace_back(idx[i], v[idx[i]]);\n    return out;\n}\n\nstatic std::map<int, std::string> loadImagenetLabelMap(const std::string& path) {\n    std::map<int, std::string> labels;\n    std::ifstream in(path);\n    if (!in.is_open()) {\n        return labels;\n    }\n    std::string line;\n    while (std::getline(in, line)) {\n        auto colon = line.find(':');\n        if (colon == std::string::npos) {\n            continue;\n        }\n        auto first_quote = line.find('\\'', colon);\n        if (first_quote == std::string::npos) {\n            continue;\n        }\n        auto second_quote = line.find('\\'', first_quote + 1);\n        if (second_quote == std::string::npos) {\n            continue;\n        }\n        int idx = std::stoi(line.substr(0, colon));\n        labels[idx] = line.substr(first_quote + 1, second_quote - first_quote - 1);\n    }\n    return labels;\n}\n\nstatic ILayer* addTransformLayer(INetworkDefinition* network, ITensor& input, bool bgr2rgb,\n                                 const std::array<const float, 3>& mean, const std::array<const float, 3>& std) {\n    struct ScaleParams {\n        std::array<float, 3> shift;\n        std::array<float, 3> scale;\n    };\n    static std::vector<std::unique_ptr<ScaleParams>> gScaleParams;\n    auto params = std::make_unique<ScaleParams>();\n    params->shift = {-mean[0] / std[0], -mean[1] / std[1], -mean[2] / std[2]};\n    
params->scale = {1.f / (std[0] * 255.f), 1.f / (std[1] * 255.f), 1.f / (std[2] * 255.f)};\n\n    static const Weights empty{DataType::kFLOAT, nullptr, 0ll};\n    const Weights shift{DataType::kFLOAT, params->shift.data(), 3ll};\n    const Weights scale{DataType::kFLOAT, params->scale.data(), 3ll};\n\n    gScaleParams.emplace_back(std::move(params));\n\n    ITensor* in = &input;\n    if (input.getType() != DataType::kFLOAT) {\n#if TRT_VERSION >= 8000\n        auto* cast = network->addCast(input, DataType::kFLOAT);\n        assert(cast);\n        cast->setName(\"Cast to FP32\");\n        in = cast->getOutput(0);\n#else\n        auto* identity = network->addIdentity(input);\n        assert(identity);\n        identity->setName(\"Convert to FP32\");\n        identity->setOutputType(0, DataType::kFLOAT);\n        in = identity->getOutput(0);\n#endif\n    }\n    // Convert from NHWC to NCHW\n    auto* perm = network->addShuffle(*in);\n    assert(perm);\n    perm->setName(\"NHWC -> NCHW\");\n    perm->setFirstTranspose(Permutation{0, 3, 1, 2});\n\n    // Convert from BGR to RGB (optional)\n    ITensor* data{nullptr};\n    if (bgr2rgb) {\n        auto add_slice = [&](int c, const char* name) -> ITensor* {\n            auto dims = perm->getOutput(0)->getDimensions();\n            Dims4 start = {0, c, 0, 0}, stride = {1, 1, 1, 1};\n            Dims4 size = {dims.d[0], 1, dims.d[2], dims.d[3]};\n            auto* _slice = network->addSlice(*perm->getOutput(0), start, size, stride);\n            _slice->setName(name);\n            assert(_slice && _slice->getNbOutputs() == 1);\n            return _slice->getOutput(0);\n        };\n        std::array<ITensor*, 3> channels = {add_slice(2, \"R\"), add_slice(1, \"G\"), add_slice(0, \"B\")};\n        auto* cat = network->addConcatenation(channels.data(), 3);\n        assert(cat);\n        cat->setName(\"RGB\");\n        cat->setAxis(1);\n        data = cat->getOutput(0);\n    } else {\n        data = perm->getOutput(0);\n    }\n\n  
  // Normalize\n    auto* trans = network->addScale(*data, ScaleMode::kCHANNEL, shift, scale, empty);\n    assert(trans);\n    trans->setName(\"mean & std\");\n#if TRT_VERSION >= 8000\n    trans->setChannelAxis(1);\n#endif\n    return trans;\n}\n\nstatic size_t getSize(DataType dt) {\n    switch (dt) {\n#if TRT_VERSION >= 8510\n        case DataType::kUINT8:\n#endif\n        case DataType::kINT8:\n            return sizeof(int8_t);\n        case DataType::kFLOAT:\n            return sizeof(float);\n        case DataType::kHALF:\n            return sizeof(int16_t);\n        case DataType::kINT32:\n            return sizeof(int32_t);\n        default: {\n            std::cerr << \"Unsupported data type\\n\";\n            std::abort();\n        }\n    }\n}\n"
  },
  {
    "path": "mobilenet/mobilenetv2/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(mobilenet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nadd_executable(mobilenet ${PROJECT_SOURCE_DIR}/mobilenet_v2.cpp)\ntarget_link_libraries(mobilenet nvinfer)\ntarget_link_libraries(mobilenet cudart)\n\nadd_definitions(-O2 -pthread)\n"
  },
  {
    "path": "mobilenet/mobilenetv2/README.md",
    "content": "# mobilenet v2\n\nMobileNetV2 architecture from\n     \"MobileNetV2: Inverted Residuals and Linear Bottlenecks\" <https://arxiv.org/abs/1801.04381>.\n\nFor the Pytorch implementation, you can refer to [pytorchx/mobilenet](https://github.com/wang-xinyu/pytorchx/tree/master/mobilenet)\n\nFollowing tricks are used in this mobilenet,\n\n- Relu6 is used in mobilenet v2. We use `Relu6(x) = Relu(x) - Relu(x-6)` in tensorrt.\n- Batchnorm layer, implemented by scale layer.\n\n```\n// 1. generate mobilenet.wts from [pytorchx/mobilenet](https://github.com/wang-xinyu/pytorchx/tree/master/mobilenet)\n\n// 2. put mobilenet.wts into tensorrtx/mobilenet\n\n// 3. build and run\n\ncd tensorrtx/mobilenet/mobilenetv2\n\nmkdir build\n\ncd build\n\ncmake ..\n\nmake\n\nsudo ./mobilenet -s   // serialize model to plan file i.e. 'mobilenet.engine'\n\nsudo ./mobilenet -d   // deserialize plan file and run inference\n\n// 4. see if the output is same as pytorchx/mobilenet\n```\n\n### TensorRT Python API\n\n```\n# 1. generate mobilenetv2.wts from [pytorchx/mobilenet](https://github.com/wang-xinyu/pytorchx/tree/master/mobilenet)\n\n# 2. put mobilenetv2.wts into tensorrtx/mobilenet/mobilenetv2\n\n# 3. install Python dependencies (tensorrt/pycuda/numpy)\n\ncd tensorrtx/mobilenet/mobilenetv2\n\npython mobilenet_v2.py -s   // serialize model to plan file i.e. 'mobilenetv2.engine'\npython mobilenet_v2.py -d   // deserialize plan file and run inference\n\n# 4. see if the output is same as pytorchx/mobilenet\n```\n"
  },
  {
    "path": "mobilenet/mobilenetv2/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync() 
{\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  
This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "mobilenet/mobilenetv2/mobilenet_v2.cpp",
    "content": "#include <chrono>\n#include <cmath>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n\n#define CHECK(status)                                          \\\n    do {                                                       \\\n        auto ret = (status);                                   \\\n        if (ret != 0) {                                        \\\n            std::cerr << \"Cuda failure: \" << ret << std::endl; \\\n            abort();                                           \\\n        }                                                      \\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for 
(uint32_t x = 0, y = size; x < y; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\n                            std::string lname, float eps) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    std::cout << \"len \" << len << std::endl;\n\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIElementWiseLayer* convBnRelu(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\n                              int outch, int ksize, int s, int g, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    int p = (ksize - 1) / 2;\n 
   IConvolutionLayer* conv1 =\n            network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[lname + \"0.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n    conv1->setNbGroups(g);\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    shval[0] = -6.0;\n    scval[0] = 1.0;\n    pval[0] = 1.0;\n    Weights shift{DataType::kFLOAT, shval, 1};\n    Weights scale{DataType::kFLOAT, scval, 1};\n    Weights power{DataType::kFLOAT, pval, 1};\n    weightMap[lname + \"cbr.scale\"] = scale;\n    weightMap[lname + \"cbr.shift\"] = shift;\n    weightMap[lname + \"cbr.power\"] = power;\n    IScaleLayer* scale1 = network->addScale(*bn1->getOutput(0), ScaleMode::kUNIFORM, shift, scale, power);\n    assert(scale1);\n\n    IActivationLayer* relu2 = network->addActivation(*scale1->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    IElementWiseLayer* ew1 =\n            network->addElementWise(*relu1->getOutput(0), *relu2->getOutput(0), ElementWiseOperation::kSUB);\n    assert(ew1);\n    return ew1;\n}\n\nILayer* invertedRes(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\n                    std::string lname, int inch, int outch, int s, int exp) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    int hidden = inch * exp;\n    bool use_res_connect = (s == 1 && inch == outch);\n\n    IScaleLayer* bn1 = nullptr;\n    if (exp != 1) {\n        IElementWiseLayer* ew1 = convBnRelu(network, weightMap, input, hidden, 1, 1, 1, lname + \"conv.0.\");\n    
    IElementWiseLayer* ew2 =\n                convBnRelu(network, weightMap, *ew1->getOutput(0), hidden, 3, s, hidden, lname + \"conv.1.\");\n        IConvolutionLayer* conv1 = network->addConvolutionNd(*ew2->getOutput(0), outch, DimsHW{1, 1},\n                                                             weightMap[lname + \"conv.2.weight\"], emptywts);\n        assert(conv1);\n        bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"conv.3\", 1e-5);\n    } else {\n        IElementWiseLayer* ew1 = convBnRelu(network, weightMap, input, hidden, 3, s, hidden, lname + \"conv.0.\");\n        IConvolutionLayer* conv1 = network->addConvolutionNd(*ew1->getOutput(0), outch, DimsHW{1, 1},\n                                                             weightMap[lname + \"conv.1.weight\"], emptywts);\n        assert(conv1);\n        bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"conv.2\", 1e-5);\n    }\n    if (!use_res_connect)\n        return bn1;\n    IElementWiseLayer* ew3 = network->addElementWise(input, *bn1->getOutput(0), ElementWiseOperation::kSUM);\n    assert(ew3);\n    return ew3;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../mobilenet.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    auto ew1 = convBnRelu(network, weightMap, *data, 32, 3, 2, 1, \"features.0.\");\n    ILayer* ir1 = invertedRes(network, weightMap, *ew1->getOutput(0), \"features.1.\", 32, 16, 1, 1);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.2.\", 16, 24, 
2, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.3.\", 24, 24, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.4.\", 24, 32, 2, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.5.\", 32, 32, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.6.\", 32, 32, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.7.\", 32, 64, 2, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.8.\", 64, 64, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.9.\", 64, 64, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.10.\", 64, 64, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.11.\", 64, 96, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.12.\", 96, 96, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.13.\", 96, 96, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.14.\", 96, 160, 2, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.15.\", 160, 160, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.16.\", 160, 160, 1, 6);\n    ir1 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.17.\", 160, 320, 1, 6);\n    IElementWiseLayer* ew2 = convBnRelu(network, weightMap, *ir1->getOutput(0), 1280, 1, 1, 1, \"features.18.\");\n\n    IPoolingLayer* pool1 = network->addPoolingNd(*ew2->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n    assert(pool1);\n\n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool1->getOutput(0), 1000, weightMap[\"classifier.1.weight\"],\n                                                           weightMap[\"classifier.1.bias\"]);\n    assert(fc1);\n\n    fc1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout 
<< \"set name out\" << std::endl;\n    network->markOutput(*fc1->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], 
batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float),\n                          cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost,\n                          stream));\n    CHECK(cudaStreamSynchronize(stream));\n\n    // Release stream and buffers\n    CHECK(cudaStreamDestroy(stream));\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv) {\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./mobilenet -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./mobilenet -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    char* trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"mobilenet.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        // Serialization succeeded, so report success to the caller\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"mobilenet.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size 
= file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        } else {\n            // Without a readable plan file there is nothing to deserialize\n            std::cerr << \"could not open plan file\" << std::endl;\n            return -1;\n        }\n    } else {\n        return -1;\n    }\n\n    // Fill the input with a constant test value (no real image preprocessing is done here)\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference 100 times and print the latency of each run\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 100; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print the raw output values, with a row index after every tenth element\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < OUTPUT_SIZE; i++) {\n        std::cout << prob[i] << \", \";\n        if (i % 10 == 0)\n            std::cout << i / 10 << std::endl;\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "mobilenet/mobilenetv2/mobilenet_v2.py",
    "content": "import os\nimport sys\nimport struct\nimport argparse\n\nimport numpy as np\nimport pycuda.driver as cuda\nimport pycuda.autoinit  # noqa: F401\nimport tensorrt as trt\n\nBATCH_SIZE = 1\nINPUT_H = 224\nINPUT_W = 224\nOUTPUT_SIZE = 1000\nINPUT_BLOB_NAME = \"data\"\nOUTPUT_BLOB_NAME = \"prob\"\nEPS = 1e-5\n\nWEIGHT_PATH = \"./mobilenetv2.wts\"\nENGINE_PATH = \"./mobilenetv2.engine\"\n\nTRT_LOGGER = trt.Logger(trt.Logger.INFO)\n\n\ndef load_weights(file):\n    print(f\"Loading weights: {file}\")\n\n    assert os.path.exists(file), 'Unable to load weight file.'\n\n    weight_map = {}\n    with open(file, \"r\") as f:\n        lines = [line.strip() for line in f]\n    count = int(lines[0])\n    assert count == len(lines) - 1\n    for i in range(1, count + 1):\n        splits = lines[i].split(\" \")\n        name = splits[0]\n        cur_count = int(splits[1])\n        assert cur_count + 2 == len(splits)\n        values = []\n        for j in range(2, len(splits)):\n            # hex string to bytes to float\n            values.append(struct.unpack(\">f\", bytes.fromhex(splits[j])))\n        weight_map[name] = np.array(values, dtype=np.float32)\n\n    return weight_map\n\n\ndef add_batch_norm_2d(network, weight_map, input, layer_name, eps):\n    gamma = weight_map[layer_name + \".weight\"]\n    beta = weight_map[layer_name + \".bias\"]\n    mean = weight_map[layer_name + \".running_mean\"]\n    var = weight_map[layer_name + \".running_var\"]\n    var = np.sqrt(var + eps)\n\n    scale = gamma / var\n    shift = -mean / var * gamma + beta\n    return network.add_scale(input=input,\n                             mode=trt.ScaleMode.CHANNEL,\n                             shift=shift,\n                             scale=scale)\n\n\ndef conv_bn_relu(network, weight_map, input, outch, ksize, s, g, lname):\n    p = (ksize - 1) // 2\n\n    conv1 = network.add_convolution(input=input,\n                                    num_output_maps=outch,\n                       
             kernel_shape=(ksize, ksize),\n                                    kernel=weight_map[lname + \"0.weight\"],\n                                    bias=trt.Weights())\n    assert conv1\n    conv1.stride = (s, s)\n    conv1.padding = (p, p)\n    conv1.num_groups = g\n\n    bn1 = add_batch_norm_2d(network, weight_map, conv1.get_output(0), lname + \"1\", EPS)\n    assert bn1\n\n    relu1 = network.add_activation(bn1.get_output(0), type=trt.ActivationType.RELU)\n    assert relu1\n\n    shift = np.array(-6.0, dtype=np.float32)\n    scale = np.array(1.0, dtype=np.float32)\n    power = np.array(1.0, dtype=np.float32)\n    scale1 = network.add_scale(input=bn1.get_output(0),\n                               mode=trt.ScaleMode.UNIFORM,\n                               shift=shift,\n                               scale=scale,\n                               power=power)\n    assert scale1\n\n    relu2 = network.add_activation(scale1.get_output(0), type=trt.ActivationType.RELU)\n    assert relu2\n\n    ew1 = network.add_elementwise(relu1.get_output(0), relu2.get_output(0), trt.ElementWiseOperation.SUB)\n    assert ew1\n\n    return ew1\n\n\ndef inverted_res(network, weight_map, input, lname, inch, outch, s, exp):\n    hidden = inch * exp\n    use_res_connect = (s == 1 and inch == outch)\n\n    if exp != 1:\n        ew1 = conv_bn_relu(network, weight_map, input, hidden, 1, 1, 1, lname + \"conv.0.\")\n        ew2 = conv_bn_relu(network, weight_map, ew1.get_output(0), hidden, 3, s, hidden, lname + \"conv.1.\")\n        conv1 = network.add_convolution(input=ew2.get_output(0),\n                                        num_output_maps=outch,\n                                        kernel_shape=(1, 1),\n                                        kernel=weight_map[lname + \"conv.2.weight\"],\n                                        bias=trt.Weights())\n        assert conv1\n        bn1 = add_batch_norm_2d(network, weight_map, conv1.get_output(0), lname + \"conv.3\", EPS)\n    
else:\n        ew1 = conv_bn_relu(network, weight_map, input, hidden, 3, s, hidden, lname + \"conv.0.\")\n        conv1 = network.add_convolution(input=ew1.get_output(0),\n                                        num_output_maps=outch,\n                                        kernel_shape=(1, 1),\n                                        kernel=weight_map[lname + \"conv.1.weight\"],\n                                        bias=trt.Weights())\n        assert conv1\n        bn1 = add_batch_norm_2d(network, weight_map, conv1.get_output(0), lname + \"conv.2\", EPS)\n\n    if not use_res_connect:\n        return bn1\n\n    ew3 = network.add_elementwise(input, bn1.get_output(0), trt.ElementWiseOperation.SUM)\n    assert ew3\n\n    return ew3\n\n\ndef create_engine(max_batch_size, builder, config, dt):\n    weight_map = load_weights(WEIGHT_PATH)\n    network = builder.create_network()\n\n    data = network.add_input(INPUT_BLOB_NAME, dt, (3, INPUT_H, INPUT_W))\n    assert data\n\n    ew1 = conv_bn_relu(network, weight_map, data, 32, 3, 2, 1, \"features.0.\")\n    ir1 = inverted_res(network, weight_map, ew1.get_output(0), \"features.1.\", 32, 16, 1, 1)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.2.\", 16, 24, 2, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.3.\", 24, 24, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.4.\", 24, 32, 2, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.5.\", 32, 32, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.6.\", 32, 32, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.7.\", 32, 64, 2, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.8.\", 64, 64, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.9.\", 64, 64, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), 
\"features.10.\", 64, 64, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.11.\", 64, 96, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.12.\", 96, 96, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.13.\", 96, 96, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.14.\", 96, 160, 2, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.15.\", 160, 160, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.16.\", 160, 160, 1, 6)\n    ir1 = inverted_res(network, weight_map, ir1.get_output(0), \"features.17.\", 160, 320, 1, 6)\n    ew2 = conv_bn_relu(network, weight_map, ir1.get_output(0), 1280, 1, 1, 1, \"features.18.\")\n\n    pool1 = network.add_pooling(input=ew2.get_output(0),\n                                type=trt.PoolingType.AVERAGE,\n                                window_size=trt.DimsHW(7, 7))\n    assert pool1\n\n    fc1 = network.add_fully_connected(input=pool1.get_output(0),\n                                      num_outputs=OUTPUT_SIZE,\n                                      kernel=weight_map[\"classifier.1.weight\"],\n                                      bias=weight_map[\"classifier.1.bias\"])\n    assert fc1\n\n    fc1.get_output(0).name = OUTPUT_BLOB_NAME\n    network.mark_output(fc1.get_output(0))\n\n    # Build Engine\n    builder.max_batch_size = max_batch_size\n    builder.max_workspace_size = 1 << 32\n    engine = builder.build_engine(network, config)\n\n    del network\n    del weight_map\n\n    return engine\n\n\ndef API_to_model(max_batch_size):\n    builder = trt.Builder(TRT_LOGGER)\n    config = builder.create_builder_config()\n    engine = create_engine(max_batch_size, builder, config, trt.float32)\n    assert engine\n    with open(ENGINE_PATH, \"wb\") as f:\n        f.write(engine.serialize())\n\n    del engine\n    del builder\n    del config\n\n\nclass 
HostDeviceMem(object):\n    def __init__(self, host_mem, device_mem):\n        self.host = host_mem\n        self.device = device_mem\n\n    def __str__(self):\n        return \"Host:\\n\" + str(self.host) + \"\\nDevice:\\n\" + str(self.device)\n\n    def __repr__(self):\n        return self.__str__()\n\n\ndef allocate_buffers(engine):\n    inputs = []\n    outputs = []\n    bindings = []\n    stream = cuda.Stream()\n    for binding in engine:\n        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n        dtype = trt.nptype(engine.get_binding_dtype(binding))\n        # Allocate host and device buffers\n        host_mem = cuda.pagelocked_empty(size, dtype)\n        device_mem = cuda.mem_alloc(host_mem.nbytes)\n        # Append the device buffer to device bindings.\n        bindings.append(int(device_mem))\n        # Append to the appropriate list.\n        if engine.binding_is_input(binding):\n            inputs.append(HostDeviceMem(host_mem, device_mem))\n        else:\n            outputs.append(HostDeviceMem(host_mem, device_mem))\n    return inputs, outputs, bindings, stream\n\n\ndef do_inference(context, bindings, inputs, outputs, stream, batch_size=1):\n    # Transfer input data to the GPU.\n    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]\n    # Run inference.\n    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)\n    # Transfer predictions back from the GPU.\n    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]\n    # Synchronize the stream\n    stream.synchronize()\n    # Return only the host outputs.\n    return [out.host for out in outputs]\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"-s\", action='store_true')\n    parser.add_argument(\"-d\", action='store_true')\n    args = parser.parse_args()\n\n    if not (args.s ^ args.d):\n        print(\n            \"arguments not 
right!\\n\"\n            \"python mobilenet_v2.py -s   # serialize model to plan file\\n\"\n            \"python mobilenet_v2.py -d   # deserialize plan file and run inference\"\n        )\n        sys.exit()\n\n    if args.s:\n        API_to_model(BATCH_SIZE)\n    else:\n        runtime = trt.Runtime(TRT_LOGGER)\n        assert runtime\n\n        with open(ENGINE_PATH, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        assert engine\n\n        context = engine.create_execution_context()\n        assert context\n\n        data = np.ones((BATCH_SIZE * 3 * INPUT_H * INPUT_W), dtype=np.float32)\n        inputs, outputs, bindings, stream = allocate_buffers(engine)\n        inputs[0].host = data\n\n        trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)\n\n        print(f'Output: \\n{trt_outputs[0][:10]}\\n{trt_outputs[0][-10:]}')\n"
  },
  {
    "path": "mobilenet/mobilenetv3/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(mobilenetv3)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link directories for CUDA and TensorRT; adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nadd_executable(mobilenetv3  ${PROJECT_SOURCE_DIR}/mobilenet_v3.cpp)\ntarget_link_libraries(mobilenetv3 nvinfer)\ntarget_link_libraries(mobilenetv3 cudart)\n\nadd_definitions(-O2 -pthread)\n"
  },
  {
    "path": "mobilenet/mobilenetv3/README.md",
    "content": "# mobilenet v3\n\nMobileNetV3 architecture from\n     \"Searching for MobileNetV3\" <https://arxiv.org/abs/1905.02244?context=cs>.\n\nFor the PyTorch implementation, refer to [mobilenetv3.pytorch](https://github.com/chufei1995/mobilenetv3.pytorch)\n\n## Run\n\n1. generate mbv3_small.wts/mbv3_large.wts from the PyTorch implementation\n\n2. put mbv3_small.wts/mbv3_large.wts into tensorrtx/mobilenet/mobilenetv3\n\n3. build and run\n\n```\ncd tensorrtx/mobilenet/mobilenetv3\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./mobilenetv3 -s small(or large)  // serialize model to plan file, e.g. 'mobilenetv3_small.engine'\nsudo ./mobilenetv3 -d small(or large)  // deserialize plan file and run inference\n```\n\n4. check that the output matches the PyTorch side\n\n### TensorRT Python API\n\n```\n# 1. generate mobilenetv3.wts from [mobilenetv3.pytorch](https://github.com/chufei1995/mobilenetv3.pytorch)\n\n# 2. put mobilenetv3.wts into tensorrtx/mobilenet/mobilenetv3\n\n# 3. install Python dependencies (tensorrt/pycuda/numpy)\n\ncd tensorrtx/mobilenet/mobilenetv3\n\npython mobilenet_v3.py -s small(or large)  // serialize model to plan file\npython mobilenet_v3.py -d small(or large)  // deserialize plan file and run inference\n\n```\n"
  },
  {
    "path": "mobilenet/mobilenetv3/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync() 
{\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  
This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "mobilenet/mobilenetv3/mobilenet_v3.cpp",
    "content": "#include <chrono>\n#include <cmath>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n\n#define CHECK(status)                                          \\\n    do {                                                       \\\n        auto ret = (status);                                   \\\n        if (ret != 0) {                                        \\\n            std::cerr << \"Cuda failure: \" << ret << std::endl; \\\n            abort();                                           \\\n        }                                                      \\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\nstatic const int BS = 1;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = 
reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\n                          std::string lname, float eps) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    std::cout << \"len \" << len << std::endl;\n\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* hSwish(INetworkDefinition* network, ITensor& input, std::string name) {\n    auto hsig = network->addActivation(input, ActivationType::kHARD_SIGMOID);\n    assert(hsig);\n    hsig->setAlpha(1.0 / 6.0);\n    
hsig->setBeta(0.5);\n    ILayer* hsw = network->addElementWise(input, *hsig->getOutput(0), ElementWiseOperation::kPROD);\n    assert(hsw);\n    return hsw;\n}\n\nILayer* convBnHswish(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch,\n                     int ksize, int s, int g, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    int p = (ksize - 1) / 2;\n    IConvolutionLayer* conv1 =\n            network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[lname + \"0.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n    conv1->setNbGroups(g);\n\n    IScaleLayer* bn1 = addBatchNorm(network, weightMap, *conv1->getOutput(0), lname + \"1\", 1e-5);\n    ILayer* hsw = hSwish(network, *bn1->getOutput(0), lname + \"2\");\n    assert(hsw);\n    return hsw;\n}\n\nILayer* seLayer(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c, int w,\n                std::string lname) {\n    int h = w;\n    IPoolingLayer* l1 = network->addPoolingNd(input, PoolingType::kAVERAGE, DimsHW(w, h));\n    assert(l1);\n    l1->setStrideNd(DimsHW{w, h});\n    IFullyConnectedLayer* l2 = network->addFullyConnected(\n            *l1->getOutput(0), BS * c / 4, weightMap[lname + \"fc.0.weight\"], weightMap[lname + \"fc.0.bias\"]);\n    IActivationLayer* relu1 = network->addActivation(*l2->getOutput(0), ActivationType::kRELU);\n    IFullyConnectedLayer* l4 = network->addFullyConnected(\n            *relu1->getOutput(0), BS * c, weightMap[lname + \"fc.2.weight\"], weightMap[lname + \"fc.2.bias\"]);\n\n    auto hsig = network->addActivation(*l4->getOutput(0), ActivationType::kHARD_SIGMOID);\n    assert(hsig);\n    hsig->setAlpha(1.0 / 6.0);\n    hsig->setBeta(0.5);\n\n    ILayer* se = network->addElementWise(input, *hsig->getOutput(0), ElementWiseOperation::kPROD);\n    assert(se);\n    return 
se;\n}\n\nILayer* convSeq1(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int output,\n                 int hdim, int k, int s, bool use_se, bool use_hs, int w, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    int p = (k - 1) / 2;\n    IConvolutionLayer* conv1 =\n            network->addConvolutionNd(input, hdim, DimsHW{k, k}, weightMap[lname + \"0.weight\"], emptywts);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n    conv1->setNbGroups(hdim);\n\n    IScaleLayer* bn1 = addBatchNorm(network, weightMap, *conv1->getOutput(0), lname + \"1\", 1e-5);\n    ITensor *tensor3, *tensor4;\n    tensor3 = nullptr;\n    tensor4 = nullptr;\n    if (use_hs) {\n        ILayer* hsw = hSwish(network, *bn1->getOutput(0), lname + \"2\");\n        tensor3 = hsw->getOutput(0);\n    } else {\n        IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n        tensor3 = relu1->getOutput(0);\n    }\n    if (use_se) {\n        ILayer* se1 = seLayer(network, weightMap, *tensor3, hdim, w, lname + \"3.\");\n        tensor4 = se1->getOutput(0);\n    } else {\n        tensor4 = tensor3;\n    }\n    IConvolutionLayer* conv2 =\n            network->addConvolutionNd(*tensor4, output, DimsHW{1, 1}, weightMap[lname + \"4.weight\"], emptywts);\n    IScaleLayer* bn2 = addBatchNorm(network, weightMap, *conv2->getOutput(0), lname + \"5\", 1e-5);\n    assert(bn2);\n    return bn2;\n}\n\nILayer* convSeq2(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int output,\n                 int hdim, int k, int s, bool use_se, bool use_hs, int w, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    int p = (k - 1) / 2;\n    IConvolutionLayer* conv1 =\n            network->addConvolutionNd(input, hdim, DimsHW{1, 1}, weightMap[lname + \"0.weight\"], emptywts);\n    IScaleLayer* bn1 = addBatchNorm(network, 
weightMap, *conv1->getOutput(0), lname + \"1\", 1e-5);\n    ITensor *tensor3, *tensor6, *tensor7;\n    tensor3 = nullptr;\n    tensor6 = nullptr;\n    tensor7 = nullptr;\n    if (use_hs) {\n        ILayer* hsw1 = hSwish(network, *bn1->getOutput(0), lname + \"2\");\n        tensor3 = hsw1->getOutput(0);\n    } else {\n        IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n        tensor3 = relu1->getOutput(0);\n    }\n    IConvolutionLayer* conv2 =\n            network->addConvolutionNd(*tensor3, hdim, DimsHW{k, k}, weightMap[lname + \"3.weight\"], emptywts);\n    conv2->setStrideNd(DimsHW{s, s});\n    conv2->setPaddingNd(DimsHW{p, p});\n    conv2->setNbGroups(hdim);\n    IScaleLayer* bn2 = addBatchNorm(network, weightMap, *conv2->getOutput(0), lname + \"4\", 1e-5);\n    if (use_se) {\n        ILayer* se1 = seLayer(network, weightMap, *bn2->getOutput(0), hdim, w, lname + \"5.\");\n        tensor6 = se1->getOutput(0);\n    } else {\n        tensor6 = bn2->getOutput(0);\n    }\n    if (use_hs) {\n        ILayer* hsw2 = hSwish(network, *tensor6, lname + \"6\");\n        tensor7 = hsw2->getOutput(0);\n    } else {\n        IActivationLayer* relu2 = network->addActivation(*tensor6, ActivationType::kRELU);\n        tensor7 = relu2->getOutput(0);\n    }\n    IConvolutionLayer* conv3 =\n            network->addConvolutionNd(*tensor7, output, DimsHW{1, 1}, weightMap[lname + \"7.weight\"], emptywts);\n    IScaleLayer* bn3 = addBatchNorm(network, weightMap, *conv3->getOutput(0), lname + \"8\", 1e-5);\n    assert(bn3);\n    return bn3;\n}\n\nILayer* invertedRes(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\n                    std::string lname, int inch, int outch, int s, int hidden, int k, bool use_se, bool use_hs, int w) {\n    bool use_res_connect = (s == 1 && inch == outch);\n    ILayer* conv = nullptr;\n    if (inch == hidden) {\n        conv = convSeq1(network, weightMap, input, 
outch, hidden, k, s, use_se, use_hs, w, lname + \"conv.\");\n    } else {\n        conv = convSeq2(network, weightMap, input, outch, hidden, k, s, use_se, use_hs, w, lname + \"conv.\");\n    }\n\n    if (!use_res_connect)\n        return conv;\n    IElementWiseLayer* ew3 = network->addElementWise(input, *conv->getOutput(0), ElementWiseOperation::kSUM);\n    assert(ew3);\n    return ew3;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngineSmall(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../mbv3_small.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    //auto test1 = network->addActivation(*data, ActivationType::kRELU);\n    auto ew1 = convBnHswish(network, weightMap, *data, 16, 3, 2, 1, \"features.0.\");\n    auto ir1 = invertedRes(network, weightMap, *ew1->getOutput(0), \"features.1.\", 16, 16, 2, 16, 3, 1, 0, 56);\n    auto ir2 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.2.\", 16, 24, 2, 72, 3, 0, 0, 28);\n    auto ir3 = invertedRes(network, weightMap, *ir2->getOutput(0), \"features.3.\", 24, 24, 1, 88, 3, 0, 0, 28);\n    auto ir4 = invertedRes(network, weightMap, *ir3->getOutput(0), \"features.4.\", 24, 40, 2, 96, 5, 1, 1, 14);\n    auto ir5 = invertedRes(network, weightMap, *ir4->getOutput(0), \"features.5.\", 40, 40, 1, 240, 5, 1, 1, 14);\n    auto ir6 = invertedRes(network, weightMap, *ir5->getOutput(0), \"features.6.\", 40, 40, 1, 240, 5, 1, 1, 14);\n    auto ir7 = invertedRes(network, weightMap, *ir6->getOutput(0), \"features.7.\", 40, 48, 1, 120, 5, 1, 1, 14);\n    auto ir8 = invertedRes(network, weightMap, *ir7->getOutput(0), \"features.8.\", 48, 48, 1, 144, 5, 1, 1, 14);\n    auto ir9 = 
invertedRes(network, weightMap, *ir8->getOutput(0), \"features.9.\", 48, 96, 2, 288, 5, 1, 1, 7);\n    auto ir10 = invertedRes(network, weightMap, *ir9->getOutput(0), \"features.10.\", 96, 96, 1, 576, 5, 1, 1, 7);\n    auto ir11 = invertedRes(network, weightMap, *ir10->getOutput(0), \"features.11.\", 96, 96, 1, 576, 5, 1, 1, 7);\n    ILayer* ew2 = convBnHswish(network, weightMap, *ir11->getOutput(0), 576, 1, 1, 1, \"conv.0.\");\n    ILayer* se1 = seLayer(network, weightMap, *ew2->getOutput(0), 576, 7, \"conv.1.\");\n\n    IPoolingLayer* pool1 = network->addPoolingNd(*se1->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{7, 7});\n    ILayer* sw1 = hSwish(network, *pool1->getOutput(0), \"hSwish.0\");\n\n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*sw1->getOutput(0), 1280, weightMap[\"classifier.0.weight\"],\n                                                           weightMap[\"classifier.0.bias\"]);\n    assert(fc1);\n    ILayer* bn1 = addBatchNorm(network, weightMap, *fc1->getOutput(0), \"classifier.1\", 1e-5);\n    ILayer* sw2 = hSwish(network, *bn1->getOutput(0), \"hSwish.1\");\n    IFullyConnectedLayer* fc2 = network->addFullyConnected(*sw2->getOutput(0), 1000, weightMap[\"classifier.3.weight\"],\n                                                           weightMap[\"classifier.3.bias\"]);\n    ILayer* bn2 = addBatchNorm(network, weightMap, *fc2->getOutput(0), \"classifier.4\", 1e-5);\n    ILayer* sw3 = hSwish(network, *bn2->getOutput(0), \"hSwish.2\");\n\n    sw3->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*sw3->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    
network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nICudaEngine* createEngineLarge(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../mbv3_large.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    //auto test1 = network->addActivation(*data, ActivationType::kRELU);\n    auto ew1 = convBnHswish(network, weightMap, *data, 16, 3, 2, 1, \"features.0.\");\n    auto ir1 = invertedRes(network, weightMap, *ew1->getOutput(0), \"features.1.\", 16, 16, 1, 16, 3, 0, 0, 112);\n    auto ir2 = invertedRes(network, weightMap, *ir1->getOutput(0), \"features.2.\", 16, 24, 2, 64, 3, 0, 0, 56);\n    auto ir3 = invertedRes(network, weightMap, *ir2->getOutput(0), \"features.3.\", 24, 24, 1, 72, 3, 0, 0, 56);\n    auto ir4 = invertedRes(network, weightMap, *ir3->getOutput(0), \"features.4.\", 24, 40, 2, 72, 5, 1, 0, 28);\n    auto ir5 = invertedRes(network, weightMap, *ir4->getOutput(0), \"features.5.\", 40, 40, 1, 120, 5, 1, 0, 28);\n    auto ir6 = invertedRes(network, weightMap, *ir5->getOutput(0), \"features.6.\", 40, 40, 1, 120, 5, 1, 0, 28);\n    auto ir7 = invertedRes(network, weightMap, *ir6->getOutput(0), \"features.7.\", 40, 80, 2, 240, 3, 0, 1, 14);\n    auto ir8 = invertedRes(network, weightMap, *ir7->getOutput(0), \"features.8.\", 80, 80, 1, 200, 3, 0, 1, 14);\n    auto ir9 = invertedRes(network, weightMap, *ir8->getOutput(0), \"features.9.\", 80, 80, 1, 184, 3, 0, 1, 14);\n    auto ir10 = invertedRes(network, weightMap, *ir9->getOutput(0), \"features.10.\", 80, 80, 1, 184, 3, 0, 1, 14);\n    auto ir11 = invertedRes(network, weightMap, *ir10->getOutput(0), \"features.11.\", 80, 
112, 1, 480, 3, 1, 1, 14);\n    auto ir12 = invertedRes(network, weightMap, *ir11->getOutput(0), \"features.12.\", 112, 112, 1, 672, 3, 1, 1, 14);\n    auto ir13 = invertedRes(network, weightMap, *ir12->getOutput(0), \"features.13.\", 112, 160, 1, 672, 5, 1, 1, 14);\n    auto ir14 = invertedRes(network, weightMap, *ir13->getOutput(0), \"features.14.\", 160, 160, 2, 672, 5, 1, 1, 7);\n    auto ir15 = invertedRes(network, weightMap, *ir14->getOutput(0), \"features.15.\", 160, 160, 1, 960, 5, 1, 1, 7);\n    ILayer* ew2 = convBnHswish(network, weightMap, *ir15->getOutput(0), 960, 1, 1, 1, \"conv.0.\");\n\n    IPoolingLayer* pool1 = network->addPoolingNd(*ew2->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{7, 7});\n    ILayer* sw1 = hSwish(network, *pool1->getOutput(0), \"hSwish.0\");\n\n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*sw1->getOutput(0), 1280, weightMap[\"classifier.0.weight\"],\n                                                           weightMap[\"classifier.0.bias\"]);\n    assert(fc1);\n    ILayer* sw2 = hSwish(network, *fc1->getOutput(0), \"hSwish.1\");\n    IFullyConnectedLayer* fc2 = network->addFullyConnected(*sw2->getOutput(0), 1000, weightMap[\"classifier.3.weight\"],\n                                                           weightMap[\"classifier.3.bias\"]);\n\n    fc2->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*fc2->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int 
maxBatchSize, IHostMemory** modelStream, std::string mode) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine.\n    // Initialized to nullptr so that an unrecognized mode trips the assert below instead of reading garbage.\n    ICudaEngine* engine = nullptr;\n\n    if (mode == \"small\") {\n        std::cout << \"create engine small\" << std::endl;\n        engine = createEngineSmall(maxBatchSize, builder, config, DataType::kFLOAT);\n    } else if (mode == \"large\") {\n        engine = createEngineLarge(maxBatchSize, builder, config, DataType::kFLOAT);\n    }\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, 
batchSize * 3 * INPUT_H * INPUT_W * sizeof(float),\n                          cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost,\n                          stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv) {\n    if (argc != 3) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./mobilenet -s small  // serialize small model to plan file\" << std::endl;\n        std::cerr << \"./mobilenet -s large  // serialize large model to plan file\" << std::endl;\n        std::cerr << \"./mobilenet -d small  // deserialize small model plan file and run inference\" << std::endl;\n        std::cerr << \"./mobilenet -d large  // deserialize large model plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    char* trtModelStream{nullptr};\n    size_t size{0};\n    std::string mode = std::string(argv[2]);\n    std::cout << mode << std::endl;\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream, mode);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"mobilenetv3_\" + mode + \".engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        // Serialization succeeded, so report success to the shell\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"mobilenetv3_\" + mode + \".engine\", std::ios::binary);\n        
if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        } else {\n            // Without a readable plan file there is nothing to deserialize\n            std::cerr << \"could not open plan file\" << std::endl;\n            return -1;\n        }\n    } else {\n        return -1;\n    }\n\n    // Fill the input buffer with dummy data (all ones); no real image is loaded here\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 10; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print the raw output values\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < OUTPUT_SIZE; i++) {\n        std::cout << prob[i] << \", \";\n        //if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "mobilenet/mobilenetv3/mobilenet_v3.py",
    "content": "import os\nimport sys\nimport struct\nimport argparse\n\nimport numpy as np\nimport pycuda.driver as cuda\nimport pycuda.autoinit  # noqa: F401\nimport tensorrt as trt\n\nBATCH_SIZE = 1\nINPUT_H = 224\nINPUT_W = 224\nOUTPUT_SIZE = 1000\nBS = 1\nINPUT_BLOB_NAME = \"data\"\nOUTPUT_BLOB_NAME = \"prob\"\nEPS = 1e-5\n\nWEIGHT_PATH_SMALL = \"./mobilenetv3.wts\"\nENGINE_PATH = \"./mobilenetv3.engine\"\n\nTRT_LOGGER = trt.Logger(trt.Logger.INFO)\n\n\ndef load_weights(file):\n    print(f\"Loading weights: {file}\")\n\n    assert os.path.exists(file), 'Unable to load weight file.'\n\n    weight_map = {}\n    with open(file, \"r\") as f:\n        lines = [line.strip() for line in f]\n    count = int(lines[0])\n    assert count == len(lines) - 1\n    for i in range(1, count + 1):\n        splits = lines[i].split(\" \")\n        name = splits[0]\n        cur_count = int(splits[1])\n        assert cur_count + 2 == len(splits)\n        values = []\n        for j in range(2, len(splits)):\n            # hex string to bytes to float\n            values.append(struct.unpack(\">f\", bytes.fromhex(splits[j])))\n        weight_map[name] = np.array(values, dtype=np.float32)\n\n    return weight_map\n\n\ndef add_batch_norm_2d(network, weight_map, input, layer_name, eps):\n    gamma = weight_map[layer_name + \".weight\"]\n    beta = weight_map[layer_name + \".bias\"]\n    mean = weight_map[layer_name + \".running_mean\"]\n    var = weight_map[layer_name + \".running_var\"]\n    var = np.sqrt(var + eps)\n\n    scale = gamma / var\n    shift = -mean / var * gamma + beta\n    return network.add_scale(input=input,\n                             mode=trt.ScaleMode.CHANNEL,\n                             shift=shift,\n                             scale=scale)\n\n\ndef add_h_swish(network, input):\n    h_sig = network.add_activation(input, type=trt.ActivationType.HARD_SIGMOID)\n    assert h_sig\n    h_sig.alpha = 1.0 / 6.0\n    h_sig.beta = 0.5\n    hsw = 
network.add_elementwise(input, h_sig.get_output(0), trt.ElementWiseOperation.PROD)\n    assert hsw\n\n    return hsw\n\n\ndef conv_bn_h_swish(network, weight_map, input, outch, ksize, s, g, lname):\n    p = (ksize - 1) // 2\n    conv1 = network.add_convolution(input=input,\n                                    num_output_maps=outch,\n                                    kernel_shape=(ksize, ksize),\n                                    kernel=weight_map[lname + \"0.weight\"],\n                                    bias=trt.Weights()\n                                    )\n    assert conv1\n    conv1.stride = (s, s)\n    conv1.padding = (p, p)\n    conv1.num_groups = g\n\n    bn1 = add_batch_norm_2d(network, weight_map, conv1.get_output(0), lname + \"1\", EPS)\n    hsw = add_h_swish(network, bn1.get_output(0))\n    assert hsw\n\n    return hsw\n\n\ndef add_se_layer(network, weight_map, input, c, w, lname):\n    h = w\n    l1 = network.add_pooling(input=input,\n                             type=trt.PoolingType.AVERAGE,\n                             window_size=trt.DimsHW(w, h))\n    assert l1\n    l1.stride_nd = (w, h)\n\n    l2 = network.add_fully_connected(input=l1.get_output(0),\n                                     num_outputs=BS * c // 4,\n                                     kernel=weight_map[lname + \"fc.0.weight\"],\n                                     bias=weight_map[lname + \"fc.0.bias\"])\n    relu1 = network.add_activation(l2.get_output(0), type=trt.ActivationType.RELU)\n    l4 = network.add_fully_connected(input=relu1.get_output(0),\n                                     num_outputs=BS * c,\n                                     kernel=weight_map[lname + \"fc.2.weight\"],\n                                     bias=weight_map[lname + \"fc.2.bias\"])\n\n    # Gate the block input with hard-sigmoid(fc output), matching seLayer() in mobilenet_v3.cpp\n    h_sig = network.add_activation(l4.get_output(0), type=trt.ActivationType.HARD_SIGMOID)\n    assert h_sig\n    h_sig.alpha = 1.0 / 6.0\n    h_sig.beta = 0.5\n\n    se = network.add_elementwise(input, h_sig.get_output(0), trt.ElementWiseOperation.PROD)\n    assert se\n\n    return se\n\n\ndef conv_seq_1(network, weight_map, input, output, hdim, k, s, use_se, use_hs, w, lname):\n    p = (k - 1) // 2\n    conv1 = 
network.add_convolution(input=input,\n                                    num_output_maps=hdim,\n                                    kernel_shape=(k, k),\n                                    kernel=weight_map[lname + \"0.weight\"],\n                                    bias=trt.Weights())\n    assert conv1\n    conv1.stride = (s, s)\n    conv1.padding = (p, p)\n    conv1.num_groups = hdim\n\n    bn1 = add_batch_norm_2d(network, weight_map, conv1.get_output(0), lname + \"1\", EPS)\n\n    if use_hs:\n        hsw = add_h_swish(network, bn1.get_output(0))\n        tensor3 = hsw.get_output(0)\n    else:\n        relu1 = network.add_activation(bn1.get_output(0), type=trt.ActivationType.RELU)\n        tensor3 = relu1.get_output(0)\n\n    if use_se:\n        se1 = add_se_layer(network, weight_map, tensor3, hdim, w, lname + \"3.\")\n        tensor4 = se1.get_output(0)\n    else:\n        tensor4 = tensor3\n\n    conv2 = network.add_convolution(input=tensor4,\n                                    num_output_maps=output,\n                                    kernel_shape=(1, 1),\n                                    kernel=weight_map[lname + \"4.weight\"],\n                                    bias=trt.Weights())\n    bn2 = add_batch_norm_2d(network, weight_map, conv2.get_output(0), lname + \"5\", EPS)\n    assert bn2\n\n    return bn2\n\n\ndef conv_seq_2(network, weight_map, input, output, hdim, k, s, use_se, use_hs, w, lname):\n    p = (k - 1) // 2\n    conv1 = network.add_convolution(input=input,\n                                    num_output_maps=hdim,\n                                    kernel_shape=(1, 1),\n                                    kernel=weight_map[lname + \"0.weight\"],\n                                    bias=trt.Weights())\n    bn1 = add_batch_norm_2d(network, weight_map, conv1.get_output(0), lname + \"1\", EPS)\n\n    if use_hs:\n        hsw1 = add_h_swish(network, bn1.get_output(0))\n        tensor3 = hsw1.get_output(0)\n    else:\n        relu1 = 
network.add_activation(bn1.get_output(0), type=trt.ActivationType.RELU)\n        tensor3 = relu1.get_output(0)\n\n    conv2 = network.add_convolution(input=tensor3,\n                                    num_output_maps=hdim,\n                                    kernel_shape=(k, k),\n                                    kernel=weight_map[lname + \"3.weight\"],\n                                    bias=trt.Weights())\n    conv2.stride = (s, s)\n    conv2.padding = (p, p)\n    conv2.num_groups = hdim\n    bn2 = add_batch_norm_2d(network, weight_map, conv2.get_output(0), lname + \"4\", EPS)\n\n    if use_se:\n        se1 = add_se_layer(network, weight_map, bn2.get_output(0), hdim, w, lname + \"5.\")\n        tensor6 = se1.get_output(0)\n    else:\n        tensor6 = bn2.get_output(0)\n\n    if use_hs:\n        hsw2 = add_h_swish(network, tensor6)\n        tensor7 = hsw2.get_output(0)\n    else:\n        relu2 = network.add_activation(tensor6, type=trt.ActivationType.RELU)\n        tensor7 = relu2.get_output(0)\n\n    conv3 = network.add_convolution(input=tensor7,\n                                    num_output_maps=output,\n                                    kernel_shape=(1, 1),\n                                    kernel=weight_map[lname + \"7.weight\"],\n                                    bias=trt.Weights())\n    bn3 = add_batch_norm_2d(network, weight_map, conv3.get_output(0), lname + \"8\", EPS)\n    assert bn3\n\n    return bn3\n\n\ndef inverted_res(network, weight_map, input, lname, inch, outch, s, hidden, k, use_se, use_hs, w):\n    use_res_connect = (s == 1 and inch == outch)\n\n    if inch == hidden:\n        conv = conv_seq_1(network, weight_map, input, outch, hidden, k, s, use_se, use_hs, w, lname + \"conv.\")\n    else:\n        conv = conv_seq_2(network, weight_map, input, outch, hidden, k, s, use_se, use_hs, w, lname + \"conv.\")\n\n    if not use_res_connect:\n        return conv\n\n    ew3 = network.add_elementwise(input, conv.get_output(0), 
trt.ElementWiseOperation.SUM)\n    assert ew3\n\n    return ew3\n\n\ndef create_engine_small(max_batch_size, builder, config, dt):\n    weight_map = load_weights(WEIGHT_PATH_SMALL)\n    network = builder.create_network()\n\n    data = network.add_input(INPUT_BLOB_NAME, dt, (3, INPUT_H, INPUT_W))\n    assert data\n\n    ew1 = conv_bn_h_swish(network, weight_map, data, 16, 3, 2, 1, \"features.0.\")\n    ir1 = inverted_res(network, weight_map, ew1.get_output(0), \"features.1.\", 16, 16, 2, 16, 3, 1, 0, 56)\n    ir2 = inverted_res(network, weight_map, ir1.get_output(0), \"features.2.\", 16, 24, 2, 72, 3, 0, 0, 28)\n    ir3 = inverted_res(network, weight_map, ir2.get_output(0), \"features.3.\", 24, 24, 1, 88, 3, 0, 0, 28)\n    ir4 = inverted_res(network, weight_map, ir3.get_output(0), \"features.4.\", 24, 40, 2, 96, 5, 1, 1, 14)\n    ir5 = inverted_res(network, weight_map, ir4.get_output(0), \"features.5.\", 40, 40, 1, 240, 5, 1, 1, 14)\n    ir6 = inverted_res(network, weight_map, ir5.get_output(0), \"features.6.\", 40, 40, 1, 240, 5, 1, 1, 14)\n    ir7 = inverted_res(network, weight_map, ir6.get_output(0), \"features.7.\", 40, 48, 1, 120, 5, 1, 1, 14)\n    ir8 = inverted_res(network, weight_map, ir7.get_output(0), \"features.8.\", 48, 48, 1, 144, 5, 1, 1, 14)\n    ir9 = inverted_res(network, weight_map, ir8.get_output(0), \"features.9.\", 48, 96, 2, 288, 5, 1, 1, 7)\n    ir10 = inverted_res(network, weight_map, ir9.get_output(0), \"features.10.\", 96, 96, 1, 576, 5, 1, 1, 7)\n    ir11 = inverted_res(network, weight_map, ir10.get_output(0), \"features.11.\", 96, 96, 1, 576, 5, 1, 1, 7)\n    ew2 = conv_bn_h_swish(network, weight_map, ir11.get_output(0), 576, 1, 1, 1, \"conv.0.\")\n    se1 = add_se_layer(network, weight_map, ew2.get_output(0), 576, 7, \"conv.1.\")\n\n    pool1 = network.add_pooling(input=se1.get_output(0),\n                                type=trt.PoolingType.AVERAGE,\n                                window_size=trt.DimsHW(7, 7))\n    assert pool1\n    
pool1.stride_nd = (7, 7)\n    sw1 = add_h_swish(network, pool1.get_output(0))\n\n    fc1 = network.add_fully_connected(input=sw1.get_output(0),\n                                      num_outputs=1280,\n                                      kernel=weight_map[\"classifier.0.weight\"],\n                                      bias=weight_map[\"classifier.0.bias\"])\n    assert fc1\n    bn1 = add_batch_norm_2d(network, weight_map, fc1.get_output(0), \"classifier.1\", EPS)\n    sw2 = add_h_swish(network, bn1.get_output(0))\n\n    fc2 = network.add_fully_connected(input=sw2.get_output(0),\n                                      num_outputs=OUTPUT_SIZE,\n                                      kernel=weight_map[\"classifier.3.weight\"],\n                                      bias=weight_map[\"classifier.3.bias\"])\n    bn2 = add_batch_norm_2d(network, weight_map, fc2.get_output(0), \"classifier.4\", EPS)\n    sw3 = add_h_swish(network, bn2.get_output(0))\n\n    sw3.get_output(0).name = OUTPUT_BLOB_NAME\n    network.mark_output(sw3.get_output(0))\n\n    # Build Engine\n    builder.max_batch_size = max_batch_size\n    builder.max_workspace_size = 1 << 20\n    engine = builder.build_engine(network, config)\n\n    del network\n    del weight_map\n\n    return engine\n\n\ndef create_engine_large(max_batch_size, builder, config, dt):\n    weight_map = load_weights(WEIGHT_PATH_SMALL)\n    network = builder.create_network()\n\n    data = network.add_input(INPUT_BLOB_NAME, dt, (3, INPUT_H, INPUT_W))\n    assert data\n\n    ew1 = conv_bn_h_swish(network, weight_map, data, 16, 3, 2, 1, \"features.0.\")\n    ir1 = inverted_res(network, weight_map, ew1.get_output(0), \"features.1.\", 16, 16, 1, 16, 3, 0, 0, 112)\n    ir2 = inverted_res(network, weight_map, ir1.get_output(0), \"features.2.\", 16, 24, 2, 64, 3, 0, 0, 56)\n    ir3 = inverted_res(network, weight_map, ir2.get_output(0), \"features.3.\", 24, 24, 1, 72, 3, 0, 0, 56)\n    ir4 = inverted_res(network, weight_map, ir3.get_output(0), 
\"features.4.\", 24, 40, 2, 72, 5, 1, 0, 28)\n    ir5 = inverted_res(network, weight_map, ir4.get_output(0), \"features.5.\", 40, 40, 1, 120, 5, 1, 0, 28)\n    ir6 = inverted_res(network, weight_map, ir5.get_output(0), \"features.6.\", 40, 40, 1, 120, 5, 1, 0, 28)\n    ir7 = inverted_res(network, weight_map, ir6.get_output(0), \"features.7.\", 40, 80, 2, 240, 3, 0, 1, 14)\n    ir8 = inverted_res(network, weight_map, ir7.get_output(0), \"features.8.\", 80, 80, 1, 200, 3, 0, 1, 14)\n    ir9 = inverted_res(network, weight_map, ir8.get_output(0), \"features.9.\", 80, 80, 1, 184, 3, 0, 1, 14)\n    ir10 = inverted_res(network, weight_map, ir9.get_output(0), \"features.10.\", 80, 80, 1, 184, 3, 0, 1, 14)\n    ir11 = inverted_res(network, weight_map, ir10.get_output(0), \"features.11.\", 80, 112, 1, 480, 3, 1, 1, 14)\n    ir12 = inverted_res(network, weight_map, ir11.get_output(0), \"features.12.\", 112, 112, 1, 672, 3, 1, 1, 14)\n    ir13 = inverted_res(network, weight_map, ir12.get_output(0), \"features.13.\", 112, 160, 1, 672, 5, 1, 1, 14)\n    ir14 = inverted_res(network, weight_map, ir13.get_output(0), \"features.14.\", 160, 160, 2, 672, 5, 1, 1, 7)\n    ir15 = inverted_res(network, weight_map, ir14.get_output(0), \"features.15.\", 160, 160, 1, 960, 5, 1, 1, 7)\n    ew2 = conv_bn_h_swish(network, weight_map, ir15.get_output(0), 960, 1, 1, 1, \"conv.0.\")\n\n    pool1 = network.add_pooling(input=ew2.get_output(0),\n                                type=trt.PoolingType.AVERAGE,\n                                window_size=trt.DimsHW(7, 7))\n    assert pool1\n    pool1.stride_nd = (7, 7)\n    sw1 = add_h_swish(network, pool1.get_output(0))\n\n    fc1 = network.add_fully_connected(input=sw1.get_output(0),\n                                      num_outputs=1280,\n                                      kernel=weight_map[\"classifier.0.weight\"],\n                                      bias=weight_map[\"classifier.0.bias\"])\n    assert fc1\n    sw2 = add_h_swish(network, 
fc1.get_output(0))\n\n    fc2 = network.add_fully_connected(input=sw2.get_output(0),\n                                      num_outputs=OUTPUT_SIZE,\n                                      kernel=weight_map[\"classifier.3.weight\"],\n                                      bias=weight_map[\"classifier.3.bias\"])\n\n    fc2.get_output(0).name = OUTPUT_BLOB_NAME\n    network.mark_output(fc2.get_output(0))\n\n    # Build Engine\n    builder.max_batch_size = max_batch_size\n    builder.max_workspace_size = 1 << 20\n    engine = builder.build_engine(network, config)\n\n    del network\n    del weight_map\n\n    return engine\n\n\ndef API_to_model(max_batch_size, model_type):\n    builder = trt.Builder(TRT_LOGGER)\n    config = builder.create_builder_config()\n    if model_type == \"small\":\n        engine = create_engine_small(max_batch_size, builder, config, trt.float32)\n        assert engine\n    else:\n        engine = create_engine_large(max_batch_size, builder, config, trt.float32)\n        assert engine\n\n    with open(ENGINE_PATH, \"wb\") as f:\n        f.write(engine.serialize())\n\n    del engine\n    del builder\n    del config\n\n\nclass HostDeviceMem(object):\n    def __init__(self, host_mem, device_mem):\n        self.host = host_mem\n        self.device = device_mem\n\n    def __str__(self):\n        return \"Host:\\n\" + str(self.host) + \"\\nDevice:\\n\" + str(self.device)\n\n    def __repr__(self):\n        return self.__str__()\n\n\ndef allocate_buffers(engine):\n    inputs = []\n    outputs = []\n    bindings = []\n    stream = cuda.Stream()\n    for binding in engine:\n        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n        dtype = trt.nptype(engine.get_binding_dtype(binding))\n        # Allocate host and device buffers\n        host_mem = cuda.pagelocked_empty(size, dtype)\n        device_mem = cuda.mem_alloc(host_mem.nbytes)\n        # Append the device buffer to device bindings.\n        
bindings.append(int(device_mem))\n        # Append to the appropriate list.\n        if engine.binding_is_input(binding):\n            inputs.append(HostDeviceMem(host_mem, device_mem))\n        else:\n            outputs.append(HostDeviceMem(host_mem, device_mem))\n    return inputs, outputs, bindings, stream\n\n\ndef do_inference(context, bindings, inputs, outputs, stream, batch_size=1):\n    # Transfer input data to the GPU.\n    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]\n    # Run inference.\n    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)\n    # Transfer predictions back from the GPU.\n    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]\n    # Synchronize the stream\n    stream.synchronize()\n    # Return only the host outputs.\n    return [out.host for out in outputs]\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"-s\", action='store_true')\n    parser.add_argument(\"-d\", action='store_true')\n    parser.add_argument(\"-t\", help='indicate small or large model')\n    args = parser.parse_args()\n\n    if not (args.s ^ args.d):\n        print(\n            \"arguments not right!\\n\"\n            \"python mobilenet_v2.py -s   # serialize model to plan file\\n\"\n            \"python mobilenet_v2.py -d   # deserialize plan file and run inference\"\n        )\n        sys.exit()\n\n    if args.s:\n        API_to_model(BATCH_SIZE, args.t)\n    else:\n        runtime = trt.Runtime(TRT_LOGGER)\n        assert runtime\n\n        with open(ENGINE_PATH, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        assert engine\n\n        context = engine.create_execution_context()\n        assert context\n\n        data = np.ones((BATCH_SIZE * 3 * INPUT_H * INPUT_W), dtype=np.float32)\n        inputs, outputs, bindings, stream = allocate_buffers(engine)\n        inputs[0].host = 
data\n\n        trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)\n\n        print(f'Output: \\n{trt_outputs[0][:10]}\\n{trt_outputs[0][-10:]}')\n"
  },
  {
    "path": "psenet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(PSENet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\n\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nfile(GLOB SOURCE_FILES \"*.h\" \"*.cpp\")\n\nadd_executable(psenet ${SOURCE_FILES})\ntarget_link_libraries(psenet nvinfer)\ntarget_link_libraries(psenet cudart)\ntarget_link_libraries(psenet ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "psenet/README.md",
    "content": "# PSENet\n\n**preprocessing + inference + postprocessing = 30ms** with fp32 on Tesla P40. \nThe original Tensorflow implementation is [tensorflow_PSENet](https://github.com/liuheng92/tensorflow_PSENet). A TensorRT Python api implementation is [TensorRT-Python-PSENet](https://github.com/upczww/TensorRT-Python-PSENet).\n\n## Key Features\n- Generating `.wts` from `Tensorflow`.\n- Dynamic batch and dynamic shape input.\n- Object-Oriented Programming.\n- Practice with C++ 11.\n\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/105487078-821d6800-5cea-11eb-87dc-e3317a941763.jpeg\">\n</p>\n\n## How to Run\n\n* 1. generate .wts\n\n  Download pretrained model from https://github.com/liuheng92/tensorflow_PSENet\n  and put `model.ckpt.*` to `model` dir. Add a file `model/checkpoint` with content\n    ```\n    model_checkpoint_path: \"model.ckpt\"\n    all_model_checkpoint_paths: \"model.ckpt\"\n    ```\n    Then run\n\n    ```\n    python gen_tf_wts.py\n    ```\n    which will gengerate a `psenet.wts`.\n* 2. cmake and make\n\n  ```\n  mkdir build\n  cd build\n  cmake ..\n  make\n  ```\n* 3. build engine and run detection\n  ```\n  cp ../psenet.wts ./\n  cp ../test.jpg ./\n  ./psenet -s  // serialize model to plan file\n  ./psenet -d  // deserialize plan file and run inference\n  ```\n\n## Known Issues\nNone\n\n## Todo\n\n* use `ExponentialMovingAverage` weight.\n"
  },
  {
    "path": "psenet/gen_tf_wts.py",
    "content": "from sys import prefix\r\nimport tensorflow as tf\r\nfrom tensorflow.python import pywrap_tensorflow\r\nimport numpy as np\r\nimport struct\r\n\r\nmodel_dir = \"model\"\r\n\r\nckpt = tf.train.get_checkpoint_state(model_dir)\r\nckpt_path = ckpt.model_checkpoint_path\r\n\r\nreader = pywrap_tensorflow.NewCheckpointReader(ckpt_path)\r\nparam_dict = reader.get_variable_to_shape_map()\r\n\r\n\r\nf = open(r\"psenet.wts\", \"w\")\r\nkeys = param_dict.keys()\r\nf.write(\"{}\\n\".format(len(keys)))\r\n\r\nfor key in keys:\r\n    weight = reader.get_tensor(key)\r\n    print(key, weight.shape)\r\n    if len(weight.shape) == 4:\r\n        weight = np.transpose(weight, (3, 2, 0, 1))\r\n        print(weight.shape)\r\n    weight = np.reshape(weight, -1)\r\n    f.write(\"{} {} \".format(key, len(weight)))\r\n    for w in weight:\r\n        f.write(\" \")\r\n        f.write(struct.pack(\">f\", float(w)).hex())\r\n    f.write(\"\\n\")"
  },
  {
    "path": "psenet/layers.cpp",
    "content": "#include \"layers.h\"\r\n\r\nIScaleLayer* addBatchNorm2d(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps)\r\n{\r\n    float* gamma = (float*)weightMap[lname + \"gamma\"].values; // scale\r\n    float* beta = (float*)weightMap[lname + \"beta\"].values;   // offset\r\n    float* mean = (float*)weightMap[lname + \"moving_mean\"].values;\r\n    float* var = (float*)weightMap[lname + \"moving_variance\"].values;\r\n    int len = weightMap[lname + \"moving_variance\"].count;\r\n\r\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (auto i = 0; i < len; i++)\r\n    {\r\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights scale{ DataType::kFLOAT, scval, len };\r\n\r\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (auto i = 0; i < len; i++)\r\n    {\r\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights shift{ DataType::kFLOAT, shval, len };\r\n\r\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (auto i = 0; i < len; i++)\r\n    {\r\n        pval[i] = 1.0;\r\n    }\r\n    Weights power{ DataType::kFLOAT, pval, len };\r\n\r\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\r\n    assert(scale_1);\r\n    return scale_1;\r\n}\r\n\r\nIActivationLayer* bottleneck(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int ch, int stride, std::string lname, int branch_type)\r\n{\r\n\r\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\r\n\r\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, ch, DimsHW{ 1, 1 }, weightMap[lname + \"conv1/weights\"], emptywts);\r\n    assert(conv1);\r\n\r\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"conv1/BatchNorm/\", 1e-5);\r\n    assert(bn1);\r\n\r\n    
IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\r\n    assert(relu1);\r\n\r\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), ch, DimsHW{ 3, 3 }, weightMap[lname + \"conv2/weights\"], emptywts);\r\n    conv2->setStrideNd(DimsHW{ stride, stride });\r\n    conv2->setPaddingNd(DimsHW{ 1, 1 });\r\n    assert(conv2);\r\n\r\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"conv2/BatchNorm/\", 1e-5);\r\n    assert(bn2);\r\n\r\n    IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\r\n    assert(relu2);\r\n\r\n    IConvolutionLayer* conv3 = network->addConvolutionNd(*relu2->getOutput(0), ch * 4, DimsHW{ 1, 1 }, weightMap[lname + \"conv3/weights\"], emptywts);\r\n    assert(conv3);\r\n\r\n    IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"conv3/BatchNorm/\", 1e-5);\r\n    assert(bn3);\r\n    IElementWiseLayer* ew1;\r\n    // branch_type 0:shortcut,1:conv+bn+shortcut,2:maxpool+shortcut\r\n    if (branch_type == 0)\r\n    {\r\n        ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\r\n        assert(ew1);\r\n    }\r\n    else if (branch_type == 1)\r\n    {\r\n        IConvolutionLayer* conv4 = network->addConvolutionNd(input, ch * 4, DimsHW{ 1, 1 }, weightMap[lname + \"shortcut/weights\"], emptywts);\r\n        conv4->setStrideNd(DimsHW{ stride, stride });\r\n        assert(conv4);\r\n        IScaleLayer* bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \"shortcut/BatchNorm/\", 1e-5);\r\n        assert(bn4);\r\n        ew1 = network->addElementWise(*bn4->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\r\n        assert(ew1);\r\n    }\r\n    else\r\n    {\r\n        IPoolingLayer* pool = network->addPoolingNd(input, PoolingType::kMAX, DimsHW{ 1, 1 });\r\n        pool->setStrideNd(DimsHW{ 2, 2 });\r\n        
assert(pool);\r\n        ew1 = network->addElementWise(*pool->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\r\n        assert(ew1);\r\n    }\r\n    IActivationLayer* relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\r\n    assert(relu3);\r\n    return relu3;\r\n}\r\n\r\nIActivationLayer* addConvRelu(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int kernel, int stride, std::string lname)\r\n{\r\n    // Use the outch parameter instead of a hard-coded channel count\r\n    IConvolutionLayer* conv = network->addConvolutionNd(input, outch, DimsHW{ kernel, kernel }, weightMap[lname + \"weights\"], weightMap[lname + \"biases\"]);\r\n    assert(conv);\r\n    conv->setStrideNd(DimsHW{ stride, stride });\r\n    if (kernel == 3)\r\n    {\r\n        conv->setPaddingNd(DimsHW{ 1, 1 });\r\n    }\r\n\r\n    IActivationLayer* ac = network->addActivation(*conv->getOutput(0), ActivationType::kRELU);\r\n    assert(ac);\r\n    return ac;\r\n}"
  },
  {
    "path": "psenet/layers.h",
    "content": "#ifndef TENSORRTX_LAYERS_H\n#define TENSORRTX_LAYERS_H\n\n#include <map>\n#include <math.h>\n#include <assert.h>\n\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\nusing namespace nvinfer1;\n\nIScaleLayer *addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, std::string lname, float eps);\n\nIActivationLayer *bottleneck(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, int ch, int stride, std::string lname, int branch_type);\n\nIActivationLayer *addConvRelu(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, int outch, int kernel, int stride, std::string lname);\n\n#endif\n"
  },
  {
    "path": "psenet/main.cpp",
    "content": "#include \"psenet.h\"\r\n\r\nint main(int argc, char** argv)\r\n{\r\n    PSENet psenet(1200, 640, 0.90, 6, 4);\r\n\r\n    if (argc == 2 && std::string(argv[1]) == \"-s\")\r\n    {\r\n        std::cout << \"Serializling Engine\" << std::endl;\r\n        psenet.serializeEngine();\r\n        return 0;\r\n    }\r\n    else if (argc == 2 && std::string(argv[1]) == \"-d\")\r\n    {\r\n        psenet.init();\r\n        std::vector<std::string> files;\r\n        for (int i = 0; i < 10; i++)\r\n            files.emplace_back(\"test.jpg\");\r\n        for (auto file : files)\r\n        {\r\n            std::cout << \"Detect \" << file << std::endl;\r\n            psenet.detect(file);\r\n        }\r\n\r\n        return 0;\r\n    }\r\n    else\r\n    {\r\n        std::cerr << \"arguments not right!\" << std::endl;\r\n        std::cerr << \"./psenet -s  // serialize model to plan file\" << std::endl;\r\n        std::cerr << \"./psenet -d  // deserialize plan file and run inference\" << std::endl;\r\n        return -1;\r\n    }\r\n}\r\n"
  },
  {
    "path": "psenet/psenet.cpp",
    "content": "#include \"psenet.h\"\r\n#include <string>\r\n#include <queue>\r\n#define MAX_INPUT_SIZE 1200\r\n#define MIN_INPUT_SIZE 128\r\n#define OPT_INPUT_W 640\r\n#define OPT_INPUT_H 640\r\n\r\nPSENet::PSENet(int max_side_len, int min_side_len, float threshold, int num_kernel, int stride) : max_side_len_(max_side_len), min_side_len_(min_side_len),\r\npost_threshold_(threshold),\r\nnum_kernels_(num_kernel),\r\nstride_(stride)\r\n{\r\n}\r\n\r\nPSENet::~PSENet()\r\n{\r\n}\r\n\r\n// create the engine using only the API and not any parser.\r\nICudaEngine* PSENet::createEngine(IBuilder* builder, IBuilderConfig* config)\r\n{\r\n    std::map<std::string, Weights> weightMap = loadWeights(\"./psenet.wts\");\r\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\r\n    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);\r\n    INetworkDefinition* network = builder->createNetworkV2(explicitBatch);\r\n\r\n    ITensor* data = network->addInput(input_name_, dt, Dims4{ -1, 3, -1, -1 });\r\n    assert(data);\r\n\r\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{ 7, 7 }, weightMap[\"resnet_v1_50/conv1/weights\"], emptywts);\r\n    conv1->setStrideNd(DimsHW{ 2, 2 });\r\n    conv1->setPaddingNd(DimsHW{ 3, 3 });\r\n    assert(conv1);\r\n\r\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"resnet_v1_50/conv1/BatchNorm/\", 1e-5);\r\n    assert(bn1);\r\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\r\n    assert(relu1);\r\n\r\n    // C2\r\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{ 3, 3 });\r\n    pool1->setStrideNd(DimsHW{ 2, 2 });\r\n    pool1->setPrePadding(DimsHW{ 0, 0 });\r\n    pool1->setPostPadding(DimsHW{ 1, 1 });\r\n    assert(pool1);\r\n\r\n    IActivationLayer* x;\r\n\r\n    x = bottleneck(network, weightMap, *pool1->getOutput(0), 64, 1, 
\"resnet_v1_50/block1/unit_1/bottleneck_v1/\", 1);\r\n    x = bottleneck(network, weightMap, *x->getOutput(0), 64, 1, \"resnet_v1_50/block1/unit_2/bottleneck_v1/\", 0);\r\n    // C3\r\n    IActivationLayer* block1 = bottleneck(network, weightMap, *x->getOutput(0), 64, 2, \"resnet_v1_50/block1/unit_3/bottleneck_v1/\", 2);\r\n\r\n    x = bottleneck(network, weightMap, *block1->getOutput(0), 128, 1, \"resnet_v1_50/block2/unit_1/bottleneck_v1/\", 1);\r\n    x = bottleneck(network, weightMap, *x->getOutput(0), 128, 1, \"resnet_v1_50/block2/unit_2/bottleneck_v1/\", 0);\r\n    x = bottleneck(network, weightMap, *x->getOutput(0), 128, 1, \"resnet_v1_50/block2/unit_3/bottleneck_v1/\", 0);\r\n    // C4\r\n    IActivationLayer* block2 = bottleneck(network, weightMap, *x->getOutput(0), 128, 2, \"resnet_v1_50/block2/unit_4/bottleneck_v1/\", 2);\r\n\r\n    x = bottleneck(network, weightMap, *block2->getOutput(0), 256, 1, \"resnet_v1_50/block3/unit_1/bottleneck_v1/\", 1);\r\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 1, \"resnet_v1_50/block3/unit_2/bottleneck_v1/\", 0);\r\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 1, \"resnet_v1_50/block3/unit_3/bottleneck_v1/\", 0);\r\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 1, \"resnet_v1_50/block3/unit_4/bottleneck_v1/\", 0);\r\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 1, \"resnet_v1_50/block3/unit_5/bottleneck_v1/\", 0);\r\n    IActivationLayer* block3 = bottleneck(network, weightMap, *x->getOutput(0), 256, 2, \"resnet_v1_50/block3/unit_6/bottleneck_v1/\", 2);\r\n\r\n    x = bottleneck(network, weightMap, *block3->getOutput(0), 512, 1, \"resnet_v1_50/block4/unit_1/bottleneck_v1/\", 1);\r\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 1, \"resnet_v1_50/block4/unit_2/bottleneck_v1/\", 0);\r\n    // C5\r\n    IActivationLayer* block4 = bottleneck(network, weightMap, *x->getOutput(0), 512, 1, \"resnet_v1_50/block4/unit_3/bottleneck_v1/\", 0);\r\n\r\n 
   IActivationLayer* build_p5_r1 = addConvRelu(network, weightMap, *block4->getOutput(0), 256, 1, 1, \"build_feature_pyramid/build_P5/\");\r\n    assert(build_p5_r1);\r\n    IActivationLayer* build_p4_r1 = addConvRelu(network, weightMap, *block2->getOutput(0), 256, 1, 1, \"build_feature_pyramid/build_P4/reduce_dimension/\");\r\n    assert(build_p4_r1);\r\n\r\n    IResizeLayer* bfp_layer4_resize = network->addResize(*build_p5_r1->getOutput(0));\r\n    auto build_p4_r1_shape = network->addShape(*build_p4_r1->getOutput(0))->getOutput(0);\r\n    bfp_layer4_resize->setInput(1, *build_p4_r1_shape);\r\n    bfp_layer4_resize->setResizeMode(ResizeMode::kNEAREST);\r\n    bfp_layer4_resize->setAlignCorners(false);\r\n    assert(bfp_layer4_resize);\r\n\r\n    IElementWiseLayer* bfp_add = network->addElementWise(*bfp_layer4_resize->getOutput(0), *build_p4_r1->getOutput(0), ElementWiseOperation::kSUM);\r\n    assert(bfp_add);\r\n\r\n    IActivationLayer* build_p4_r2 = addConvRelu(network, weightMap, *bfp_add->getOutput(0), 256, 3, 1, \"build_feature_pyramid/build_P4/avoid_aliasing/\");\r\n    assert(build_p4_r2);\r\n\r\n    IActivationLayer* build_p3_r1 = addConvRelu(network, weightMap, *block1->getOutput(0), 256, 1, 1, \"build_feature_pyramid/build_P3/reduce_dimension/\");\r\n    assert(build_p3_r1);\r\n\r\n    IResizeLayer* bfp_layer3_resize = network->addResize(*build_p4_r2->getOutput(0));\r\n    bfp_layer3_resize->setResizeMode(ResizeMode::kNEAREST);\r\n    auto build_p3_r1_shape = network->addShape(*build_p3_r1->getOutput(0))->getOutput(0);\r\n    bfp_layer3_resize->setInput(1, *build_p3_r1_shape);\r\n    bfp_layer3_resize->setAlignCorners(false);\r\n    assert(bfp_layer3_resize);\r\n    IElementWiseLayer* bfp_add1 = network->addElementWise(*bfp_layer3_resize->getOutput(0), *build_p3_r1->getOutput(0), ElementWiseOperation::kSUM);\r\n    assert(bfp_add1);\r\n\r\n    IActivationLayer* build_p3_r2 = addConvRelu(network, weightMap, *bfp_add1->getOutput(0), 256, 3, 1, 
\"build_feature_pyramid/build_P3/avoid_aliasing/\");\r\n    assert(build_p3_r2);\r\n\r\n    IActivationLayer* build_p2_r1 = addConvRelu(network, weightMap, *pool1->getOutput(0), 256, 1, 1, \"build_feature_pyramid/build_P2/reduce_dimension/\");\r\n    assert(build_p2_r1);\r\n    IResizeLayer* bfp_layer2_resize = network->addResize(*build_p3_r2->getOutput(0));\r\n    bfp_layer2_resize->setResizeMode(ResizeMode::kNEAREST);\r\n    auto build_p2_r1_shape = network->addShape(*build_p2_r1->getOutput(0))->getOutput(0);\r\n    bfp_layer2_resize->setInput(1, *build_p2_r1_shape);\r\n    bfp_layer2_resize->setAlignCorners(false);\r\n    assert(bfp_layer2_resize);\r\n    IElementWiseLayer* bfp_add2 = network->addElementWise(*bfp_layer2_resize->getOutput(0), *build_p2_r1->getOutput(0), ElementWiseOperation::kSUM);\r\n    assert(bfp_add2);\r\n\r\n    // P2\r\n    IActivationLayer* build_p2_r2 = addConvRelu(network, weightMap, *bfp_add2->getOutput(0), 256, 3, 1, \"build_feature_pyramid/build_P2/avoid_aliasing/\");\r\n    assert(build_p2_r2);\r\n    auto build_p2_r2_shape = network->addShape(*build_p2_r2->getOutput(0))->getOutput(0);\r\n    // P3 x2\r\n    IResizeLayer* layer1_resize = network->addResize(*build_p3_r2->getOutput(0));\r\n    layer1_resize->setResizeMode(ResizeMode::kLINEAR);\r\n    layer1_resize->setInput(1, *build_p2_r2_shape);\r\n    layer1_resize->setAlignCorners(false);\r\n    assert(layer1_resize);\r\n\r\n    // P4 x4\r\n    IResizeLayer* layer2_resize = network->addResize(*build_p4_r2->getOutput(0));\r\n    layer2_resize->setResizeMode(ResizeMode::kLINEAR);\r\n    layer2_resize->setInput(1, *build_p2_r2_shape);\r\n    layer2_resize->setAlignCorners(false);\r\n    assert(layer2_resize);\r\n\r\n    // P5 x8\r\n    IResizeLayer* layer3_resize = network->addResize(*build_p5_r1->getOutput(0));\r\n    layer3_resize->setResizeMode(ResizeMode::kLINEAR);\r\n    layer3_resize->setInput(1, *build_p2_r2_shape);\r\n    layer3_resize->setAlignCorners(false);\r\n    
assert(layer3_resize);\r\n\r\n    // C(P5,P4,P3,P2)\r\n    ITensor* inputTensors[] = { layer3_resize->getOutput(0), layer2_resize->getOutput(0), layer1_resize->getOutput(0), build_p2_r2->getOutput(0) };\r\n\r\n    IConcatenationLayer* concat = network->addConcatenation(inputTensors, 4);\r\n    assert(concat);\r\n\r\n    IConvolutionLayer* feature_result_conv = network->addConvolutionNd(*concat->getOutput(0), 256, DimsHW{ 3, 3 }, weightMap[\"feature_results/Conv/weights\"], emptywts);\r\n    feature_result_conv->setPaddingNd(DimsHW{ 1, 1 });\r\n    assert(feature_result_conv);\r\n\r\n    IScaleLayer* feature_result_bn = addBatchNorm2d(network, weightMap, *feature_result_conv->getOutput(0), \"feature_results/Conv/BatchNorm/\", 1e-5);\r\n    assert(feature_result_bn);\r\n\r\n    IActivationLayer* feature_result_relu = network->addActivation(*feature_result_bn->getOutput(0), ActivationType::kRELU);\r\n    assert(feature_result_relu);\r\n    IConvolutionLayer* feature_result_conv_1 = network->addConvolutionNd(*feature_result_relu->getOutput(0), 6, DimsHW{ 1, 1 }, weightMap[\"feature_results/Conv_1/weights\"], weightMap[\"feature_results/Conv_1/biases\"]);\r\n    assert(feature_result_conv_1);\r\n\r\n    IActivationLayer* sigmoid = network->addActivation(*feature_result_conv_1->getOutput(0), ActivationType::kSIGMOID);\r\n    assert(sigmoid);\r\n\r\n    sigmoid->getOutput(0)->setName(output_name_);\r\n    std::cout << \"Set name out\" << std::endl;\r\n    network->markOutput(*sigmoid->getOutput(0));\r\n\r\n    // Set profile\r\n    IOptimizationProfile* profile = builder->createOptimizationProfile();\r\n    profile->setDimensions(input_name_, OptProfileSelector::kMIN, Dims4(1, 3, MIN_INPUT_SIZE, MIN_INPUT_SIZE));\r\n    profile->setDimensions(input_name_, OptProfileSelector::kOPT, Dims4(1, 3, OPT_INPUT_H, OPT_INPUT_W));\r\n    profile->setDimensions(input_name_, OptProfileSelector::kMAX, Dims4(1, 3, MAX_INPUT_SIZE, MAX_INPUT_SIZE));\r\n    
config->addOptimizationProfile(profile);\r\n\r\n    // Build engine\r\n    config->setMaxWorkspaceSize(1 << 30); // 1G\r\n#ifdef USE_FP16\r\n    config->setFlag(BuilderFlag::kFP16);\r\n#endif\r\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\r\n    ;\r\n    std::cout << \"Build out\" << std::endl;\r\n\r\n    // Don't need the network any more\r\n    network->destroy();\r\n\r\n    // Release host memory\r\n    for (auto& mem : weightMap)\r\n    {\r\n        free((void*)(mem.second.values));\r\n    }\r\n    return engine;\r\n}\r\n\r\nvoid PSENet::serializeEngine()\r\n{\r\n    // Create builder\r\n    IBuilder* builder = createInferBuilder(gLogger);\r\n    IBuilderConfig* config = builder->createBuilderConfig();\r\n    // Create model to populate the network, then set the outputs and create an engine\r\n    ICudaEngine* engine = createEngine(builder, config);\r\n    assert(engine != nullptr);\r\n\r\n    // Serialize the engine\r\n    IHostMemory* modelStream{ nullptr };\r\n    modelStream = engine->serialize();\r\n    assert(modelStream != nullptr);\r\n\r\n    std::ofstream p(\"./psenet.engine\", std::ios::binary | std::ios::out);\r\n    if (!p)\r\n    {\r\n        std::cerr << \"Could not open plan output file\" << std::endl;\r\n        return;\r\n    }\r\n    p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\r\n\r\n    return;\r\n}\r\n\r\nvoid PSENet::deserializeEngine()\r\n{\r\n    std::ifstream file(\"./psenet.engine\", std::ios::binary | std::ios::in);\r\n    if (file.good())\r\n    {\r\n        file.seekg(0, file.end);\r\n        size_t size = file.tellg();\r\n        file.seekg(0, file.beg);\r\n        char* trtModelStream = new char[size];\r\n        assert(trtModelStream);\r\n        file.read(trtModelStream, size);\r\n        file.close();\r\n        mCudaEngine = std::shared_ptr<nvinfer1::ICudaEngine>(mRuntime->deserializeCudaEngine(trtModelStream, size), InferDeleter());\r\n        
assert(mCudaEngine != nullptr);\r\n    }\r\n}\r\n\r\nvoid PSENet::inferenceOnce(IExecutionContext& context, float* input, float* output, int input_h, int input_w)\r\n{\r\n    const ICudaEngine& engine = context.getEngine();\r\n    // Pointers to input and output device buffers to pass to engine.\r\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\r\n    assert(engine.getNbBindings() == 2);\r\n    void* buffers[2];\r\n\r\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\r\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\r\n    const int inputIndex = engine.getBindingIndex(input_name_);\r\n    const int outputIndex = engine.getBindingIndex(output_name_);\r\n\r\n    context.setBindingDimensions(inputIndex, Dims4(1, 3, input_h, input_w));\r\n\r\n    int input_size = 3 * input_h * input_w * sizeof(float);\r\n    int output_size = input_h * input_w * 6 / 16 * sizeof(float);\r\n\r\n    // Create GPU buffers on device\r\n    CHECK(cudaMalloc(&buffers[inputIndex], input_size));\r\n    CHECK(cudaMalloc(&buffers[outputIndex], output_size));\r\n\r\n    // Create stream\r\n    cudaStream_t stream;\r\n    CHECK(cudaStreamCreate(&stream));\r\n\r\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\r\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, input_size, cudaMemcpyHostToDevice, stream));\r\n    context.enqueueV2(buffers, stream, nullptr);\r\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], output_size, cudaMemcpyDeviceToHost, stream));\r\n    cudaStreamSynchronize(stream);\r\n\r\n    // Release stream and buffers\r\n    cudaStreamDestroy(stream);\r\n    CHECK(cudaFree(buffers[inputIndex]));\r\n    CHECK(cudaFree(buffers[outputIndex]));\r\n}\r\n\r\nvoid PSENet::init()\r\n{\r\n    mRuntime = std::shared_ptr<nvinfer1::IRuntime>(createInferRuntime(gLogger), InferDeleter());\r\n    assert(mRuntime != 
nullptr);\r\n\r\n    std::cout << \"Deserialize Engine\" << std::endl;\r\n    deserializeEngine();\r\n\r\n    mContext = std::shared_ptr<nvinfer1::IExecutionContext>(mCudaEngine->createExecutionContext(), InferDeleter());\r\n    assert(mContext != nullptr);\r\n\r\n    mContext->setOptimizationProfile(0);\r\n\r\n    std::cout << \"Finished init\" << std::endl;\r\n}\r\nvoid PSENet::detect(std::string image_path)\r\n{\r\n    // Run inference\r\n    cv::Mat image = cv::imread(image_path);\r\n    int resize_h, resize_w;\r\n    float ratio_h, ratio_w;\r\n\r\n    auto start = std::chrono::system_clock::now();\r\n\r\n    float* input = preProcess(image, resize_h, resize_w, ratio_h, ratio_w);\r\n    float* output = new float[resize_h * resize_w * 6 / 16];\r\n\r\n    inferenceOnce(*mContext, input, output, resize_h, resize_w);\r\n\r\n    std::vector<cv::RotatedRect> boxes = postProcess(output, resize_h, resize_w);\r\n    drawRects(image, boxes, stride_, ratio_h, ratio_w, 1.0);\r\n    auto end = std::chrono::system_clock::now();\r\n\r\n    cv::imwrite(\"result_\" + image_path, image);\r\n\r\n    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\r\n    // Both buffers were allocated with new[], so release them with delete[]\r\n    delete[] input;\r\n    delete[] output;\r\n}\r\n\r\nfloat* PSENet::preProcess(cv::Mat image, int& resize_h, int& resize_w, float& ratio_h, float& ratio_w)\r\n{\r\n    cv::Mat imageRGB;\r\n    cv::cvtColor(image, imageRGB, cv::COLOR_BGR2RGB);\r\n    cv::Mat imageProcessed;\r\n    int h = imageRGB.size().height;\r\n    int w = imageRGB.size().width;\r\n    resize_w = w;\r\n    resize_h = h;\r\n\r\n    float ratio = 1.0;\r\n    // limit the max side and min side\r\n    if (resize_h > max_side_len_ || resize_w > max_side_len_)\r\n    {\r\n        if (resize_h > resize_w)\r\n            ratio = float(max_side_len_) / float(resize_h);\r\n        else\r\n            ratio = float(max_side_len_) / float(resize_w);\r\n    }\r\n    if (resize_h < min_side_len_ || resize_w < 
min_side_len_)\r\n    {\r\n        if (resize_h < resize_w)\r\n            ratio = float(min_side_len_) / float(resize_h);\r\n        else\r\n            ratio = float(min_side_len_) / float(resize_w);\r\n    }\r\n    resize_h = int(resize_h * ratio);\r\n    resize_w = int(resize_w * ratio);\r\n\r\n    if (resize_h % 32 != 0)\r\n        resize_h = (resize_h / 32 + 1) * 32;\r\n    if (resize_w % 32 != 0)\r\n        resize_w = (resize_w / 32 + 1) * 32;\r\n    ratio_h = resize_h / float(h);\r\n    ratio_w = resize_w / float(w);\r\n\r\n    cv::resize(imageRGB, imageProcessed, cv::Size(resize_w, resize_h));\r\n    float* input = new float[3 * resize_h * resize_w];\r\n    cv::Mat imgFloat;\r\n    imageProcessed.convertTo(imgFloat, CV_32FC3);\r\n    cv::subtract(imgFloat, cv::Scalar(123.68, 116.78, 103.94), imgFloat, cv::noArray(), -1);\r\n    std::vector<cv::Mat> chw;\r\n    for (auto i = 0; i < 3; ++i)\r\n        chw.emplace_back(cv::Mat(cv::Size(resize_w, resize_h), CV_32FC1, input + i * resize_w * resize_h));\r\n    cv::split(imgFloat, chw);\r\n    return input;\r\n}\r\n\r\nstd::vector<cv::RotatedRect> PSENet::postProcess(float* origin_output, int resize_h, int resize_w)\r\n{\r\n    // BxCxHxW  S0 ===> S5  small ===> large\r\n    const int h = resize_h / stride_;\r\n    const int w = resize_w / stride_;\r\n    const int length = h * w;\r\n    // get kernels, sequence: 0->n, max -> min\r\n    std::vector<cv::Mat> kernels(num_kernels_);\r\n    for (auto i = num_kernels_ - 1; i >= 0; --i)\r\n    {\r\n        cv::Mat tmp_kernel(h, w, CV_32FC1, (void*)(origin_output + i * length), 0);\r\n        cv::threshold(tmp_kernel, tmp_kernel, post_threshold_, 255, cv::THRESH_BINARY);\r\n        tmp_kernel.convertTo(tmp_kernel, CV_8UC1);\r\n        assert(tmp_kernel.rows == h && tmp_kernel.cols == w);\r\n        kernels[num_kernels_ - 1 - i] = tmp_kernel;\r\n    }\r\n    cv::Mat stats, centroids, label_image;\r\n    int label_num = cv::connectedComponents(kernels[num_kernels_ - 1], 
label_image, 4);\r\n\r\n    label_image.convertTo(label_image, CV_8U);\r\n    assert(label_image.rows == h && label_image.cols == w);\r\n\r\n    cv::Mat out = cv::Mat::zeros(h, w, CV_8UC1);\r\n    std::queue<std::tuple<int, int, int>> q;\r\n    std::queue<std::tuple<int, int, int>> next_q;\r\n    for (int i = 0; i < h; i++)\r\n    {\r\n        for (int j = 0; j < w; j++)\r\n        {\r\n            auto label = *label_image.ptr(i, j);\r\n            if (label > 0)\r\n            {\r\n                q.push(std::make_tuple(i, j, label));\r\n                *out.ptr(i, j) = label;\r\n            }\r\n        }\r\n    }\r\n\r\n    int dx[4] = { -1, 1, 0, 0 };\r\n    int dy[4] = { 0, 0, -1, 1 };\r\n    for (int i = num_kernels_ - 2; i >= 0; i--)\r\n    {\r\n        // Take the next (larger) kernel\r\n        auto kernel = kernels[i];\r\n        while (!q.empty())\r\n        {\r\n            // Pop the next queue member\r\n            auto q_n = q.front();\r\n            q.pop();\r\n            int y = std::get<0>(q_n); // row (i)\r\n            int x = std::get<1>(q_n); // column (j)\r\n            int l = std::get<2>(q_n); // label\r\n            // A pixel that expands to no neighbour is kept as a seed for the next kernel\r\n            bool is_edge = true;\r\n            for (int idx = 0; idx < 4; idx++)\r\n            {\r\n                int index_y = y + dy[idx];\r\n                int index_x = x + dx[idx];\r\n                if (index_y < 0 || index_y >= h || index_x < 0 || index_x >= w)\r\n                    continue;\r\n                if (!*kernel.ptr(index_y, index_x) || *out.ptr(index_y, index_x) > 0)\r\n                    continue;\r\n                q.push(std::make_tuple(index_y, index_x, l));\r\n                *out.ptr(index_y, index_x) = l;\r\n                is_edge = false;\r\n            }\r\n            if (is_edge)\r\n            {\r\n                next_q.push(std::make_tuple(y, x, l));\r\n            }\r\n        }\r\n        std::swap(q, next_q);\r\n    }\r\n    std::vector<cv::RotatedRect> boxes;\r\n    
for (auto n = 1; n < label_num; ++n)\r\n    {\r\n        std::vector<cv::Point> points;\r\n        cv::findNonZero(out == n, points);\r\n        cv::RotatedRect rect = cv::minAreaRect(points);\r\n        boxes.emplace_back(rect);\r\n    }\r\n    return boxes;\r\n}\r\n"
  },
  {
    "path": "psenet/psenet.h",
    "content": "#ifndef TENSORRTX_PSENET_H\r\n#define TENSORRTX_PSENET_H\r\n#include <memory>\r\n#include <string>\r\n#include <vector>\r\n#include <chrono>\r\n#include <opencv2/opencv.hpp>\r\n#include \"utils.h\"\r\n#include \"layers.h\"\r\nclass PSENet\r\n{\r\npublic:\r\n\tPSENet(int max_side_len, int min_side_len, float threshold, int num_kernel, int stride);\r\n\t~PSENet();\r\n\r\n\tICudaEngine* createEngine(IBuilder* builder, IBuilderConfig* config);\r\n\tvoid serializeEngine();\r\n\tvoid deserializeEngine();\r\n\tvoid init();\r\n\tvoid inferenceOnce(IExecutionContext& context, float* input, float* output, int input_h, int input_w);\r\n\tvoid detect(std::string image_path);\r\n\tfloat* preProcess(cv::Mat image, int& resize_h, int& resize_w, float& ratio_h, float& ratio_w);\r\n\tstd::vector<cv::RotatedRect> postProcess(float* origin_output, int resize_h, int resize_w);\r\n\r\nprivate:\r\n\tLogger gLogger;\r\n\tstd::shared_ptr<nvinfer1::IRuntime> mRuntime;\r\n\tstd::shared_ptr<nvinfer1::ICudaEngine> mCudaEngine;\r\n\tstd::shared_ptr<nvinfer1::IExecutionContext> mContext;\r\n\tDataType dt = DataType::kFLOAT;\r\n\tconst char* input_name_ = \"input\";\r\n\tconst char* output_name_ = \"output\";\r\n\tint max_side_len_ = 1024;\r\n\tint min_side_len_ = 640;\r\n\tfloat post_threshold_ = 0.9;\r\n\tint num_kernels_ = 6;\r\n\tint stride_ = 4;\r\n};\r\n\r\n#endif // TENSORRTX_PSENET_H\r\n"
  },
  {
    "path": "psenet/utils.cpp",
    "content": "#include \"utils.h\"\r\n\r\n// Load weights from files shared with TensorRT samples.\r\n// The weight file read here has a simple space delimited format:\r\n// [name] [size] <data x size in hex>\r\nstd::map<std::string, Weights> loadWeights(const std::string file)\r\n{\r\n    std::cout << \"Loading weights: \" << file << std::endl;\r\n    std::cout << \"Model weight is large, it will take some time.\" << std::endl;\r\n    std::map<std::string, Weights> weightMap;\r\n\r\n    // Open weights file\r\n    std::ifstream input(file);\r\n    assert(input.is_open() && \"Unable to load weight file.\");\r\n\r\n    // Read number of weight blobs\r\n    int32_t count;\r\n    input >> count;\r\n    assert(count > 0 && \"Invalid weight map file.\");\r\n\r\n    while (count--)\r\n    {\r\n        Weights wt{ DataType::kFLOAT, nullptr, 0 };\r\n        uint32_t size;\r\n\r\n        // Read name and size of blob\r\n        std::string name;\r\n        input >> name >> std::dec >> size;\r\n        wt.type = DataType::kFLOAT;\r\n\r\n        // Load blob: one 32-bit word per weight (sizeof(val) would be the pointer size)\r\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\r\n        for (uint32_t x = 0, y = size; x < y; ++x)\r\n        {\r\n            input >> std::hex >> val[x];\r\n        }\r\n        wt.values = val;\r\n\r\n        wt.count = size;\r\n        weightMap[name] = wt;\r\n    }\r\n    std::cout << \"Finish load weight\" << std::endl;\r\n    return weightMap;\r\n}\r\n\r\ncv::RotatedRect expandBox(const cv::RotatedRect& inBox, float ratio)\r\n{\r\n    cv::Size size = inBox.size;\r\n    int neww = int(size.width * ratio);\r\n    int newh = int(size.height * ratio);\r\n    return cv::RotatedRect(inBox.center, cv::Size(neww, newh), inBox.angle);\r\n}\r\n\r\n\r\nvoid drawRects(cv::Mat& image, std::vector<cv::RotatedRect> boxes, float stride, float ratio_h, float ratio_w, float expand_ratio)\r\n{\r\n    cv::Point2f rect[4];\r\n    for (unsigned int i = 0; i < boxes.size(); i++)\r\n    {\r\n        
cv::RotatedRect box = boxes[i];\r\n        cv::RotatedRect expandbox = expandBox(box, expand_ratio);\r\n        expandbox.points(rect);\r\n        for (auto j = 0; j < 4; j++)\r\n        {\r\n            cv::line(image, cv::Point{ int(rect[j].x / ratio_w * stride), int(rect[j].y / ratio_h * stride) }, cv::Point{ int(rect[(j + 1) % 4].x / ratio_w * stride), int(rect[(j + 1) % 4].y / ratio_h * stride) }, cv::Scalar(0, 0, 255), 2, 8);\r\n        }\r\n    }\r\n}\r\n"
  },
  {
    "path": "psenet/utils.h",
    "content": "#ifndef TENSORRTX_UTILS_H\r\n#define TENSORRTX_UTILS_H\r\n\r\n#include <map>\r\n#include <opencv2/opencv.hpp>\r\n#include \"NvInfer.h\"\r\n#include \"cuda_runtime_api.h\"\r\n#include \"assert.h\"\r\n#include <fstream>\r\n\r\nusing namespace nvinfer1;\r\n\r\nstd::map<std::string, Weights> loadWeights(const std::string file);\r\n\r\ncv::RotatedRect expandBox(const cv::RotatedRect& inBox, float ratio = 1.0);\r\n\r\nvoid drawRects(cv::Mat& image, std::vector<cv::RotatedRect> boxes, float stride, float ratio_h, float ratio_w, float expand_ratio);\r\n\r\ncv::Mat renderSegment(cv::Mat image, const cv::Mat& mask);\r\n\r\n// <============== Operator =============>\r\nstruct InferDeleter\r\n{\r\n    template <typename T>\r\n    void operator()(T* obj) const\r\n    {\r\n        if (obj)\r\n        {\r\n            obj->destroy();\r\n        }\r\n    }\r\n};\r\n\r\n#define CHECK(status)                             \\\r\n    do                                            \\\r\n    {                                             \\\r\n        auto ret = (status);                      \\\r\n        if (ret != 0)                             \\\r\n        {                                         \\\r\n            std::cout << \"Cuda failure: \" << ret; \\\r\n            abort();                              \\\r\n        }                                         \\\r\n    } while (0)\r\n\r\n// Logger for TensorRT info/warning/errors\r\nclass Logger : public nvinfer1::ILogger\r\n{\r\npublic:\r\n    Logger() : Logger(Severity::kWARNING) {}\r\n\r\n    Logger(Severity severity) : reportableSeverity(severity) {}\r\n\r\n    void log(Severity severity, const char* msg) override\r\n    {\r\n        // suppress messages with severity enum value greater than the reportable\r\n        if (severity > reportableSeverity)\r\n            return;\r\n\r\n        switch (severity)\r\n        {\r\n        case Severity::kINTERNAL_ERROR:\r\n            std::cerr << \"INTERNAL_ERROR: 
\";\r\n            break;\r\n        case Severity::kERROR:\r\n            std::cerr << \"ERROR: \";\r\n            break;\r\n        case Severity::kWARNING:\r\n            std::cerr << \"WARNING: \";\r\n            break;\r\n        case Severity::kINFO:\r\n            std::cerr << \"INFO: \";\r\n            break;\r\n        default:\r\n            std::cerr << \"UNKNOWN: \";\r\n            break;\r\n        }\r\n        std::cerr << msg << std::endl;\r\n    }\r\n\r\n    Severity reportableSeverity{ Severity::kWARNING };\r\n};\r\n\r\n#endif\r\n"
  },
  {
    "path": "rcnn/BatchedNms.cu",
    "content": "#include <cuda.h>\n#include <thrust/device_ptr.h>\n#include <thrust/sequence.h>\n#include <thrust/execution_policy.h>\n#include <thrust/gather.h>\n#include <cmath>\n#include <algorithm>\n#include <iostream>\n#include <stdexcept>\n#include <cstdint>\n#include <vector>\n#include \"BatchedNmsPlugin.h\"\n#include \"./cuda_utils.h\"\n#include \"macros.h\"\n\n#ifdef CUDA_11\n#include <cub/device/device_radix_sort.cuh>\n#include <cub/iterator/counting_input_iterator.cuh>\n#else\n#include <thrust/system/cuda/detail/cub/device/device_radix_sort.cuh>\n#include <thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh>\nnamespace cub = thrust::cuda_cub::cub;\n\n#endif\n\nnamespace nvinfer1 {\n\n__global__ void batched_nms_kernel(\n    const int nms_method, const float threshold, const int num_detections,\n    const int *indices, float *scores, const float *classes, const float4 *boxes) {\n\n    // Go through detections by descending score\n    for (int m = 0; m < num_detections; m++) {\n        int i = blockIdx.x * blockDim.x + threadIdx.x;\n        if (i < num_detections && m < i && scores[m] > 0.0f) {\n            int idx = indices[i];\n            int max_idx = indices[m];\n            int icls = classes[idx];\n            int mcls = classes[max_idx];\n            if (mcls == icls) {\n                float4 ibox = boxes[idx];\n                float4 mbox = boxes[max_idx];\n                float x1 = max(ibox.x, mbox.x);\n                float y1 = max(ibox.y, mbox.y);\n                float x2 = min(ibox.z, mbox.z);\n                float y2 = min(ibox.w, mbox.w);\n                float w = max(0.0f, x2 - x1);\n                float h = max(0.0f, y2 - y1);\n                float iarea = (ibox.z - ibox.x) * (ibox.w - ibox.y);\n                float marea = (mbox.z - mbox.x) * (mbox.w - mbox.y);\n                float inter = w * h;\n                float overlap = inter / (iarea + marea - inter);\n                float sigma = 0.5;  // this is an 
empirical value\n                // printf(\"nms_method: %d\", nms_method);\n                //nms methods selection in the second stage\n                // 0: original nms\n                // 1: soft-nms (linear)\n                // 2: soft-nms (gaussian)\n                // printf(\"nms_method: \", nms_method);\n                switch (nms_method)\n                {\n                case 0:\n                    if (overlap > threshold) {\n                        scores[i] = 0.0f;\n                    }\n                    break;\n                case 1:\n                    if (overlap > threshold) {\n                        scores[i] = (1 - overlap) * scores[i];\n                    }\n                    break;        \n                case 2:\n                    if (overlap > threshold) {\n                        scores[i] = std::exp(-(overlap * overlap) / sigma) * scores[i];\n                    }\n                    break;           \n                default:\n                    if (overlap > threshold) {\n                        scores[i] = 0.0f;\n                    }\n                    break;\n                }\n            }\n        }\n        // Sync discarded detections\n        __syncthreads();\n    }\n}\n\nint batchedNms(int nms_method, int batch_size,\n    const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n    size_t count, int detections_per_im, float nms_thresh,\n    void *workspace, size_t workspace_size, cudaStream_t stream) {\n\n    if (!workspace || !workspace_size) {\n        // Return required scratch space size cub style\n        workspace_size += get_size_aligned<int>(count);   // indices\n        workspace_size += get_size_aligned<int>(count);   // indices_sorted\n        workspace_size += get_size_aligned<float>(count);  // scores_sorted\n\n        size_t temp_size_sort = 0;\n        cub::DeviceRadixSort::SortPairsDescending(\n            static_cast<void*>(nullptr), temp_size_sort,\n            
static_cast<float*>(nullptr),\n            static_cast<float*>(nullptr),\n            static_cast<int*>(nullptr),\n            static_cast<int*>(nullptr), count);\n        workspace_size += temp_size_sort;\n\n        return workspace_size;\n    }\n\n    auto on_stream = thrust::cuda::par.on(stream);\n\n    auto indices = get_next_ptr<int>(count, workspace, workspace_size);\n    std::vector<int> indices_h(count);\n    for (int i = 0; i < count; i++)\n        indices_h[i] = i;\n    cudaMemcpyAsync(indices, indices_h.data(), count * sizeof(*indices), cudaMemcpyHostToDevice, stream);\n    auto indices_sorted = get_next_ptr<int>(count, workspace, workspace_size);\n    auto scores_sorted = get_next_ptr<float>(count, workspace, workspace_size);\n\n    for (int batch = 0; batch < batch_size; batch++) {\n        auto in_scores = static_cast<const float *>(inputs[0]) + batch * count;\n        auto in_boxes = static_cast<const float4 *>(inputs[1]) + batch * count;\n        auto in_classes = static_cast<const float *>(inputs[2]) + batch * count;\n\n        auto out_scores = static_cast<float *>(outputs[0]) + batch * detections_per_im;\n        auto out_boxes = static_cast<float4 *>(outputs[1]) + batch * detections_per_im;\n        auto out_classes = static_cast<float *>(outputs[2]) + batch * detections_per_im;\n\n        // Sort scores and corresponding indices\n        int num_detections = count;\n        cub::DeviceRadixSort::SortPairsDescending(workspace, workspace_size,\n            in_scores, scores_sorted, indices, indices_sorted, num_detections, 0, sizeof(*scores_sorted) * 8, stream);\n\n        // Launch actual NMS kernel - one thread per detection, in blocks of max_threads\n        // TODO: different devices have different max threads per block\n        const int max_threads = 1024;\n\n        int num_blocks = ceil(static_cast<float>(num_detections) / max_threads);\n        batched_nms_kernel << <num_blocks, max_threads, 0, stream >> > (nms_method, nms_thresh, 
num_detections,\n            indices_sorted, scores_sorted, in_classes, in_boxes);\n\n        // Re-sort with updated scores\n        cub::DeviceRadixSort::SortPairsDescending(workspace, workspace_size,\n            scores_sorted, scores_sorted, indices_sorted, indices,\n            num_detections, 0, sizeof(*scores_sorted) * 8, stream);\n\n        // Gather filtered scores, boxes, classes\n        num_detections = min(detections_per_im, num_detections);\n        cudaMemcpyAsync(out_scores, scores_sorted, num_detections * sizeof *scores_sorted,\n        cudaMemcpyDeviceToDevice, stream);\n        if (num_detections < detections_per_im) {\n            thrust::fill_n(on_stream, out_scores + num_detections, detections_per_im - num_detections, 0);\n        }\n        thrust::gather(on_stream, indices, indices + num_detections, in_boxes, out_boxes);\n        thrust::gather(on_stream, indices, indices + num_detections, in_classes, out_classes);\n    }\n\n    return 0;\n}\n}  // namespace nvinfer1\n"
  },
  {
    "path": "rcnn/BatchedNmsPlugin.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n\n#include <vector>\n#include <cassert>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\n#define PLUGIN_NAME \"BatchedNms\"\n#define PLUGIN_VERSION \"1\"\n#define PLUGIN_NAMESPACE \"\"\n\nnamespace nvinfer1 {\nint batchedNms(int nms_method, int batchSize,\n    const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n    size_t count, int detections_per_im, float nms_thresh,\n    void *workspace, size_t workspace_size, cudaStream_t stream);\n\n/*\n    input1: scores{C, 1} C->topk\n    input2: boxes{C, 4} C->topk format:XYXY\n    input3: classes{C, 1} C->topk\n    output1: scores{C, 1} C->detections_per_img\n    output2: boxes{C, 4} C->detections_per_img format:XYXY\n    output3: classes{C, 1} C->detections_per_img\n    Description: implement batched nms\n*/\nclass BatchedNmsPlugin : public IPluginV2Ext {\n    int _nms_method;\n    float _nms_thresh;\n    int _detections_per_im;\n\n    size_t _count = 1;\n\n protected:\n    void deserialize(void const* data, size_t length) {\n        const char* d = static_cast<const char*>(data);\n        read(d, _nms_method);\n        read(d, _nms_thresh);\n        read(d, _detections_per_im);\n        read(d, _count);\n    }\n\n    size_t getSerializationSize() const TRT_NOEXCEPT override {\n        return sizeof(_nms_method) + sizeof(_nms_thresh) + sizeof(_detections_per_im)\n            + sizeof(_count);\n    }\n\n    void serialize(void *buffer) const TRT_NOEXCEPT override {\n        char* d = static_cast<char*>(buffer);\n        write(d, _nms_method);\n        write(d, _nms_thresh);\n        write(d, _detections_per_im);\n        write(d, _count);\n    }\n\n public:\n    BatchedNmsPlugin(int nms_method, float nms_thresh, int detections_per_im)\n        : _nms_method(nms_method), _nms_thresh(nms_thresh), _detections_per_im(detections_per_im) {\n        assert(nms_method >= 0);\n        assert(nms_thresh > 0);\n        assert(detections_per_im > 0);\n    }\n\n    
BatchedNmsPlugin(int nms_method, float nms_thresh, int detections_per_im, size_t count)\n        : _nms_method(nms_method), _nms_thresh(nms_thresh), _detections_per_im(detections_per_im), _count(count) {\n        assert(nms_method >= 0);\n        assert(nms_thresh > 0);\n        assert(detections_per_im > 0);\n        assert(count > 0);\n    }\n\n    BatchedNmsPlugin(void const* data, size_t length) {\n        this->deserialize(data, length);\n    }\n\n    const char *getPluginType() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n\n    int getNbOutputs() const TRT_NOEXCEPT override {\n        return 3;\n    }\n\n    Dims getOutputDimensions(int index,\n        const Dims *inputs, int nbInputDims) TRT_NOEXCEPT override {\n        assert(nbInputDims == 3);\n        assert(index < this->getNbOutputs());\n        return Dims2(_detections_per_im, index == 1 ? 4 : 1);\n    }\n\n    bool supportsFormat(DataType type, PluginFormat format) const TRT_NOEXCEPT override {\n        return type == DataType::kFLOAT && format == PluginFormat::kLINEAR;\n    }\n\n    int initialize() TRT_NOEXCEPT override { return 0; }\n\n    void terminate() TRT_NOEXCEPT override {}\n\n    size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override {\n        static int size = -1;\n        if (size < 0) {\n            size = batchedNms(_nms_method, maxBatchSize, nullptr, nullptr, _count,\n                _detections_per_im, _nms_thresh,\n                nullptr, 0, nullptr);\n        }\n        return size;\n    }\n\n    int enqueue(int batchSize,\n        const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n        void *workspace, cudaStream_t stream) TRT_NOEXCEPT override {\n        return batchedNms(_nms_method, batchSize, inputs, outputs, _count,\n            _detections_per_im, _nms_thresh,\n            workspace, getWorkspaceSize(batchSize), 
stream);\n    }\n\n    void destroy() TRT_NOEXCEPT override {\n        delete this;\n    }\n\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT override {\n    }\n\n    // IPluginV2Ext Methods\n    DataType getOutputDataType(int index, const DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override {\n        assert(index < 3);\n        return DataType::kFLOAT;\n    }\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n        int nbInputs) const TRT_NOEXCEPT override {\n        return false;\n    }\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override { return false; }\n\n    void configurePlugin(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs,\n        const DataType* inputTypes, const DataType* outputTypes, const bool* inputIsBroadcast,\n        const bool* outputIsBroadcast, PluginFormat floatFormat, int maxBatchSize) TRT_NOEXCEPT override {\n        assert(*inputTypes == nvinfer1::DataType::kFLOAT &&\n            floatFormat == nvinfer1::PluginFormat::kLINEAR);\n        assert(nbInputs == 3);\n        assert(inputDims[0].d[0] == inputDims[2].d[0]);\n        assert(inputDims[1].d[0] == inputDims[2].d[0]);\n        _count = inputDims[0].d[0];\n    }\n\n    IPluginV2Ext *clone() const TRT_NOEXCEPT override {\n        return new BatchedNmsPlugin(_nms_method, _nms_thresh, _detections_per_im, _count);\n    }\n\n private:\n    template<typename T> void write(char*& buffer, const T& val) const {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> void read(const char*& buffer, T& val) {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n};\n\nclass BatchedNmsPluginCreator : public IPluginCreator {\n public:\n    BatchedNmsPluginCreator() {}\n\n    const 
char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n    const char *getPluginName() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n\n    IPluginV2 *deserializePlugin(const char *name, const void *serialData, size_t serialLength) TRT_NOEXCEPT override {\n        return new BatchedNmsPlugin(serialData, serialLength);\n    }\n\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT override {}\n    const PluginFieldCollection *getFieldNames() TRT_NOEXCEPT override { return nullptr; }\n    IPluginV2 *createPlugin(const char *name, const PluginFieldCollection *fc) TRT_NOEXCEPT override { return nullptr; }\n};\n\nREGISTER_TENSORRT_PLUGIN(BatchedNmsPluginCreator);\n\n}  // namespace nvinfer1\n\n#undef PLUGIN_NAME\n#undef PLUGIN_VERSION\n#undef PLUGIN_NAMESPACE\n"
  },
  {
    "path": "rcnn/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.1)\n\nproject(rcnn)\n\nadd_definitions(-std=c++14)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use the static CUDA runtime library\" OFF)\nset(CMAKE_CXX_STANDARD 14)\nset(CMAKE_BUILD_TYPE Debug)\n\nset(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};--extended-lambda)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt; adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/home/jushi/TensorRT-8.2.1.6/include)\nlink_directories(/home/jushi/TensorRT-8.2.1.6/lib)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++14 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\ncuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/BatchedNms.cu ${PROJECT_SOURCE_DIR}/PredictorDecode.cu ${PROJECT_SOURCE_DIR}/RoiAlign.cu ${PROJECT_SOURCE_DIR}/RpnDecode.cu ${PROJECT_SOURCE_DIR}/RpnNms.cu ${PROJECT_SOURCE_DIR}/MaskRcnnInference.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(rcnn ${PROJECT_SOURCE_DIR}/rcnn.cpp)\ntarget_link_libraries(rcnn nvinfer)\ntarget_link_libraries(rcnn cudart)\ntarget_link_libraries(rcnn myplugins)\ntarget_link_libraries(rcnn ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "rcnn/MaskRcnnInference.cu",
    "content": "#include \"MaskRcnnInferencePlugin.h\"\n#include \"macros.h\"\n\nnamespace nvinfer1 {\n\n__device__ float Logist(float data) { return 1.0f / (1.0f + expf(-data)); }\n\n__global__ void MaskRcnnInferenceKernel(\n    const int nthreads,\n    const int detections_per_im,\n    const int output_size,\n    const int num_classes,\n    const float* indices,\n    const float* masks,\n    float* out_masks) {\n    size_t index = blockIdx.x * blockDim.x + threadIdx.x;\n    if (index < nthreads) {\n        int ind = index / output_size / output_size / num_classes;\n        int ind_class = indices[ind];\n        int cur_class = index / output_size / output_size % num_classes;\n        if (ind_class == cur_class) {\n            int w = index % output_size;\n            int h = index / output_size % output_size;\n            int tmp = ind * num_classes * output_size * output_size +\n              cur_class * output_size*output_size + h * output_size + w;\n            float maskVal = masks[ind * num_classes * output_size *\n              output_size + cur_class * output_size * output_size +\n              h * output_size + w];\n            out_masks[ind * output_size * output_size + h * output_size + w] = Logist(maskVal);\n        }\n    }\n}\n\nint maskRcnnInference(int batchSize,\n    const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n    int detections_per_im, int output_size, int num_classes, cudaStream_t stream) {\n\n    for (int batch = 0; batch < batchSize; batch++) {\n        auto in_indices = static_cast<const float *>(inputs[0]) + batch * detections_per_im;\n        auto in_masks = static_cast<const float *>(inputs[1]) + batch * detections_per_im *\n          num_classes * output_size * output_size;\n\n        auto out_masks = static_cast<float *>(outputs[0]) + batch * detections_per_im * output_size * output_size;\n\n        int nthreads = detections_per_im * num_classes * output_size * output_size;\n        const int max_threads = 1024;\n        
int blocksPerGrid = ceil(static_cast<float>(nthreads) / max_threads);\n        // TODO: can implement this function with thrust?\n        MaskRcnnInferenceKernel << <blocksPerGrid, max_threads, 0, stream >> > (\n            nthreads,\n            detections_per_im,\n            output_size,\n            num_classes,\n            in_indices,\n            in_masks,\n            out_masks);\n        cudaDeviceSynchronize();\n    }\n\n    return 0;\n}\n\n}  // namespace nvinfer1\n"
  },
  {
    "path": "rcnn/MaskRcnnInferencePlugin.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n\n#include <vector>\n#include <cassert>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\n#define PLUGIN_NAME \"MaskRcnnInference\"\n#define PLUGIN_VERSION \"1\"\n#define PLUGIN_NAMESPACE \"\"\n\nnamespace nvinfer1 {\nint maskRcnnInference(int batchSize,\n    const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n    int detections_per_im, int output_size, int num_classes, cudaStream_t stream);\n/*\n    input1: indices{C, 1} C->topk\n    input2: masks{C, NUM_CLASS, size, size} C->topk format:XYXY\n    output1: masks{C, 1, size, size} C->detections_per_img\n    Description: implement index select\n*/\n\nclass MaskRcnnInferencePlugin : public IPluginV2Ext {\n    int _detections_per_im;\n    int _output_size;\n    int _num_classes = 1;\n\n protected:\n    void deserialize(void const* data, size_t length) {\n        const char* d = static_cast<const char*>(data);\n        read(d, _detections_per_im);\n        read(d, _output_size);\n        read(d, _num_classes);\n    }\n    size_t getSerializationSize() const TRT_NOEXCEPT override {\n        return sizeof(_detections_per_im) + sizeof(_output_size) + sizeof(_num_classes);\n    }\n    void serialize(void *buffer) const TRT_NOEXCEPT override {\n        char* d = static_cast<char*>(buffer);\n        write(d, _detections_per_im);\n        write(d, _output_size);\n        write(d, _num_classes);\n    }\n\n public:\n    MaskRcnnInferencePlugin(int detections_per_im, int output_size)\n        : _detections_per_im(detections_per_im), _output_size(output_size) {\n        assert(detections_per_im > 0);\n        assert(output_size > 0);\n    }\n    MaskRcnnInferencePlugin(int detections_per_im, int output_size, int num_classes)\n        : _detections_per_im(detections_per_im), _output_size(output_size), _num_classes(num_classes) {\n        assert(detections_per_im > 0);\n        assert(output_size > 0);\n        assert(num_classes > 0);\n    }\n    
MaskRcnnInferencePlugin(void const* data, size_t length) {\n        this->deserialize(data, length);\n    }\n    const char *getPluginType() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n    int getNbOutputs() const TRT_NOEXCEPT override {\n        return 1;\n    }\n    Dims getOutputDimensions(int index,\n        const Dims *inputs, int nbInputDims) TRT_NOEXCEPT override {\n        assert(index < this->getNbOutputs());\n        return Dims4(_detections_per_im, 1, _output_size, _output_size);\n    }\n    bool supportsFormat(DataType type, PluginFormat format) const TRT_NOEXCEPT override {\n        return type == DataType::kFLOAT && format == PluginFormat::kLINEAR;\n    }\n    int initialize() TRT_NOEXCEPT override { return 0; }\n    void terminate() TRT_NOEXCEPT override {}\n    size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override {\n        return 0;\n    }\n    int enqueue(int batchSize,\n        const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n        void *workspace, cudaStream_t stream) TRT_NOEXCEPT override {\n        return maskRcnnInference(batchSize, inputs, outputs,\n            _detections_per_im, _output_size, _num_classes, stream);\n    }\n    void destroy() TRT_NOEXCEPT override {\n        delete this;\n    }\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT override {\n    }\n    // IPluginV2Ext Methods\n    DataType getOutputDataType(int index, const DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override {\n        assert(index < 1);\n        return DataType::kFLOAT;\n    }\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n        int nbInputs) const TRT_NOEXCEPT override {\n        return false;\n    }\n    bool 
canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override { return false; }\n    void configurePlugin(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs,\n        const DataType* inputTypes, const DataType* outputTypes, const bool* inputIsBroadcast,\n        const bool* outputIsBroadcast, PluginFormat floatFormat, int maxBatchSize) TRT_NOEXCEPT override {\n        assert(*inputTypes == nvinfer1::DataType::kFLOAT &&\n            floatFormat == nvinfer1::PluginFormat::kLINEAR);\n        assert(nbInputs == 2);\n        assert(inputDims[0].d[0] == _detections_per_im);\n        assert(inputDims[1].d[0] == _detections_per_im);\n        assert(inputDims[1].d[2] == _output_size);\n        assert(inputDims[1].d[3] == _output_size);\n        _num_classes = inputDims[1].d[1];\n    }\n    IPluginV2Ext *clone() const TRT_NOEXCEPT override {\n        return new MaskRcnnInferencePlugin(_detections_per_im, _output_size, _num_classes);\n    }\n\n private:\n    template<typename T> void write(char*& buffer, const T& val) const {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n    template<typename T> void read(const char*& buffer, T& val) {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n};\n\nclass MaskRcnnInferencePluginCreator : public IPluginCreator {\n public:\n    MaskRcnnInferencePluginCreator() {}\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n    const char *getPluginName() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n    IPluginV2 *deserializePlugin(const char *name, const void *serialData, size_t serialLength) TRT_NOEXCEPT override {\n        return new MaskRcnnInferencePlugin(serialData, serialLength);\n    }\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT 
override {}\n    const PluginFieldCollection *getFieldNames() TRT_NOEXCEPT override { return nullptr; }\n    IPluginV2 *createPlugin(const char *name, const PluginFieldCollection *fc) TRT_NOEXCEPT override { return nullptr; }\n};\n\nREGISTER_TENSORRT_PLUGIN(MaskRcnnInferencePluginCreator);\n\n}  // namespace nvinfer1\n\n#undef PLUGIN_NAME\n#undef PLUGIN_VERSION\n#undef PLUGIN_NAMESPACE\n"
  },
  {
    "path": "rcnn/PredictorDecode.cu",
    "content": "#include <thrust/device_ptr.h>\n#include <thrust/sequence.h>\n#include <thrust/execution_policy.h>\n#include <thrust/gather.h>\n\n#include <algorithm>\n#include <cstdint>\n\n#include \"PredictorDecodePlugin.h\"\n#include \"./cuda_utils.h\"\n#include \"macros.h\"\n\n#ifdef CUDA_11\n#include <cub/device/device_radix_sort.cuh>\n#include <cub/iterator/counting_input_iterator.cuh>\n#else\n#include <thrust/system/cuda/detail/cub/device/device_radix_sort.cuh>\n#include <thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh>\nnamespace cub = thrust::cuda_cub::cub;\n#endif\n\nnamespace nvinfer1 {\n\nint predictorDecode(int batchSize, const void *const *inputs,\nvoid *TRT_CONST_ENQUEUE*outputs, unsigned int num_boxes, unsigned int num_classes,\nunsigned int image_height, unsigned int image_width,\nconst std::vector<float>& bbox_reg_weights, void *workspace,\nsize_t workspace_size, cudaStream_t stream) {\n    int scores_size = num_boxes * num_classes;\n\n    if (!workspace || !workspace_size) {\n        // Return required scratch space size cub style\n        workspace_size = get_size_aligned<float>(bbox_reg_weights.size());  // anchors\n        workspace_size += get_size_aligned<int>(scores_size);      // indices\n        workspace_size += get_size_aligned<int>(scores_size);      // indices_sorted\n        workspace_size += get_size_aligned<float>(scores_size);    // scores_sorted\n\n        size_t temp_size_sort = 0;\n        cub::DeviceRadixSort::SortPairsDescending(\n            static_cast<void*>(nullptr), temp_size_sort,\n            static_cast<float*>(nullptr),\n            static_cast<float*>(nullptr),\n            static_cast<int*>(nullptr),\n            static_cast<int*>(nullptr),\n            scores_size);\n        workspace_size += temp_size_sort;\n\n        return workspace_size;\n    }\n\n    auto bbox_reg_weights_d = get_next_ptr<float>(bbox_reg_weights.size(), workspace, workspace_size);\n    cudaMemcpyAsync(bbox_reg_weights_d, 
bbox_reg_weights.data(),\n    bbox_reg_weights.size() * sizeof *bbox_reg_weights_d,\n    cudaMemcpyHostToDevice, stream);\n\n    auto on_stream = thrust::cuda::par.on(stream);\n\n    auto indices = get_next_ptr<int>(scores_size, workspace, workspace_size);\n    std::vector<int> indices_h(scores_size, 0);\n    for (int i = 0; i < scores_size; i++) indices_h[i] = i;\n    cudaMemcpyAsync(indices, indices_h.data(), scores_size * sizeof(int), cudaMemcpyHostToDevice, stream);\n    auto indices_sorted = get_next_ptr<int>(scores_size, workspace, workspace_size);\n    auto scores_sorted = get_next_ptr<float>(scores_size, workspace, workspace_size);\n\n    for (int batch = 0; batch < batchSize; batch++) {\n        auto in_scores = static_cast<const float *>(inputs[0]) + batch * scores_size;\n        auto in_boxes = static_cast<const float4 *>(inputs[1]) + batch * scores_size;\n        auto in_proposals = static_cast<const float4 *>(inputs[2]) + batch * num_boxes;\n\n        auto out_scores = static_cast<float *>(outputs[0]) + batch * num_boxes;\n        auto out_boxes = static_cast<float4 *>(outputs[1]) + batch * num_boxes;\n        auto out_classes = static_cast<float *>(outputs[2]) + batch * num_boxes;\n\n        // Only keep top n scores\n        cub::DeviceRadixSort::SortPairsDescending(workspace, workspace_size,\n            in_scores, scores_sorted, indices, indices_sorted, scores_size, 0, sizeof(*scores_sorted) * 8, stream);\n\n        // Gather boxes\n        thrust::transform(on_stream, indices_sorted, indices_sorted + num_boxes,\n            thrust::make_zip_iterator(thrust::make_tuple(out_scores, out_boxes, out_classes)),\n            [=] __device__(int i) {\n            int cls = i % num_classes;\n            int n = i / num_classes;\n            float4 deltas = in_boxes[i];\n\n            float4 boxes = in_proposals[n];\n\n            float w = boxes.z - boxes.x;\n            float h = boxes.w - boxes.y;\n            float pred_ctr_x = (deltas.x / 
bbox_reg_weights_d[0]) * w + boxes.x + 0.5f * w;\n            float pred_ctr_y = (deltas.y / bbox_reg_weights_d[1]) * h + boxes.y + 0.5f * h;\n            float pred_w = exp(deltas.z / bbox_reg_weights_d[2]) * w;\n            float pred_h = exp(deltas.w / bbox_reg_weights_d[3]) * h;\n\n            // clip x to the image width and y to the image height\n            boxes = float4{\n              max(0.0f, pred_ctr_x - 0.5f * pred_w),\n              max(0.0f, pred_ctr_y - 0.5f * pred_h),\n              min(pred_ctr_x + 0.5f * pred_w, static_cast<float>(image_width)),\n              min(pred_ctr_y + 0.5f * pred_h, static_cast<float>(image_height))\n            };\n\n            // filter empty boxes\n            if (boxes.z - boxes.x <= 0.0f || boxes.w - boxes.y <= 0.0f) return thrust::make_tuple(0.0f, boxes, cls);\n            else\n                return thrust::make_tuple(in_scores[i], boxes, cls);\n        });\n    }\n\n    return 0;\n}\n\n}  // namespace nvinfer1\n"
  },
  {
    "path": "rcnn/PredictorDecodePlugin.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n\n#include <cassert>\n#include <vector>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\n#define PLUGIN_NAME \"PredictorDecode\"\n#define PLUGIN_VERSION \"1\"\n#define PLUGIN_NAMESPACE \"\"\n\nnamespace nvinfer1 {\n\nint predictorDecode(int batchSize,\nconst void *const *inputs, void *TRT_CONST_ENQUEUE*outputs, unsigned int num_boxes,\nunsigned int num_classes, unsigned int image_height,\nunsigned int image_width, const std::vector<float>& bbox_reg_weights,\nvoid *workspace, size_t workspace_size, cudaStream_t stream);\n\n/*\n    input1: scores{N,C,1,1} N->nums C->num of classes\n    input2: boxes{N,C*4,1,1} N->nums C->num of classes\n    input3: proposals{N,4} N->nums\n    output1: scores{N, 1} N->nums\n    output2: boxes{N, 4} N->nums format:XYXY\n    output3: classes{N, 1} N->nums\n    Description: implement fast rcnn decode\n*/\nclass PredictorDecodePlugin : public IPluginV2Ext {\n    unsigned int _num_boxes;\n    unsigned int _num_classes;\n    unsigned int _image_height;\n    unsigned int _image_width;\n    std::vector<float> _bbox_reg_weights;\n    mutable int size = -1;\n\n protected:\n    void deserialize(void const* data, size_t length) {\n        const char* d = static_cast<const char*>(data);\n        read(d, _num_boxes);\n        read(d, _num_classes);\n        read(d, _image_height);\n        read(d, _image_width);\n        size_t bbox_reg_weights_size;\n        read(d, bbox_reg_weights_size);\n        while (bbox_reg_weights_size--) {\n            float val;\n            read(d, val);\n            _bbox_reg_weights.push_back(val);\n        }\n    }\n\n    size_t getSerializationSize() const TRT_NOEXCEPT override {\n        return sizeof(_num_boxes) + sizeof(_num_classes) +\n        sizeof(_image_height) + sizeof(_image_width) + sizeof(size_t) +\n        sizeof(float)*_bbox_reg_weights.size();\n    }\n\n    void serialize(void *buffer) const TRT_NOEXCEPT override {\n        char* d = 
static_cast<char*>(buffer);\n        write(d, _num_boxes);\n        write(d, _num_classes);\n        write(d, _image_height);\n        write(d, _image_width);\n        write(d, _bbox_reg_weights.size());\n        for (auto &val : _bbox_reg_weights) {\n            write(d, val);\n        }\n    }\n\n public:\n    PredictorDecodePlugin(unsigned int num_boxes, unsigned int image_height,\n    unsigned int image_width, std::vector<float> const& bbox_reg_weights)\n        : _num_boxes(num_boxes), _image_height(image_height),\n        _image_width(image_width), _bbox_reg_weights(bbox_reg_weights) {}\n\n    PredictorDecodePlugin(unsigned int num_boxes, unsigned int num_classes,\n    unsigned int image_height, unsigned int image_width,\n    std::vector<float> const& bbox_reg_weights)\n        : _num_boxes(num_boxes), _num_classes(num_classes),\n        _image_height(image_height), _image_width(image_width),\n        _bbox_reg_weights(bbox_reg_weights) {}\n\n    PredictorDecodePlugin(void const* data, size_t length) {\n        this->deserialize(data, length);\n    }\n\n    const char *getPluginType() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n\n    int getNbOutputs() const TRT_NOEXCEPT override {\n        return 3;\n    }\n\n    Dims getOutputDimensions(int index,\n        const Dims *inputs, int nbInputDims) TRT_NOEXCEPT override {\n        assert(nbInputDims == 3);\n        assert(index < this->getNbOutputs());\n        return Dims2(_num_boxes, (index == 1 ? 
4 : 1));\n    }\n\n    bool supportsFormat(DataType type, PluginFormat format) const TRT_NOEXCEPT override {\n        return type == DataType::kFLOAT && format == PluginFormat::kLINEAR;\n    }\n\n    int initialize() TRT_NOEXCEPT override { return 0; }\n\n    void terminate() TRT_NOEXCEPT override {}\n\n    size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override {\n        if (size < 0) {\n            size = predictorDecode(maxBatchSize, nullptr, nullptr,\n            _num_boxes, _num_classes, _image_height, _image_width,\n            _bbox_reg_weights, nullptr, 0, nullptr);\n        }\n        return size;\n    }\n\n    int enqueue(int batchSize,\n        const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n        void *workspace, cudaStream_t stream) TRT_NOEXCEPT override {\n        return predictorDecode(batchSize, inputs, outputs, _num_boxes,\n        _num_classes, _image_height, _image_width, _bbox_reg_weights,\n        workspace, getWorkspaceSize(batchSize), stream);\n    }\n\n    void destroy() TRT_NOEXCEPT override {\n        delete this;\n    };\n\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT override {}\n\n    // IPluginV2Ext Methods\n    DataType getOutputDataType(int index, const DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override {\n        assert(index < this->getNbOutputs());\n        return DataType::kFLOAT;\n    }\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n        int nbInputs) const TRT_NOEXCEPT override {\n        return false;\n    }\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override { return false; }\n\n    void configurePlugin(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs,\n        const DataType* inputTypes, const DataType* outputTypes, const bool* inputIsBroadcast,\n        
const bool* outputIsBroadcast, PluginFormat floatFormat, int maxBatchSize) TRT_NOEXCEPT override {\n        assert(*inputTypes == nvinfer1::DataType::kFLOAT &&\n            floatFormat == nvinfer1::PluginFormat::kLINEAR);\n        assert(nbInputs == 3);\n        assert(nbOutputs == 3);\n        auto const& scores_dims = inputDims[0];\n        auto const& boxes_dims = inputDims[1];\n        auto const& proposals_dims = inputDims[2];\n        assert(scores_dims.d[0] == _num_boxes);\n        assert(scores_dims.d[0] == boxes_dims.d[0]);\n        assert(scores_dims.d[0] == proposals_dims.d[0]);\n        assert(scores_dims.d[1] * 4 == boxes_dims.d[1]);\n        assert(proposals_dims.d[1] == 4);\n        _num_classes = scores_dims.d[1];\n    }\n\n    IPluginV2Ext *clone() const TRT_NOEXCEPT override {\n        return new PredictorDecodePlugin(_num_boxes, _num_classes, _image_height, _image_width, _bbox_reg_weights);\n    }\n\n private:\n    template<typename T> void write(char*& buffer, const T& val) const {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> void read(const char*& buffer, T& val) {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n};\n\nclass PredictorDecodePluginCreator : public IPluginCreator {\n public:\n    PredictorDecodePluginCreator() {}\n\n    const char *getPluginName() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n\n    IPluginV2 *deserializePlugin(const char *name, const void *serialData, size_t serialLength) TRT_NOEXCEPT override {\n        return new PredictorDecodePlugin(serialData, serialLength);\n    }\n\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT override {}\n    const PluginFieldCollection 
*getFieldNames() TRT_NOEXCEPT override { return nullptr; }\n    IPluginV2 *createPlugin(const char *name, const PluginFieldCollection *fc) TRT_NOEXCEPT override { return nullptr; }\n};\n\nREGISTER_TENSORRT_PLUGIN(PredictorDecodePluginCreator);\n\n}  // namespace nvinfer1\n\n#undef PLUGIN_NAME\n#undef PLUGIN_VERSION\n#undef PLUGIN_NAMESPACE\n"
  },
  {
    "path": "rcnn/README.md",
    "content": "# Rcnn\n\nThe PyTorch implementation is [facebookresearch/detectron2](https://github.com/facebookresearch/detectron2). Instance segmentation results can now be output at the original image size, and different nms methods can be selected, which is more convenient for engineering applications.\n\n## Models\n\n- [x] Faster R-CNN(C4)\n\n- [x] Mask R-CNN(C4)\n\n## Test Environment\n\n- RTX3090 / Ubuntu20.04 / cuda11 / cudnn8.0.4 / TensorRT8.1.1 / OpenCV4.5 from docker hakuyyf/tensorrtx:trt8_cuda11\n- RTX2080Ti / Ubuntu16.04 / cuda10.2 / cudnn8.0.4 / TensorRT7.2.1 / OpenCV4.2\n- RTX2080Ti / win10 / cuda10.2 / cudnn8.0.4 / TensorRT7.2.1 / OpenCV4.2 / VS2017 (you need to replace the functions that depend on dirent.h and add \"--extended-lambda\" in CUDA C/C++ -> Command Line -> Other options)\n\nTensorRT7.2 is recommended because the Resize layer in 7.0 with kLINEAR mode behaves slightly differently from OpenCV. If you want to use TensorRT7.0 or an earlier version, you can implement the data preprocessing outside of TensorRT. \nTensorRT 8.x is also supported.\n\n**The result under fp32 matches PyTorch to about 4 decimal places**!\n\n## Contributors\n\n<a href=\"https://github.com/HaiyangPeng\"><img src=\"https://avatars.githubusercontent.com/u/46739135?v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/nengwp\"><img src=\"https://avatars.githubusercontent.com/u/44516353?s=96&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/freedenS\"><img src=\"https://avatars.githubusercontent.com/u/26213470?v=4\" width=\"40px;\" alt=\"\"/></a>\n\n## How to Run\n\n1. 
generate .wts from pytorch with .pkl or .pth\n\n```\n// git clone -b v0.4 https://github.com/facebookresearch/detectron2.git\n// go to facebookresearch/detectron2\npython setup.py build develop // for more install information see https://github.com/facebookresearch/detectron2/blob/master/INSTALL.md\n// download https://dl.fbaipublicfiles.com/detectron2/COCO-Detection/faster_rcnn_R_50_C4_1x/137257644/model_final_721ade.pkl\n// download https://raw.githubusercontent.com/freedenS/TestImage/main/demo.jpg\n// copy tensorrtx/rcnn/gen_wts.py and demo.jpg into facebookresearch/detectron2\n// ensure cfg.MODEL.WEIGHTS in gen_wts.py is correct\n// go to facebookresearch/detectron2\npython gen_wts.py\n// a file 'faster.wts' will be generated.\n```\n\n2. build tensorrtx/rcnn and run\n\n```\n// put faster.wts into tensorrtx/rcnn\n// go to tensorrtx/rcnn\n// update parameters in rcnn.cpp if your model is trained on a custom dataset. The parameters correspond to the config in detectron2.\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./rcnn -s [.wts] [.engine] [m] // serialize model to plan file, add m for maskrcnn\nsudo ./rcnn -d [.engine] [image folder] [m] // deserialize and run inference, the images in [image folder] will be processed. add m for maskrcnn\n// For example\nsudo ./rcnn -s faster.wts faster.engine\nsudo ./rcnn -d faster.engine ../samples\n// sudo ./rcnn -s mask.wts mask.engine m\n// sudo ./rcnn -d mask.engine ../samples m\n```\n\n3. check the images generated, as follows. 
_demo.jpg and so on.\n\n## Backbone\n\n#### R18, R34, R152\n\n```\n// python\n1.download pretrained model\n  R18: https://download.pytorch.org/models/resnet18-f37072fd.pth\n  R34: https://download.pytorch.org/models/resnet34-b627a593.pth\n  R50: https://download.pytorch.org/models/resnet50-0676ba61.pth\n  R101: https://download.pytorch.org/models/resnet101-63fe2227.pth\n  R152: https://download.pytorch.org/models/resnet152-394f9c45.pth\n2.convert pth to pkl by facebookresearch/detectron2/tools/convert-torchvision-to-d2.py\n3.set merge_from_file in gen_wts.py\n  ./configs/COCO-Detection/faster_rcnn_R_50_C4_1x.yaml for fasterRcnn\n  ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_C4_1x.yaml for maskRcnn\n4.set cfg.MODEL.RESNETS.DEPTH = 18(34,50,101,152),\n      cfg.MODEL.RESNETS.STRIDE_IN_1X1 = False,\n      cfg.MODEL.RESNETS.RES2_OUT_CHANNELS = 64, // for R18, R34; 256 for others\n      cfg.MODEL.PIXEL_MEAN = [123.675, 116.280, 103.530],\n      cfg.MODEL.PIXEL_STD = [58.395, 57.120, 57.375],\n      cfg.INPUT.FORMAT = \"RGB\"\n  and then train your own model\n5.generate your wts file.\n// c++\n6.set BACKBONE_RESNETTYPE = R18(R34,R50,R101,R152) in rcnn.cpp line 14\n7.modify PIXEL_MEAN and PIXEL_STD in rcnn.cpp\n8.set STRIDE_IN_1X1=false in backbone.hpp line 9\n9.set any other parameters that differ from the defaults\n10.build your engine, refer to How to Run\n11.convert your image to RGB before inference\n```\n\n#### R50, R101\n\n```\n1.download pretrained model\n  R50: https://dl.fbaipublicfiles.com/detectron2/COCO-Detection/faster_rcnn_R_50_C4_1x/137257644/model_final_721ade.pkl for fasterRcnn\n       https://dl.fbaipublicfiles.com/detectron2/COCO-InstanceSegmentation/mask_rcnn_R_50_C4_1x/137259246/model_final_9243eb.pkl for maskRcnn\n  R101: https://dl.fbaipublicfiles.com/detectron2/COCO-Detection/faster_rcnn_R_101_C4_3x/138204752/model_final_298dad.pkl for fasterRcnn\n        
https://dl.fbaipublicfiles.com/detectron2/COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x/138363239/model_final_a2914c.pkl for maskRcnn\n2.set merge_from_file in gen_wts.py\n  R50-faster: ./configs/COCO-Detection/faster_rcnn_R_50_C4_1x.yaml\n  R101-faster: ./configs/COCO-Detection/faster_rcnn_R_101_C4_3x.yaml\n  R50-mask: ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_C4_1x.yaml\n  R101-mask: ./configs/COCO-InstanceSegmentation/mask_rcnn_R_101_C4_3x.yaml\n3.set BACKBONE_RESNETTYPE = R50(R101) in rcnn.cpp line 14\n4.set STRIDE_IN_1X1=true in backbone.hpp\n5.follow How to Run\n```\n\n## NOTE\n\n- if you hit the error below, just run make again. The flag has already been added in CMakeLists.txt\n\n  ```\n  error: __host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag\n  ```\n\n- the image preprocessing (resizing and padding) was moved out of tensorrt, see DataPreprocess in rcnn.cpp, so the input data is {H, W, C}\n- left-right and top-bottom padding are now optionally available in preprocessImg of common.hpp, and you can set arbitrary sizes for INPUT_H_ and INPUT_W_\n\n- the predicted boxes correspond to the new image size including padding, so the final boxes need to have the padding size subtracted and be multiplied by the ratio, see preprocessImg in common.hpp and calculateSize in rcnn.cpp\n\n- tensorrt uses a fixed input size. if the size of your data is different from the engine's, you need to adjust your data and the result.\n\n- if you want to use maskrcnn with cuda10.2, please make sure that you have upgraded cuda to the latest patch. see https://github.com/NVIDIA/TensorRT/issues/1151 for details.\n\n- you can build fasterRcnn with a maskRcnn weights file.\n\n- initialize _pre_nms_topk in RpnNmsPlugin, _count in BatchedNmsPlugin and _num_classes in MaskRcnnInferencePlugin inside the class to prevent assertion failures, because configurePlugin is called after clone() and before serialize(). 
you can also set them through the constructor.\n\n## Quantization\n\n1. quantizationType: fp32, fp16, int8. see BuildRcnnModel (rcnn.cpp line 345) for details.\n\n2. int8 usage is the same as in [tensorrtx/yolov5](../yolov5/README.md).\n\n## Latency\n\naverage cost of doInference (in rcnn.cpp), measured from the second run onward, with batch=1 under the ubuntu environment above, input size: 640(w)*480(h)\n\n|               | fp32  | fp16 | int8 |\n| ------------- | ----- | ---- | ---- |\n| Faster-R50C4  | 138ms | 36ms | 30ms |\n| Faster-R101C4 | 146ms | 38ms | 32ms |\n| Mask-R50C4    | 153ms | 44ms | 33ms |\n| Mask-R101C4   | 168ms | 45ms | 35ms |\n\n## Plugins\n\ndecode and nms plugins are modified from [retinanet-examples](https://github.com/NVIDIA/retinanet-examples/tree/master/csrc/plugins)\n\n- RpnDecodePlugin: calculate coordinates of the top n proposals\n\n```\nparameters:\n  top_n: num of proposals to select\n  anchors: coordinates of all anchors\n  stride: stride of current feature map\n  image_height: image height after DataPreprocess for clipping the box beyond the boundary\n  image_width: image width after DataPreprocess for clipping the box beyond the boundary\n\nInputs:\n  scores{C,H,W} C is number of anchors, H and W are the size of feature map\n  boxes{C,H,W} C is 4*number of anchors, H and W are the size of feature map\nOutputs:\n  scores{C,1} C is equal to top_n\n  boxes{C,4} C is equal to top_n\n```\n\n- RpnNmsPlugin: apply nms to proposals\n\n```\nparameters:\n  nms_thresh: thresh of nms\n  post_nms_topk: number of proposals to select\n  \nInputs:\n  scores{C,1} C is equal to top_n\n  boxes{C,4} C is equal to top_n\nOutputs:\n  boxes{C,4} C is equal to post_nms_topk\n```\n\n- RoiAlignPlugin: implementation of RoiAlign(align=True). 
see https://github.com/facebookresearch/detectron2/blob/f50ec07cf220982e2c4861c5a9a17c4864ab5bfd/detectron2/layers/roi_align.py#L7 for details\n\n```\nparameters:\n  pooler_resolution: output size\n  spatial_scale: scale the input boxes by this number\n  sampling_ratio: number of input samples to take for each output\n  num_proposals: number of proposals\n  \nInputs:\n  boxes{N,4} N is number of boxes\n  features{C,H,W} C is channels of feature map, H and W are sizes of feature map\nOutputs:\n  features{N,C,H,W} N is number of boxes, C is channels of feature map, H and W are equal to pooler_resolution\n```\n\n- PredictorDecodePlugin: calculate coordinates of predicted boxes by applying deltas to proposals\n\n```\nparameters:\n  num_boxes: num of proposals\n  image_height: image height after DataPreprocess for clipping the box beyond the boundary\n  image_width: image width after DataPreprocess for clipping the box beyond the boundary\n  bbox_reg_weights: the weights for dx,dy,dw,dh. see https://github.com/facebookresearch/detectron2/blob/master/detectron2/config/defaults.py#L292 for details\n\nInputs:\n  scores{N,C,1,1} N is equal to num_boxes, C is the num of classes\n  boxes{N,C*4,1,1} N is equal to num_boxes, C is the num of classes\n  proposals{N,4} N is equal to num_boxes\nOutputs:\n  scores{N,1} N is equal to num_boxes\n  boxes{N,4} N is equal to num_boxes\n  classes{N,1} N is equal to num_boxes\n```\n\n- BatchedNmsPlugin: apply nms to predicted boxes with different classes. 
same as https://github.com/facebookresearch/detectron2/blob/master/detectron2/layers/nms.py#L19\n\n```\nparameters:\n  nms_thresh: thresh of nms\n  detections_per_im: number of detections to return per image\n\nInputs:\n  scores{N,1} N is the number of the boxes\n  boxes{N,4} N is the number of the boxes\n  classes{N,1} N is the number of the boxes\nOutputs:\n  scores{N,1} N is equal to detections_per_im\n  boxes{N,4} N is equal to detections_per_im\n  classes{N,1} N is equal to detections_per_im\n```\n\n- MaskRcnnInferencePlugin: extract the masks for the predicted classes and apply sigmoid. same as https://github.com/facebookresearch/detectron2/blob/9c7f8a142216ebc52d3617c11f8fafd75b74e637/detectron2/modeling/roi_heads/mask_head.py#L114\n\n```\nparameters:\n  detections_per_im: number of detections to return per image\n  output_size: same as the output size of RoiAlign\n\nInputs:\n  indices{N,1} N is the number of the predicted boxes\n  masks{N,C,H,W} N is the number of the predicted boxes\nOutputs:\n  selected_masks{N,1,H,W} N is the number of the predicted boxes, H and W are equal to output_size\n```\n\n"
  },
  {
    "path": "rcnn/RoiAlign.cu",
    "content": "#include <cuda.h>\n#include <thrust/device_ptr.h>\n#include <thrust/device_vector.h>\n#include <thrust/sequence.h>\n#include <thrust/execution_policy.h>\n#include <thrust/gather.h>\n\n#include <algorithm>\n#include <iostream>\n#include <stdexcept>\n#include <cstdint>\n#include <vector>\n#include <cmath>\n\n#include \"RoiAlignPlugin.h\"\n#include \"./cuda_utils.h\"\n#include \"macros.h\"\n\n#ifdef CUDA_11\n#include <cub/device/device_radix_sort.cuh>\n#include <cub/iterator/counting_input_iterator.cuh>\n#else\n#include <thrust/system/cuda/detail/cub/device/device_radix_sort.cuh>\n#include <thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh>\nnamespace cub = thrust::cuda_cub::cub;\n#endif\n\nnamespace nvinfer1 {\ntemplate <typename T>\n__device__ T bilinear_interpolate(\n    const T* bottom_data,\n    const int height,\n    const int width,\n    T y,\n    T x) {\n    // deal with cases that inverse elements are out of feature map boundary\n    if (y < -1.0 || y > height || x < -1.0 || x > width) {\n        // empty\n        return 0;\n    }\n\n    if (y <= 0) {\n        y = 0;\n    }\n    if (x <= 0) {\n        x = 0;\n    }\n\n    int y_low = static_cast<int>(y);\n    int x_low = static_cast<int>(x);\n    int y_high;\n    int x_high;\n\n    if (y_low >= height - 1) {\n        y_high = y_low = height - 1;\n        y = (T)y_low;\n    } else {\n        y_high = y_low + 1;\n    }\n\n    if (x_low >= width - 1) {\n        x_high = x_low = width - 1;\n        x = (T)x_low;\n    } else {\n        x_high = x_low + 1;\n    }\n\n    T ly = y - y_low;\n    T lx = x - x_low;\n    T hy = 1. - ly, hx = 1. 
- lx;\n    // do bilinear interpolation\n    T v1 = bottom_data[y_low * width + x_low];\n    T v2 = bottom_data[y_low * width + x_high];\n    T v3 = bottom_data[y_high * width + x_low];\n    T v4 = bottom_data[y_high * width + x_high];\n    T w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n\n    T val = w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4;  // mode Avg\n\n    return val;\n}\n\n__global__ void RoIAlignForward(\n    const int nthreads,\n    const float* bottom_data,\n    const float spatial_scale,\n    const int channels,\n    const int height,\n    const int width,\n    const int pooled_height,\n    const int pooled_width,\n    const int sampling_ratio,\n    const float4* bottom_rois,\n    float* top_data) {\n    for (size_t index = blockIdx.x * blockDim.x + threadIdx.x; index < nthreads; index += blockDim.x * gridDim.x) {\n        // (n, c, ph, pw) is an element in the pooled output\n        int pw = index % pooled_width;\n        int ph = (index / pooled_width) % pooled_height;\n        int c = (index / pooled_width / pooled_height) % channels;\n        int n = index / pooled_width / pooled_height / channels;\n\n        const float4* offset_bottom_rois = bottom_rois + n;\n\n        // Do not using rounding; this implementation detail is critical\n        float roi_offset = 0.5f;\n        float roi_start_w = offset_bottom_rois->x * spatial_scale - roi_offset;\n        float roi_start_h = offset_bottom_rois->y * spatial_scale - roi_offset;\n        float roi_end_w = offset_bottom_rois->z * spatial_scale - roi_offset;\n        float roi_end_h = offset_bottom_rois->w * spatial_scale - roi_offset;\n\n        float roi_width = roi_end_w - roi_start_w;\n        float roi_height = roi_end_h - roi_start_h;\n\n        float bin_size_h = static_cast<float>(roi_height) / static_cast<float>(pooled_height);\n        float bin_size_w = static_cast<float>(roi_width) / static_cast<float>(pooled_width);\n\n        const float* offset_bottom_data =\n            
bottom_data + static_cast<int>(c * height * width);\n\n        // We use roi_bin_grid to sample the grid and mimic integral\n        int roi_bin_grid_h = (sampling_ratio > 0)\n            ? sampling_ratio\n            : ceil(roi_height / pooled_height);  // e.g., = 2\n        int roi_bin_grid_w =\n            (sampling_ratio > 0) ? sampling_ratio : ceil(roi_width / pooled_width);\n\n        // We do average (integral) pooling inside a bin\n        const float count = roi_bin_grid_h * roi_bin_grid_w;  // e.g. = 4\n\n        float output_val = 0.f;\n        // bool max_flag = false;\n        // e.g., iy = 0, 1\n        for (int iy = 0; iy < roi_bin_grid_h; iy++) {\n            const float y = roi_start_h + ph * bin_size_h +\n                static_cast<float>(iy + .5f) * bin_size_h /\n                static_cast<float>(roi_bin_grid_h);  // e.g., 0.5, 1.5\n            for (int ix = 0; ix < roi_bin_grid_w; ix++) {\n                const float x = roi_start_w + pw * bin_size_w +\n                    static_cast<float>(ix + .5f) * bin_size_w /\n                    static_cast<float>(roi_bin_grid_w);\n\n                float val = bilinear_interpolate(\n                    offset_bottom_data, height, width, y, x);\n\n                output_val += val;\n            }\n        }\n\n        output_val /= count;\n\n        top_data[index] = output_val;\n    }\n}\n\nint roiAlign(int batchSize, const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs, int pooler_resolution, float spatial_scale,\n    int sampling_ratio, int num_proposals, int out_channels, int feature_h, int feature_w, cudaStream_t stream) {\n    for (int batch = 0; batch < batchSize; batch++) {\n        auto in_boxes = static_cast<const float4 *>(inputs[0]) + batch * num_proposals;\n        auto in_features = static_cast<const float *>(inputs[1]) + batch * out_channels * feature_h * feature_w;\n\n        int nthreads = num_proposals * out_channels * pooler_resolution * pooler_resolution;\n        auto 
out_features = static_cast<float *>(outputs[0]) + batch * nthreads;\n        const int max_threads = 1024;\n\n        int blocksPerGrid = ceil(static_cast<float>(nthreads) / max_threads);\n        RoIAlignForward<< <blocksPerGrid, max_threads, 0, stream>> > (\n            nthreads,\n            in_features,\n            spatial_scale,\n            out_channels,\n            feature_h,\n            feature_w,\n            pooler_resolution,\n            pooler_resolution,\n            sampling_ratio,\n            in_boxes,\n            out_features);\n        cudaDeviceSynchronize();\n    }\n\n    return 0;\n}\n}  // namespace nvinfer1\n"
  },
  {
    "path": "rcnn/RoiAlignPlugin.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n\n#include <cassert>\n#include <vector>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\n#define PLUGIN_NAME \"RoiAlign\"\n#define PLUGIN_VERSION \"1\"\n#define PLUGIN_NAMESPACE \"\"\n\nnamespace nvinfer1 {\nint roiAlign(int batchSize, const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\nint pooler_resolution, float spatial_scale, int sampling_ratio,\nint num_proposals, int out_channels, int feature_h, int feature_w,\ncudaStream_t stream);\n\n    /*\n        input1: boxes{N,4} N->post_nms_topk\n        input2: features{C,H,W} C->num of feature map channels\n        output1: features{N, C, H, W} N:nums of proposals C:output out_channels H,W:roialign size\n        Description: roialign\n    */\nclass RoiAlignPlugin : public IPluginV2Ext {\n    int _pooler_resolution;\n    float _spatial_scale;\n    int _sampling_ratio;\n    int _num_proposals;\n    int _out_channels;\n    int _feature_h;\n    int _feature_w;\n\n protected:\n    void deserialize(void const* data, size_t length) {\n        const char* d = static_cast<const char*>(data);\n        read(d, _pooler_resolution);\n        read(d, _spatial_scale);\n        read(d, _sampling_ratio);\n        read(d, _num_proposals);\n        read(d, _out_channels);\n        read(d, _feature_h);\n        read(d, _feature_w);\n    }\n\n    size_t getSerializationSize() const TRT_NOEXCEPT override {\n        return sizeof(_pooler_resolution) + sizeof(_spatial_scale) + sizeof(_sampling_ratio) +\n            sizeof(_num_proposals) + sizeof(_out_channels) + sizeof(_feature_h) + sizeof(_feature_w);\n    }\n\n    void serialize(void *buffer) const TRT_NOEXCEPT override {\n        char* d = static_cast<char*>(buffer);\n        write(d, _pooler_resolution);\n        write(d, _spatial_scale);\n        write(d, _sampling_ratio);\n        write(d, _num_proposals);\n        write(d, _out_channels);\n        write(d, _feature_h);\n        write(d, _feature_w);\n    }\n\n 
public:\n    RoiAlignPlugin(int pooler_resolution, float spatial_scale, int sampling_ratio, int num_proposals,\n        int out_channels)\n        : _pooler_resolution(pooler_resolution), _spatial_scale(spatial_scale), _sampling_ratio(sampling_ratio),\n        _num_proposals(num_proposals), _out_channels(out_channels) {}\n\n    RoiAlignPlugin(int pooler_resolution, float spatial_scale, int sampling_ratio, int num_proposals,\n        int out_channels, int feature_h, int feature_w)\n        : _pooler_resolution(pooler_resolution), _spatial_scale(spatial_scale), _sampling_ratio(sampling_ratio),\n        _num_proposals(num_proposals), _out_channels(out_channels), _feature_h(feature_h), _feature_w(feature_w) {}\n\n    RoiAlignPlugin(void const* data, size_t length) {\n        this->deserialize(data, length);\n    }\n\n    const char *getPluginType() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n\n    int getNbOutputs() const TRT_NOEXCEPT override {\n        return 1;\n    }\n\n    Dims getOutputDimensions(int index,\n        const Dims *inputs, int nbInputDims) TRT_NOEXCEPT override {\n        assert(index < this->getNbOutputs());\n        return Dims4(_num_proposals, _out_channels, _pooler_resolution, _pooler_resolution);\n    }\n\n    bool supportsFormat(DataType type, PluginFormat format) const TRT_NOEXCEPT override {\n        return type == DataType::kFLOAT && format == PluginFormat::kLINEAR;\n    }\n\n    int initialize() TRT_NOEXCEPT override { return 0; }\n\n    void terminate() TRT_NOEXCEPT override {}\n\n    size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override {\n        return 0;\n    }\n\n    int enqueue(int batchSize,\n        const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n        void *workspace, cudaStream_t stream) TRT_NOEXCEPT override {\n        return roiAlign(batchSize, inputs, outputs, 
_pooler_resolution, _spatial_scale, _sampling_ratio,\n            _num_proposals, _out_channels, _feature_h, _feature_w, stream);\n    }\n\n    void destroy() TRT_NOEXCEPT override {\n        delete this;\n    };\n\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT override {\n    }\n\n    // IPluginV2Ext Methods\n    DataType getOutputDataType(int index, const DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override {\n        assert(index < this->getNbOutputs());\n        return DataType::kFLOAT;\n    }\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n        int nbInputs) const TRT_NOEXCEPT override {\n        return false;\n    }\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override { return false; }\n\n    void configurePlugin(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs,\n        const DataType* inputTypes, const DataType* outputTypes, const bool* inputIsBroadcast,\n        const bool* outputIsBroadcast, PluginFormat floatFormat, int maxBatchSize) TRT_NOEXCEPT override {\n        assert(*inputTypes == nvinfer1::DataType::kFLOAT &&\n            floatFormat == nvinfer1::PluginFormat::kLINEAR);\n        assert(nbInputs == 2);\n        assert(nbOutputs == 1);\n        auto const& boxes_dims = inputDims[0];\n        auto const& feature_dims = inputDims[1];\n        assert(_num_proposals == boxes_dims.d[0]);\n        assert(_out_channels == feature_dims.d[0]);\n        _feature_h = feature_dims.d[1];\n        _feature_w = feature_dims.d[2];\n    }\n\n    IPluginV2Ext *clone() const TRT_NOEXCEPT override {\n        return new RoiAlignPlugin(_pooler_resolution, _spatial_scale, _sampling_ratio, _num_proposals,\n            _out_channels, _feature_h, _feature_w);\n    }\n\n private:\n    template<typename T> void write(char*& buffer, const T& 
val) const {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> void read(const char*& buffer, T& val) {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n};\n\nclass RoiAlignPluginCreator : public IPluginCreator {\n public:\n    RoiAlignPluginCreator() {}\n\n    const char *getPluginName() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n\n    IPluginV2 *deserializePlugin(const char *name, const void *serialData, size_t serialLength) TRT_NOEXCEPT override {\n        return new RoiAlignPlugin(serialData, serialLength);\n    }\n\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT override {}\n    const PluginFieldCollection *getFieldNames() TRT_NOEXCEPT override { return nullptr; }\n    IPluginV2 *createPlugin(const char *name, const PluginFieldCollection *fc) TRT_NOEXCEPT override { return nullptr; }\n};\n\nREGISTER_TENSORRT_PLUGIN(RoiAlignPluginCreator);\n}  // namespace nvinfer1\n\n#undef PLUGIN_NAME\n#undef PLUGIN_VERSION\n#undef PLUGIN_NAMESPACE\n"
  },
  {
    "path": "rcnn/RpnDecode.cu",
    "content": "#include <thrust/device_ptr.h>\n#include <thrust/sequence.h>\n#include <thrust/execution_policy.h>\n#include <thrust/gather.h>\n#include <thrust/tabulate.h>\n#include <thrust/count.h>\n#include <thrust/find.h>\n\n#include <algorithm>\n#include <cstdint>\n\n#include \"RpnDecodePlugin.h\"\n#include \"./cuda_utils.h\"\n#include \"macros.h\"\n\n#ifdef CUDA_11\n#include <cub/device/device_radix_sort.cuh>\n#include <cub/iterator/counting_input_iterator.cuh>\n#else\n#include <thrust/system/cuda/detail/cub/device/device_radix_sort.cuh>\n#include <thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh>\nnamespace cub = thrust::cuda_cub::cub;\n#endif\n\nnamespace nvinfer1 {\n\nint rpnDecode(int batch_size,\n    const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n    size_t height, size_t width, size_t image_height, size_t image_width, float stride,\n    const std::vector<float> &anchors, int top_n,\n    void *workspace, size_t workspace_size, cudaStream_t stream) {\n\n    size_t num_anchors = anchors.size() / 4;\n    int scores_size = num_anchors * height * width;\n\n    if (!workspace || !workspace_size) {\n        // Return required scratch space size cub style\n        workspace_size = get_size_aligned<float>(anchors.size());  // anchors\n        workspace_size += get_size_aligned<int>(scores_size);      // indices\n        workspace_size += get_size_aligned<int>(scores_size);      // indices_sorted\n        workspace_size += get_size_aligned<float>(scores_size);    // scores_sorted\n\n        size_t temp_size_sort = 0;\n        if (scores_size > top_n) {\n            cub::DeviceRadixSort::SortPairsDescending(\n                static_cast<void*>(nullptr), temp_size_sort,\n                static_cast<float*>(nullptr),\n                static_cast<float*>(nullptr),\n                static_cast<int*>(nullptr),\n                static_cast<int*>(nullptr), scores_size);\n            workspace_size += temp_size_sort;\n        }\n\n        
return workspace_size;\n    }\n\n    auto anchors_d = get_next_ptr<float>(anchors.size(), workspace, workspace_size);\n    cudaMemcpyAsync(anchors_d, anchors.data(), anchors.size() * sizeof *anchors_d, cudaMemcpyHostToDevice, stream);\n\n    auto on_stream = thrust::cuda::par.on(stream);\n\n    auto indices = get_next_ptr<int>(scores_size, workspace, workspace_size);\n    // TODO: how to generate sequence on gpu directly?\n    std::vector<int> indices_h(scores_size);\n    for (int i = 0; i < scores_size; i++)\n        indices_h[i] = i;\n    cudaMemcpyAsync(indices, indices_h.data(), scores_size * sizeof * indices, cudaMemcpyHostToDevice, stream);\n    auto indices_sorted = get_next_ptr<int>(scores_size, workspace, workspace_size);\n    auto scores_sorted = get_next_ptr<float>(scores_size, workspace, workspace_size);\n\n    for (int batch = 0; batch < batch_size; batch++) {\n        auto in_scores = static_cast<const float *>(inputs[0]) + batch * scores_size;\n        auto in_boxes = static_cast<const float *>(inputs[1]) + batch * scores_size * 4;\n\n        auto out_scores = static_cast<float *>(outputs[0]) + batch * top_n;\n        auto out_boxes = static_cast<float4 *>(outputs[1]) + batch * top_n;\n\n        // Only keep top n scores\n        int num_detections = scores_size;\n        auto indices_filtered = indices;\n        if (num_detections > top_n) {\n            cub::DeviceRadixSort::SortPairsDescending(workspace, workspace_size,\n                in_scores, scores_sorted, indices, indices_sorted, scores_size, 0, sizeof(*scores_sorted) * 8, stream);\n            indices_filtered = indices_sorted;\n            num_detections = top_n;\n        }\n\n        // Gather boxes\n        bool has_anchors = !anchors.empty();\n        thrust::transform(on_stream, indices_filtered, indices_filtered + num_detections,\n            thrust::make_zip_iterator(thrust::make_tuple(out_scores, out_boxes)),\n            [=] __device__(int i) {\n            int x = i % width;\n    
        int y = (i / width) % height;\n            int a = (i / height / width) % num_anchors;\n            float4 box = float4{\n              in_boxes[((a * 4 + 0) * height + y) * width + x],\n              in_boxes[((a * 4 + 1) * height + y) * width + x],\n              in_boxes[((a * 4 + 2) * height + y) * width + x],\n              in_boxes[((a * 4 + 3) * height + y) * width + x]\n            };\n\n            if (has_anchors) {\n                // Add anchors offsets to deltas\n                float x = (i % width) * stride;\n                float y = ((i / width) % height) * stride;\n                float *d = anchors_d + 4 * a;\n\n                float x1 = x + d[0];\n                float y1 = y + d[1];\n                float x2 = x + d[2];\n                float y2 = y + d[3];\n                float w = x2 - x1;\n                float h = y2 - y1;\n                float pred_ctr_x = box.x * w + x1 + 0.5f * w;\n                float pred_ctr_y = box.y * h + y1 + 0.5f * h;\n                float pred_w = exp(box.z) * w;\n                float pred_h = exp(box.w) * h;\n\n                // TODO: set image size as parameter\n                box = float4{\n                  max(0.0f, pred_ctr_x - 0.5f * pred_w),\n                  max(0.0f, pred_ctr_y - 0.5f * pred_h),\n                  min(pred_ctr_x + 0.5f * pred_w, static_cast<float>(image_width)),\n                  min(pred_ctr_y + 0.5f * pred_h, static_cast<float>(image_height))\n                };\n            }\n            // filter empty boxes\n            if (box.z - box.x <= 0.0f || box.w - box.y <= 0.0f)\n                return thrust::make_tuple(-FLT_MAX, box);\n            else\n                return thrust::make_tuple(in_scores[i], box);\n        });\n\n        // Zero-out unused scores\n        if (num_detections < top_n) {\n            thrust::fill(on_stream, out_scores + num_detections,\n                out_scores + top_n, -FLT_MAX);\n        }\n    }\n\n    return 0;\n}\n}  // namespace 
nvinfer1\n"
  },
  {
    "path": "rcnn/RpnDecodePlugin.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n\n#include <cassert>\n#include <vector>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\n#define PLUGIN_NAME \"RpnDecode\"\n#define PLUGIN_VERSION \"1\"\n#define PLUGIN_NAMESPACE \"\"\n\nnamespace nvinfer1 {\n\nint rpnDecode(int batchSize, const void *const *inputs,\nvoid *TRT_CONST_ENQUEUE*outputs, size_t height, size_t width, size_t image_height,\nsize_t image_width, float stride, const std::vector<float> &anchors,\nint top_n, void *workspace, size_t workspace_size, cudaStream_t stream);\n\n/*\n    input1: scores{C,H,W} C->anchors\n    input2: boxes{C,H,W} C->4*anchors\n    output1: scores{C, 1} C->topk\n    output2: boxes{C, 4} C->topk format:XYXY\n    Description: implement anchor decode\n*/\nclass RpnDecodePlugin : public IPluginV2Ext {\n    int _top_n;\n    std::vector<float> _anchors;\n    float _stride;\n\n    size_t _height;\n    size_t _width;\n    size_t _image_height;  // for clipping the boxes by limiting y coordinates to the range [0, image_height]\n    size_t _image_width;  // for clipping the boxes by limiting x coordinates to the range [0, image_width]\n    mutable int size = -1;\n\n protected:\n    void deserialize(void const* data, size_t length) {\n        const char* d = static_cast<const char*>(data);\n        read(d, _top_n);\n        size_t anchors_size;\n        read(d, anchors_size);\n        while (anchors_size--) {\n            float val;\n            read(d, val);\n            _anchors.push_back(val);\n        }\n        read(d, _stride);\n        read(d, _height);\n        read(d, _width);\n        read(d, _image_height);\n        read(d, _image_width);\n    }\n\n    size_t getSerializationSize() const TRT_NOEXCEPT override {\n        return sizeof(_top_n)\n            + sizeof(size_t) + sizeof(float) * _anchors.size() + sizeof(_stride)\n            + sizeof(_height) + sizeof(_width) + sizeof(_image_height) + sizeof(_image_width);\n    }\n\n    void serialize(void *buffer) const 
TRT_NOEXCEPT override {\n        char* d = static_cast<char*>(buffer);\n        write(d, _top_n);\n        write(d, _anchors.size());\n        for (auto &val : _anchors) {\n            write(d, val);\n        }\n        write(d, _stride);\n        write(d, _height);\n        write(d, _width);\n        write(d, _image_height);\n        write(d, _image_width);\n    }\n\n public:\n    RpnDecodePlugin(int top_n, std::vector<float> const& anchors, float stride, size_t image_height, size_t image_width)\n        :  _top_n(top_n), _anchors(anchors), _stride(stride), _image_height(image_height), _image_width(image_width) {}\n\n    RpnDecodePlugin(int top_n, std::vector<float> const& anchors, float stride,\n        size_t height, size_t width, size_t image_height, size_t image_width)\n        : _top_n(top_n), _anchors(anchors), _stride(stride),\n        _height(height), _width(width), _image_height(image_height), _image_width(image_width) {}\n\n    RpnDecodePlugin(void const* data, size_t length) {\n        this->deserialize(data, length);\n    }\n\n    const char *getPluginType() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n\n    int getNbOutputs() const TRT_NOEXCEPT override {\n        return 2;\n    }\n\n    Dims getOutputDimensions(int index,\n        const Dims *inputs, int nbInputDims) TRT_NOEXCEPT override {\n        assert(nbInputDims == 2);\n        assert(index < this->getNbOutputs());\n        return Dims2(_top_n, (index == 1 ? 
4 : 1));\n    }\n\n    bool supportsFormat(DataType type, PluginFormat format) const TRT_NOEXCEPT override {\n        return type == DataType::kFLOAT && format == PluginFormat::kLINEAR;\n    }\n\n    int initialize() TRT_NOEXCEPT override { return 0; }\n\n    void terminate() TRT_NOEXCEPT override {}\n\n    size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override {\n        if (size < 0) {\n            size = rpnDecode(maxBatchSize, nullptr, nullptr, _height, _width, _image_height, _image_width, _stride,\n                _anchors, _top_n,\n                nullptr, 0, nullptr);\n        }\n        return size;\n    }\n\n    int enqueue(int batchSize,\n        const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n        void *workspace, cudaStream_t stream) TRT_NOEXCEPT override {\n        return rpnDecode(batchSize, inputs, outputs, _height, _width, _image_height, _image_width, _stride,\n            _anchors, _top_n, workspace, getWorkspaceSize(batchSize), stream);\n    }\n\n    void destroy() TRT_NOEXCEPT override {\n        delete this;\n    };\n\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT override {\n    }\n\n    // IPluginV2Ext Methods\n    DataType getOutputDataType(int index, const DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override {\n        assert(index < this->getNbOutputs());\n        return DataType::kFLOAT;\n    }\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n        int nbInputs) const TRT_NOEXCEPT override {\n        return false;\n    }\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override { return false; }\n\n    void configurePlugin(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs,\n        const DataType* inputTypes, const DataType* outputTypes, const bool* inputIsBroadcast,\n        const bool* 
outputIsBroadcast, PluginFormat floatFormat, int maxBatchSize) TRT_NOEXCEPT override {\n        assert(*inputTypes == nvinfer1::DataType::kFLOAT &&\n            floatFormat == nvinfer1::PluginFormat::kLINEAR);\n        assert(nbInputs == 2);\n        assert(nbOutputs == 2);\n        auto const& scores_dims = inputDims[0];\n        auto const& boxes_dims = inputDims[1];\n        assert(scores_dims.d[1] == boxes_dims.d[1]);\n        assert(scores_dims.d[2] == boxes_dims.d[2]);\n        _height = scores_dims.d[1];\n        _width = scores_dims.d[2];\n    }\n\n    IPluginV2Ext *clone() const TRT_NOEXCEPT override {\n        return new RpnDecodePlugin(_top_n, _anchors, _stride, _height, _width, _image_height, _image_width);\n    }\n\n private:\n    template<typename T> void write(char*& buffer, const T& val) const {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> void read(const char*& buffer, T& val) {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n};\n\nclass RpnDecodePluginCreator : public IPluginCreator {\n public:\n    RpnDecodePluginCreator() {}\n\n    const char *getPluginName() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n\n    IPluginV2 *deserializePlugin(const char *name, const void *serialData, size_t serialLength) TRT_NOEXCEPT override {\n        return new RpnDecodePlugin(serialData, serialLength);\n    }\n\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT override {}\n    const PluginFieldCollection *getFieldNames() TRT_NOEXCEPT override { return nullptr; }\n    IPluginV2 *createPlugin(const char *name, const PluginFieldCollection *fc) TRT_NOEXCEPT override { return nullptr; 
}\n};\n\nREGISTER_TENSORRT_PLUGIN(RpnDecodePluginCreator);\n\n}  // namespace nvinfer1\n\n#undef PLUGIN_NAME\n#undef PLUGIN_VERSION\n#undef PLUGIN_NAMESPACE\n"
  },
  {
    "path": "rcnn/RpnNms.cu",
    "content": "#include <cuda.h>\n#include <thrust/device_ptr.h>\n#include <thrust/gather.h>\n\n#include <algorithm>\n#include <iostream>\n#include <stdexcept>\n#include <cstdint>\n#include <vector>\n#include <cmath>\n\n#include \"RpnNmsPlugin.h\"\n#include \"./cuda_utils.h\"\n#include \"macros.h\"\n\n#ifdef CUDA_11\n#include <cub/device/device_radix_sort.cuh>\n#include <cub/iterator/counting_input_iterator.cuh>\n#else\n#include <thrust/system/cuda/detail/cub/device/device_radix_sort.cuh>\n#include <thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh>\nnamespace cub = thrust::cuda_cub::cub;\n#endif\n\nnamespace nvinfer1 {\n\n    __global__ void rpn_nms_kernel(\n        const float threshold, const int num_detections,\n        const int *indices, float *scores, const float4 *boxes) {\n        // Go through detections by descending score\n        for (int m = 0; m < num_detections; m++) {\n            int i = blockIdx.x * blockDim.x + threadIdx.x;\n            if (i < num_detections && m < i && scores[m] > -FLT_MAX) {\n                int idx = indices[i];\n                int max_idx = indices[m];\n\n                float4 ibox = boxes[idx];\n                float4 mbox = boxes[max_idx];\n                float x1 = max(ibox.x, mbox.x);\n                float y1 = max(ibox.y, mbox.y);\n                float x2 = min(ibox.z, mbox.z);\n                float y2 = min(ibox.w, mbox.w);\n                float w = max(0.0f, x2 - x1);\n                float h = max(0.0f, y2 - y1);\n                float iarea = (ibox.z - ibox.x) * (ibox.w - ibox.y);\n                float marea = (mbox.z - mbox.x) * (mbox.w - mbox.y);\n                float inter = w * h;\n                float overlap = inter / (iarea + marea - inter);\n                if (overlap > threshold) {\n                    scores[i] = -FLT_MAX;\n                }\n            }\n\n            // Sync discarded detections\n            __syncthreads();\n        }\n    }\n\n    int rpnNms(int 
batch_size,\n        const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n        size_t pre_nms_topk, int post_nms_topk, float nms_thresh,\n        void *workspace, size_t workspace_size, cudaStream_t stream) {\n        if (!workspace || !workspace_size) {\n            // Return required scratch space size cub style\n            workspace_size += get_size_aligned<int>(pre_nms_topk);   // indices\n            workspace_size += get_size_aligned<int>(pre_nms_topk);   // indices_sorted\n            workspace_size += get_size_aligned<float>(pre_nms_topk);  // scores\n            workspace_size += get_size_aligned<float>(pre_nms_topk);  // scores_sorted\n\n            size_t temp_size_sort = 0;\n            cub::DeviceRadixSort::SortPairsDescending(\n                static_cast<void*>(nullptr), temp_size_sort,\n                static_cast<float*>(nullptr),\n                static_cast<float*>(nullptr),\n                static_cast<int*>(nullptr),\n                static_cast<int*>(nullptr), pre_nms_topk);\n            workspace_size += temp_size_sort;\n\n            return workspace_size;\n        }\n\n        auto on_stream = thrust::cuda::par.on(stream);\n\n        auto indices = get_next_ptr<int>(pre_nms_topk, workspace, workspace_size);\n        std::vector<int> indices_h(pre_nms_topk);\n        for (int i = 0; i < pre_nms_topk; i++)\n            indices_h[i] = i;\n        cudaMemcpyAsync(indices, indices_h.data(), pre_nms_topk * sizeof * indices, cudaMemcpyHostToDevice, stream);\n        auto indices_sorted = get_next_ptr<int>(pre_nms_topk, workspace, workspace_size);\n        auto scores = get_next_ptr<float>(pre_nms_topk, workspace, workspace_size);\n        auto scores_sorted = get_next_ptr<float>(pre_nms_topk, workspace, workspace_size);\n\n        for (int batch = 0; batch < batch_size; batch++) {\n            auto in_scores = static_cast<const float *>(inputs[0]) + batch * pre_nms_topk;\n            auto in_boxes = static_cast<const float4 
*>(inputs[1]) + batch * pre_nms_topk;\n\n            auto out_boxes = static_cast<float4 *>(outputs[0]) + batch * post_nms_topk;\n\n            int num_detections = pre_nms_topk;\n            cub::DeviceRadixSort::SortPairsDescending(workspace, workspace_size,\n                in_scores, scores_sorted, indices, indices_sorted, num_detections, 0,\n                sizeof(*scores_sorted) * 8, stream);\n\n            // Launch the NMS kernel with one thread per detection, in blocks of max_threads\n            // TODO: different devices have different max threads per block\n            const int max_threads = 1024;\n            int num_per_thread = ceil(static_cast<float>(num_detections) / max_threads);\n            rpn_nms_kernel << <num_per_thread, max_threads, 0, stream >> > (nms_thresh, num_detections,\n                indices_sorted, scores_sorted, in_boxes);\n\n            // Re-sort with updated scores\n            cub::DeviceRadixSort::SortPairsDescending(workspace, workspace_size,\n                scores_sorted, scores, indices_sorted, indices, num_detections, 0, sizeof(*scores_sorted) * 8, stream);\n\n            // Gather filtered scores, boxes, classes\n            num_detections = min(post_nms_topk, num_detections);\n            thrust::gather(on_stream, indices, indices + num_detections, in_boxes, out_boxes);\n        }\n\n        return 0;\n    }\n}  // namespace nvinfer1\n"
  },
  {
    "path": "rcnn/RpnNmsPlugin.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n\n#include <vector>\n#include <cassert>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\n#define PLUGIN_NAME \"RpnNms\"\n#define PLUGIN_VERSION \"1\"\n#define PLUGIN_NAMESPACE \"\"\n\nnamespace nvinfer1 {\n\nint rpnNms(int batchSize,\n    const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n    size_t pre_nms_topk, int post_nms_topk, float nms_thresh,\n    void *workspace, size_t workspace_size, cudaStream_t stream);\n\n/*\n    input1: scores{C, 1} C->pre_nms_topk\n    input2: boxes{C, 4} C->pre_nms_topk format:XYXY\n    output1: boxes{C, 4} C->post_nms_topk format:XYXY\n    Description: implement rpn nms\n*/\nclass RpnNmsPlugin : public IPluginV2Ext {\n    float _nms_thresh;\n    int _post_nms_topk;\n\n    size_t _pre_nms_topk = 1;\n    mutable int size = -1;\n\n protected:\n    void deserialize(void const* data, size_t length) {\n        const char* d = static_cast<const char*>(data);\n        read(d, _nms_thresh);\n        read(d, _post_nms_topk);\n        read(d, _pre_nms_topk);\n    }\n\n    size_t getSerializationSize() const TRT_NOEXCEPT override {\n        return sizeof(_nms_thresh) + sizeof(_post_nms_topk)\n            + sizeof(_pre_nms_topk);\n    }\n\n    void serialize(void *buffer) const TRT_NOEXCEPT override {\n        char* d = static_cast<char*>(buffer);\n        write(d, _nms_thresh);\n        write(d, _post_nms_topk);\n        write(d, _pre_nms_topk);\n    }\n\n public:\n    RpnNmsPlugin(float nms_thresh, int post_nms_topk)\n        : _nms_thresh(nms_thresh), _post_nms_topk(post_nms_topk) {\n        assert(nms_thresh > 0);\n        assert(post_nms_topk > 0);\n    }\n\n    RpnNmsPlugin(float nms_thresh, int post_nms_topk, size_t pre_nms_topk)\n        : _nms_thresh(nms_thresh), _post_nms_topk(post_nms_topk), _pre_nms_topk(pre_nms_topk) {\n        assert(nms_thresh > 0);\n        assert(post_nms_topk > 0);\n        assert(pre_nms_topk > 0);\n    }\n\n    RpnNmsPlugin(void const* 
data, size_t length) {\n        this->deserialize(data, length);\n    }\n\n    const char *getPluginType() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n\n    int getNbOutputs() const TRT_NOEXCEPT override {\n        return 1;\n    }\n\n    Dims getOutputDimensions(int index,\n        const Dims *inputs, int nbInputDims) TRT_NOEXCEPT override {\n        assert(nbInputDims == 2);\n        assert(index < this->getNbOutputs());\n        return Dims2(_post_nms_topk, 4);\n    }\n\n    bool supportsFormat(DataType type, PluginFormat format) const TRT_NOEXCEPT override {\n        return type == DataType::kFLOAT && format == PluginFormat::kLINEAR;\n    }\n\n    int initialize() TRT_NOEXCEPT override { return 0; }\n\n    void terminate() TRT_NOEXCEPT override {}\n\n    size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override {\n        if (size < 0) {\n            size = rpnNms(maxBatchSize, nullptr, nullptr, _pre_nms_topk,\n                _post_nms_topk, _nms_thresh,\n                nullptr, 0, nullptr);\n        }\n        return size;\n    }\n\n    int enqueue(int batchSize,\n        const void *const *inputs, void *TRT_CONST_ENQUEUE*outputs,\n        void *workspace, cudaStream_t stream) TRT_NOEXCEPT override {\n        return rpnNms(batchSize, inputs, outputs, _pre_nms_topk,\n            _post_nms_topk, _nms_thresh,\n            workspace, getWorkspaceSize(batchSize), stream);\n    }\n\n    void destroy() TRT_NOEXCEPT override {\n        delete this;\n    }\n\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n\n    void setPluginNamespace(const char *N) TRT_NOEXCEPT override {\n    }\n\n    // IPluginV2Ext Methods\n    DataType getOutputDataType(int index, const DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override {\n        assert(index < 1);\n        return 
DataType::kFLOAT;\n    }\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n        int nbInputs) const TRT_NOEXCEPT override {\n        return false;\n    }\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override { return false; }\n\n    void configurePlugin(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs,\n        const DataType* inputTypes, const DataType* outputTypes, const bool* inputIsBroadcast,\n        const bool* outputIsBroadcast, PluginFormat floatFormat, int maxBatchSize) TRT_NOEXCEPT override {\n        assert(*inputTypes == nvinfer1::DataType::kFLOAT &&\n            floatFormat == nvinfer1::PluginFormat::kLINEAR);\n        assert(nbInputs == 2);\n        assert(inputDims[0].d[0] == inputDims[1].d[0]);\n        _pre_nms_topk = inputDims[0].d[0];\n    }\n\n    IPluginV2Ext *clone() const TRT_NOEXCEPT override {\n        return new RpnNmsPlugin(_nms_thresh, _post_nms_topk, _pre_nms_topk);\n    }\n\n private:\n    template<typename T> void write(char*& buffer, const T& val) const {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> void read(const char*& buffer, T& val) {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n};\n\nclass RpnNmsPluginCreator : public IPluginCreator {\n public:\n    RpnNmsPluginCreator() {}\n\n    const char *getPluginNamespace() const TRT_NOEXCEPT override {\n        return PLUGIN_NAMESPACE;\n    }\n    const char *getPluginName() const TRT_NOEXCEPT override {\n        return PLUGIN_NAME;\n    }\n\n    const char *getPluginVersion() const TRT_NOEXCEPT override {\n        return PLUGIN_VERSION;\n    }\n\n    IPluginV2 *deserializePlugin(const char *name, const void *serialData, size_t serialLength) TRT_NOEXCEPT override {\n        return new RpnNmsPlugin(serialData, serialLength);\n    }\n\n    void setPluginNamespace(const char *N) 
TRT_NOEXCEPT override {}\n    const PluginFieldCollection *getFieldNames() TRT_NOEXCEPT override { return nullptr; }\n    IPluginV2 *createPlugin(const char *name, const PluginFieldCollection *fc) TRT_NOEXCEPT override { return nullptr; }\n};\n\nREGISTER_TENSORRT_PLUGIN(RpnNmsPluginCreator);\n\n}  // namespace nvinfer1\n\n#undef PLUGIN_NAME\n#undef PLUGIN_VERSION\n#undef PLUGIN_NAMESPACE\n"
  },
  {
    "path": "rcnn/backbone.hpp",
    "content": "#pragma once\n#include <vector>\n#include <map>\n#include <string>\n#include \"common.hpp\"\n\n/* when stride>1, whether to put stride in the first 1x1 convolution or the bottleneck 3x3 convolution.\nset false when use backbone from torchvision*/\n#define STRIDE_IN_1X1 true\n\nenum RESNETTYPE {\n    R18 = 0,\n    R34,\n    R50,\n    R101,\n    R152\n};\n\nconst std::map<RESNETTYPE, std::vector<int>> num_blocks_per_stage = {\n    {R18, {2, 2, 2, 2}},\n    {R34, {3, 4, 6, 3}},\n    {R50, {3, 4, 6, 3}},\n    {R101, {3, 4, 23, 3}},\n    {R152, {3, 8, 36, 3}}\n};\n\nILayer* BasicStem(INetworkDefinition *network,\nstd::map<std::string, Weights>& weightMap,\nconst std::string& lname, ITensor& input,\nint out_channels,\nint group_num = 1) {\n    // conv1\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, out_channels, DimsHW{ 7, 7 },\n    weightMap[lname + \".conv1.weight\"],\n    weightMap[lname + \".conv1.bias\"]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ 2, 2 });\n    conv1->setPaddingNd(DimsHW{ 3, 3 });\n    conv1->setNbGroups(group_num);\n\n    auto r1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    assert(r1);\n\n    auto max_pool2d = network->addPoolingNd(*r1->getOutput(0), PoolingType::kMAX, DimsHW{ 3, 3 });\n    max_pool2d->setStrideNd(DimsHW{ 2, 2 });\n    max_pool2d->setPaddingNd(DimsHW{ 1, 1 });\n    // auto mp_dim = max_pool2d->getOutput(0)->getDimensions();\n    return max_pool2d;\n}\n\nITensor* BasicBlock(INetworkDefinition *network,\nstd::map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& input,\nint in_channels,\nint out_channels,\nint stride = 1) {\n    // conv1\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, out_channels, DimsHW{ 3, 3 },\n    weightMap[lname + \".conv1.weight\"],\n    weightMap[lname + \".conv1.bias\"]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ stride, stride });\n    conv1->setPaddingNd(DimsHW{ 1, 1 });\n\n    auto r1 
= network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    assert(r1);\n\n    // conv2\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*r1->getOutput(0), out_channels, DimsHW{ 3, 3 },\n    weightMap[lname + \".conv2.weight\"],\n    weightMap[lname + \".conv2.bias\"]);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{ 1, 1 });\n    conv2->setPaddingNd(DimsHW{ 1, 1 });\n\n    // shortcut\n    ITensor* shortcut_value = nullptr;\n    if (in_channels != out_channels) {\n        auto shortcut = network->addConvolutionNd(input, out_channels, DimsHW{ 1, 1 },\n        weightMap[lname + \".shortcut.weight\"],\n        weightMap[lname + \".shortcut.bias\"]);\n        assert(shortcut);\n        shortcut->setStrideNd(DimsHW{ stride, stride });\n        shortcut_value = shortcut->getOutput(0);\n    } else {\n        shortcut_value = &input;\n    }\n\n    // add\n    auto ew = network->addElementWise(*conv2->getOutput(0), *shortcut_value, ElementWiseOperation::kSUM);\n    assert(ew);\n\n    auto r3 = network->addActivation(*ew->getOutput(0), ActivationType::kRELU);\n    assert(r3);\n\n    return r3->getOutput(0);\n}\n\nITensor* BottleneckBlock(INetworkDefinition *network,\nstd::map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& input,\nint in_channels,\nint bottleneck_channels,\nint out_channels,\nint stride = 1,\nint dilation = 1,\nint group_num = 1) {\n    int stride_1x1 = STRIDE_IN_1X1 ? stride : 1;\n    int stride_3x3 = STRIDE_IN_1X1 ? 
1 : stride;\n    // conv1\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, bottleneck_channels, DimsHW{ 1, 1 },\n    weightMap[lname + \".conv1.weight\"],\n    weightMap[lname + \".conv1.bias\"]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ stride_1x1, stride_1x1 });\n    conv1->setNbGroups(group_num);\n\n    auto r1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    assert(r1);\n\n    // conv2\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*r1->getOutput(0), bottleneck_channels, DimsHW{ 3, 3 },\n    weightMap[lname + \".conv2.weight\"],\n    weightMap[lname + \".conv2.bias\"]);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{ stride_3x3, stride_3x3 });\n    conv2->setPaddingNd(DimsHW{ 1 * dilation, 1 * dilation });\n    conv2->setDilationNd(DimsHW{ dilation, dilation });\n    conv2->setNbGroups(group_num);\n\n    auto r2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);\n    assert(r2);\n\n    // conv3\n    IConvolutionLayer* conv3 = network->addConvolutionNd(*r2->getOutput(0), out_channels, DimsHW{ 1, 1 },\n    weightMap[lname + \".conv3.weight\"],\n    weightMap[lname + \".conv3.bias\"]);\n    assert(conv3);\n    conv3->setStrideNd(DimsHW{ 1, 1 });\n    conv3->setNbGroups(group_num);\n\n    // shortcut\n    ITensor* shortcut_value = nullptr;\n    if (in_channels != out_channels) {\n        auto shortcut = network->addConvolutionNd(input, out_channels, DimsHW{ 1, 1 },\n        weightMap[lname + \".shortcut.weight\"],\n        weightMap[lname + \".shortcut.bias\"]);\n        assert(shortcut);\n        shortcut->setStrideNd(DimsHW{stride, stride});\n        shortcut->setNbGroups(group_num);\n        shortcut_value = shortcut->getOutput(0);\n    } else {\n        shortcut_value = &input;\n    }\n\n    // add\n    auto ew = network->addElementWise(*conv3->getOutput(0), *shortcut_value, ElementWiseOperation::kSUM);\n    assert(ew);\n\n    auto r3 = 
network->addActivation(*ew->getOutput(0), ActivationType::kRELU);\n    assert(r3);\n\n    return r3->getOutput(0);\n}\n\nITensor* MakeStage(INetworkDefinition *network,\nstd::map<std::string, Weights>& weightMap,\nconst std::string& lname,\nITensor& input,\nint stage,\nRESNETTYPE resnet_type,\nint in_channels,\nint bottleneck_channels,\nint out_channels,\nint first_stride = 1,\nint dilation = 1) {\n    ITensor* out = &input;\n    for (int i = 0; i < stage; i++) {\n        std::string layerName = lname + \".\" + std::to_string(i);\n        int stride = i == 0 ? first_stride : 1;\n\n        if (resnet_type == R18 || resnet_type == R34)\n            out = BasicBlock(network, weightMap, layerName, *out, in_channels, out_channels, stride);\n        else\n            out = BottleneckBlock(network, weightMap, layerName, *out,\n            in_channels, bottleneck_channels, out_channels, stride, dilation);\n\n        in_channels = out_channels;\n    }\n    return out;\n}\n\nITensor* BuildResNet(INetworkDefinition *network,\nstd::map<std::string, Weights>& weightMap,\nITensor& input,\nRESNETTYPE resnet_type,\nint stem_out_channels,\nint bottleneck_channels,\nint res2_out_channels,\nint res5_dilation = 1) {\n    assert(res5_dilation == 1 || res5_dilation == 2);  // \"res5_dilation must be 1 or 2\"\n    if (resnet_type == R18 || resnet_type == R34) {\n        assert(res2_out_channels == 64);  // \"res2_out_channels must be 64 for R18/R34\"\n        assert(res5_dilation == 1);  // \"res5_dilation must be 1 for R18/R34\"\n    }\n\n    int out_channels = res2_out_channels;\n    ITensor* out = nullptr;\n    // stem\n    auto stem = BasicStem(network, weightMap, \"backbone.stem\", input, stem_out_channels);\n    out = stem->getOutput(0);\n\n    // res\n    for (int i = 0; i < 3; i++) {\n        int dilation = (i == 3) ? res5_dilation : 1;\n        int first_stride = (i == 0 || (i == 3 && dilation == 2)) ? 
1 : 2;\n        out = MakeStage(network, weightMap,\n        \"backbone.res\" + std::to_string(i + 2), *out,\n        num_blocks_per_stage.at(resnet_type)[i], resnet_type,\n        stem_out_channels, bottleneck_channels, out_channels,\n        first_stride, dilation);\n        stem_out_channels = out_channels;\n        bottleneck_channels *= 2;\n        out_channels *= 2;\n    }\n    return out;\n}\n"
  },
  {
    "path": "rcnn/calibrator.hpp",
    "content": "#pragma once\n\n#include \"NvInfer.h\"\n#include <string>\n#include <vector>\n#include <iostream>\n#include <iterator>\n#include <fstream>\n#include <algorithm>\n#include \"./cuda_utils.h\"\n#include \"common.hpp\"\n#include \"macros.h\"\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2 {\n public:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h,\n    const char* img_dir, const char* calib_table_name,\n    const char* input_blob_name, bool read_cache = true);\n\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const  TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\n private:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize,\nint input_w, int input_h, const char* img_dir,\nconst char* calib_table_name, const char* input_blob_name,\nbool read_cache)\n    : batchsize_(batchsize)\n    , input_w_(input_w)\n    , input_h_(input_h)\n    , img_idx_(0)\n    , img_dir_(img_dir)\n    , calib_table_name_(calib_table_name)\n    , input_blob_name_(input_blob_name)\n    , read_cache_(read_cache) {\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, 
img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2() {\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT {\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT {\n    if (img_idx_ + batchsize_ > static_cast<int>(img_files_.size())) {\n        return false;\n    }\n\n    std::vector<float> input_imgs_(input_count_, 0);\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + img_files_[i]);\n        // Check the image before preprocessing: preprocessImg() cannot handle an empty cv::Mat\n        if (temp.empty()) {\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        int X_LEFT_PAD = 0;\n        int X_RIGHT_PAD = 0;\n        int Y_TOP_PAD = 0;\n        int Y_BOTTOM_PAD = 0;\n        temp = preprocessImg(temp, input_w_, input_h_, X_LEFT_PAD, X_RIGHT_PAD, Y_TOP_PAD, Y_BOTTOM_PAD);\n\n        for (int ind = 0; ind < input_w_*input_h_*3; ind++)\n            input_imgs_[(i-img_idx_)*input_w_*input_h_*3 + ind] = static_cast<float>(*(temp.data + ind));\n    }\n    img_idx_ += batchsize_;\n\n    CUDA_CHECK(cudaMemcpy(device_input_, input_imgs_.data(), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT {\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good()) {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? 
calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length)  TRT_NOEXCEPT {\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n"
  },
  {
    "path": "rcnn/common.hpp",
    "content": "#pragma once\n\n#include <NvInfer.h>\n#include <cuda_runtime_api.h>\n#include <assert.h>\n#include <dirent.h>\n\n#include <fstream>\n#include <sstream>\n#include <iostream>\n#include <string>\n#include <vector>\n#include <map>\n#include <algorithm>\n\n#include <opencv2/opencv.hpp>\n#include \"./logging.h\"\n#include \"./cuda_utils.h\"\n\nstatic Logger gLogger;\n\nusing namespace nvinfer1;\n\nvoid loadWeights(const std::string file, std::map<std::string, Weights>& weightMap) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        Weights wt{ DataType::kFLOAT, nullptr, 0 };\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n}\n\nstatic inline int read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            // std::string cur_file_name(p_dir_name);\n            // cur_file_name += \"/\";\n            // cur_file_name += p_file->d_name;\n            std::string 
cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\nstatic inline cv::Mat preprocessImg(cv::Mat& img, int input_w, int input_h, int& X_LEFT_PAD, int& X_RIGHT_PAD, int& Y_TOP_PAD, int& Y_BOTTOM_PAD) {\n    int w, h;\n    float x, y;\n    float r_w = input_w / (img.cols*1.0);\n    float r_h = input_h / (img.rows*1.0);\n\n    // this code can also support left-right and top-bottom padding if you need\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0.0;\n        y = (input_h - h) / 2.f;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2.f;\n        y = 0.0;\n    }\n\n    // support both odd and even cases\n    X_LEFT_PAD = (int)(round(x - 0.1));\n    X_RIGHT_PAD = (int)(round(x + 0.1));\n    Y_TOP_PAD = (int)(round(y - 0.1));\n    Y_BOTTOM_PAD = (int)(round(y + 0.1));\n\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(X_LEFT_PAD, Y_TOP_PAD, re.cols, re.rows)));\n\n    return out;\n}"
  },
  {
    "path": "rcnn/cuda_utils.h",
    "content": "#pragma once\n\n#include <cuda_runtime_api.h>\n#include <stdexcept>\n#include <cstdint>\n\n#define CUDA_ALIGN 256\n\ntemplate <typename T>\ninline size_t get_size_aligned(size_t num_elem) {\n    size_t size = num_elem * sizeof(T);\n    size_t extra_align = 0;\n    if (size % CUDA_ALIGN != 0) {\n        extra_align = CUDA_ALIGN - size % CUDA_ALIGN;\n    }\n    return size + extra_align;\n}\n\ntemplate <typename T>\ninline T *get_next_ptr(size_t num_elem, void *&workspace, size_t &workspace_size) {\n    size_t size = get_size_aligned<T>(num_elem);\n    if (size > workspace_size) {\n        throw std::runtime_error(\"Workspace is too small!\");\n    }\n    workspace_size -= size;\n    T *ptr = reinterpret_cast<T *>(workspace);\n    workspace = reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(workspace) + size);\n    return ptr;\n}\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)\\\n    {\\\n        cudaError_t error_code = callstr;\\\n        if (error_code != cudaSuccess) {\\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__;\\\n            assert(0);\\\n        }\\\n    }\n#endif  // CUDA_CHECK\n"
  },
  {
    "path": "rcnn/gen_wts.py",
    "content": "from detectron2.layers import Conv2d\nfrom torch import nn\nimport torch\nimport numpy as np\nimport struct\ndef fuse_conv_and_bn(conv):\n    # Fuse convolution and batchnorm layers https://tehnokv.com/posts/fusing-batchnorm-and-conv/\n    bn = conv.norm\n    # init\n    fusedconv = nn.Conv2d(conv.in_channels,\n                          conv.out_channels,\n                          kernel_size=conv.kernel_size,\n                          stride=conv.stride,\n                          padding=conv.padding,\n                          groups=conv.groups,\n                          bias=True).requires_grad_(False).to(conv.weight.device)\n\n    # prepare filters\n    w_conv = conv.weight.clone().view(conv.out_channels, -1)\n    w_bn = torch.diag(bn.weight.div(torch.sqrt(bn.eps + bn.running_var)))\n    fusedconv.weight.copy_(torch.mm(w_bn, w_conv).view(fusedconv.weight.size()))\n\n    # prepare spatial bias\n    b_conv = torch.zeros(conv.weight.size(0), device=conv.weight.device) if conv.bias is None else conv.bias\n    b_bn = bn.bias - bn.weight.mul(bn.running_mean).div(torch.sqrt(bn.running_var + bn.eps))\n    fusedconv.bias.copy_(torch.mm(w_bn, b_conv.reshape(-1, 1)).reshape(-1) + b_bn)\n\n    return fusedconv\n\ndef fuse_bn(model):\n    for child_name, child in model.named_children():\n        if isinstance(child, Conv2d) and child.norm is not None:\n            setattr(model, child_name, fuse_conv_and_bn(child))\n        else:\n            fuse_bn(child)\n\ndef gen_wts(model, filename):\n    f = open('./' + filename + '.wts', 'w')\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f',float(vv)).hex())\n        f.write('\\n')\n    f.close()\n\n# construct model\nfrom detectron2.config import get_cfg\nfrom detectron2.modeling 
import build_model\nfrom detectron2.checkpoint import DetectionCheckpointer\ncfg = get_cfg()\ncfg.merge_from_file('./configs/COCO-Detection/faster_rcnn_R_50_C4_1x.yaml')\ncfg.MODEL.WEIGHTS = './model_final_721ade.pkl'\ncfg.MODEL.DEVICE = 'cpu'\nmodel = build_model(cfg)\nDetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)\nmodel.eval()\nfuse_bn(model)\ngen_wts(model, 'faster')\n\n# test data\n# from detectron2.data.detection_utils import read_image\n# from detectron2.data import transforms as T\n# import cv2\n# original_image = cv2.imread('./demo.jpg')\n# original_image = original_image.astype('float32')\n\n# transform_gen = T.ResizeShortestEdge(\n#             [cfg.INPUT.MIN_SIZE_TEST, cfg.INPUT.MIN_SIZE_TEST], cfg.INPUT.MAX_SIZE_TEST\n#         )\n# height, width = original_image.shape[:2]\n\n# image = transform_gen.get_transform(original_image).apply_image(original_image)\n# image = torch.as_tensor(image.astype(\"float32\").transpose(2, 0, 1))\n\n# # model test\n# inputs = {\"image\": image, \"height\": height, \"width\": width}\n\n# with torch.no_grad():\n#     predictions = model([inputs])[0]\n# print (predictions)\n"
  },
  {
    "path": "rcnn/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include \"macros.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the 
buffer and flushing the stream\n    virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) {\n        mShouldLog = shouldLog;\n    }\n\n private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer)  // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer)  // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger {\n public:\n    explicit Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer1::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n     public:\n        TestAtom(TestAtom&&) = default;\n\n     private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const {\n        return mReportableSeverity;\n    }\n\n private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "rcnn/macros.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n#include <cuda.h>\n\n#if CUDA_VERSION >=11000\n#define CUDA_11\n#endif\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n"
  },
  {
    "path": "rcnn/rcnn.cpp",
    "content": "#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"backbone.hpp\"\n#include \"RpnDecodePlugin.h\"\n#include \"RpnNmsPlugin.h\"\n#include \"RoiAlignPlugin.h\"\n#include \"PredictorDecodePlugin.h\"\n#include \"BatchedNmsPlugin.h\"\n#include \"MaskRcnnInferencePlugin.h\"\n#include \"calibrator.hpp\"\n\n#define DEVICE 0\n#define BATCH_SIZE 1\n#define BACKBONE_RESNETTYPE R50\n// data\nstatic const std::vector<float> PIXEL_MEAN = { 103.53, 116.28, 123.675 };\nstatic const std::vector<float> PIXEL_STD = {1.0, 1.0, 1.0};\nstatic constexpr float MIN_SIZE = 800.0;\nstatic constexpr float MAX_SIZE = 1333.0;\nstatic constexpr int NUM_CLASSES = 80;\nstatic int INPUT_H;  // size of model input\nstatic int INPUT_W;\nstatic constexpr int INPUT_H_ = 480;  // size of original image, you can change it to arbitrary size\nstatic constexpr int INPUT_W_ = 640;\nstatic int X_LEFT_PAD;  // pad in preprocessImg\nstatic int X_RIGHT_PAD;\nstatic int Y_TOP_PAD;\nstatic int Y_BOTTOM_PAD;\nstatic int h_ori;  // used when h_ori is not equal to INPUT_H_\nstatic int w_ori;\n// backbone\nstatic const int RES2_OUT_CHANNELS = (BACKBONE_RESNETTYPE == R18 ||\nBACKBONE_RESNETTYPE == R34) ? 
64 : 256;\n// rpn\nstatic const std::vector<float> ANCHOR_SIZES = { 32, 64, 128, 256, 512 };\nstatic const std::vector<float> ASPECT_RATIOS = { 0.5, 1.0, 2.0 };\nstatic constexpr int PRE_NMS_TOP_K_TEST = 6000;\nstatic constexpr float RPN_NMS_THRESH = 0.7;\nstatic constexpr int POST_NMS_TOPK = 1000;\n// roialign\nstatic constexpr int STRIDES = 16;\nstatic constexpr int SAMPLING_RATIO = 0;\nstatic constexpr int POOLER_RESOLUTION = 14;\n// roihead\nstatic constexpr float NMS_THRESH_TEST = 0.5;\nstatic constexpr int DETECTIONS_PER_IMAGE = 100;\nstatic constexpr float SCORE_THRESH = 0.6;\nstatic const std::vector<float> BBOX_REG_WEIGHTS = { 10.0, 10.0, 5.0, 5.0 };\nstatic bool MASK_ON = false;\n\nstatic const char* INPUT_NODE_NAME = \"images\";\nstatic const std::vector<std::string> OUTPUT_NAMES = { \"scores\", \"boxes\",\n\"labels\", \"masks\" };\n\n//nms methods selection in the second stage\n// 0: original nms\n// 1: soft-nms (linear)\n// 2: soft-nms (gaussian) \nstatic int NMS_METHOD = 1;\nstatic std::vector<int> NMS_METHOD_VEC = {0, 1, 2};\n\nstd::vector<float> GenerateAnchors(const std::vector<float>& anchor_sizes,\nconst std::vector<float>& aspect_ratios) {\n    std::vector<float> res;\n    for (auto as : anchor_sizes) {\n        float area = as * as;\n        for (auto ar : aspect_ratios) {\n            float w = sqrt(area / ar);\n            float h = ar * w;\n            res.push_back(-w / 2.0);\n            res.push_back(-h / 2.0);\n            res.push_back(w / 2.0);\n            res.push_back(h / 2.0);\n        }\n    }\n    return res;\n}\n\n// transpose && resize && normalization && padding\nITensor* DataPreprocess(INetworkDefinition *network, ITensor& input) {\n\n    // HWC->CHW\n    auto channel_permute = network->addShuffle(input);\n    assert(channel_permute);\n    channel_permute->setFirstTranspose(Permutation{ 2, 0, 1 });\n\n    // sub pixel mean\n    auto pixel_mean = network->addConstant(Dims3{ 3, 1, 1 },\n    Weights{ DataType::kFLOAT, 
PIXEL_MEAN.data(), 3 });\n    assert(pixel_mean);\n    auto sub = network->addElementWise(*channel_permute->getOutput(0),\n    *pixel_mean->getOutput(0), ElementWiseOperation::kSUB);\n    assert(sub);\n    auto pixel_std = network->addConstant(Dims3{ 3, 1, 1 }, Weights{DataType::kFLOAT, PIXEL_STD.data(), 3});\n    assert(pixel_std);\n    auto div = network->addElementWise(*sub->getOutput(0), *pixel_std->getOutput(0), ElementWiseOperation::kDIV);\n    assert(div);\n\n    return div->getOutput(0);\n}\n\nITensor* RPN(INetworkDefinition *network,\nstd::map<std::string, Weights>& weightMap, ITensor& features) {\n    int num_anchors = ANCHOR_SIZES.size() * ASPECT_RATIOS.size();\n    int box_dim = 4;\n\n    // rpn head conv\n    auto rpn_head_conv = network->addConvolutionNd(features, features.getDimensions().d[0], DimsHW{ 3, 3 },\n    weightMap[\"proposal_generator.rpn_head.conv.weight\"],\n    weightMap[\"proposal_generator.rpn_head.conv.bias\"]);\n    assert(rpn_head_conv);\n    rpn_head_conv->setStrideNd(DimsHW{ 1, 1 });\n    rpn_head_conv->setPaddingNd(DimsHW{ 1, 1 });\n    auto rpn_head_relu = network->addActivation(*rpn_head_conv->getOutput(0), ActivationType::kRELU);\n    assert(rpn_head_relu);\n\n    // objectness logits\n    auto rpn_head_logits = network->addConvolutionNd(*rpn_head_relu->getOutput(0), num_anchors, DimsHW{ 1, 1 },\n    weightMap[\"proposal_generator.rpn_head.objectness_logits.weight\"],\n    weightMap[\"proposal_generator.rpn_head.objectness_logits.bias\"]);\n    assert(rpn_head_logits);\n    rpn_head_logits->setStrideNd(DimsHW{ 1, 1 });\n\n    // anchor deltas\n    auto rpn_head_deltas = network->addConvolutionNd(*rpn_head_relu->getOutput(0), num_anchors * box_dim,\n    DimsHW{ 1, 1 },\n    weightMap[\"proposal_generator.rpn_head.anchor_deltas.weight\"],\n    weightMap[\"proposal_generator.rpn_head.anchor_deltas.bias\"]);\n    assert(rpn_head_deltas);\n    auto rpn_head_deltas_dim = rpn_head_deltas->getOutput(0)->getDimensions();\n    
rpn_head_deltas->setStrideNd(DimsHW{ 1, 1 });\n\n    auto anchors = GenerateAnchors(ANCHOR_SIZES, ASPECT_RATIOS);\n    auto rpnDecodePlugin = RpnDecodePlugin(PRE_NMS_TOP_K_TEST, anchors, STRIDES, INPUT_H, INPUT_W);\n    std::vector<ITensor*> faster_decode_inputs = { rpn_head_logits->getOutput(0), rpn_head_deltas->getOutput(0) };\n    auto rpnDecodeLayer = network->addPluginV2(faster_decode_inputs.data(), faster_decode_inputs.size(),\n    rpnDecodePlugin);\n\n    std::vector<ITensor*> nms_input = { rpnDecodeLayer->getOutput(0), rpnDecodeLayer->getOutput(1) };\n\n    // nms\n    auto nmsPlugin = RpnNmsPlugin(RPN_NMS_THRESH, POST_NMS_TOPK);\n    auto nmsLayer = network->addPluginV2(nms_input.data(), nms_input.size(), nmsPlugin);\n    return nmsLayer->getOutput(0);\n}\n\nITensor* SharedRoiTransform(INetworkDefinition *network, std::map<std::string, Weights>& weightMap,\nITensor* proposals, ITensor* features, int num_proposals) {\n    std::vector<ITensor*> roi_inputs = { proposals, features };\n    auto roiAlignPlugin = RoiAlignPlugin(POOLER_RESOLUTION, 1 / static_cast<float>(STRIDES),\n    SAMPLING_RATIO, num_proposals, features->getDimensions().d[0]);\n    auto roiAlignLayer = network->addPluginV2(roi_inputs.data(), roi_inputs.size(), roiAlignPlugin);\n\n    // res5\n    /* same with https://github.com/facebookresearch/detectron2/\n    blob/9246ebc3af1c023cfbdae77e5d976edbcf9a2933/detectron2/modeling/roi_heads/roi_heads.py#L430,\n    use bottleneck here, so pass R50*/\n    auto box_features = MakeStage(network, weightMap, \"roi_heads.res5\",\n    *roiAlignLayer->getOutput(0), 3, R50,\n    roiAlignLayer->getOutput(0)->getDimensions().d[1],\n    512, RES2_OUT_CHANNELS * 8, 2);\n    return box_features;\n}\n\nvoid BoxHead(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor* proposals,\n    ITensor* features, std::vector<ITensor*>& instances) {\n\n    auto box_features = SharedRoiTransform(network, weightMap, proposals, features, 
POST_NMS_TOPK);\n    auto box_features_mean = network->addReduce(*box_features, ReduceOperation::kAVG, 12, true);\n\n    // score\n    auto scores = network->addFullyConnected(*box_features_mean->getOutput(0), NUM_CLASSES + 1,\n    weightMap[\"roi_heads.box_predictor.cls_score.weight\"],\n    weightMap[\"roi_heads.box_predictor.cls_score.bias\"]);\n    auto probs = network->addSoftMax(*scores->getOutput(0));\n\n    auto probs_dim = probs->getOutput(0)->getDimensions();\n    auto score_slice = network->addSlice(*probs->getOutput(0), Dims4{ 0, 0, 0, 0 },\n    Dims4{ probs_dim.d[0], probs_dim.d[1] - 1, 1, 1 }, Dims4{ 1, 1, 1, 1 });\n\n    auto proposal_deltas = network->addFullyConnected(*box_features_mean->getOutput(0), NUM_CLASSES * 4,\n    weightMap[\"roi_heads.box_predictor.bbox_pred.weight\"],\n    weightMap[\"roi_heads.box_predictor.bbox_pred.bias\"]);\n\n    // decode\n    std::vector<ITensor*> predictorDecodeInput = { score_slice->getOutput(0),\n    proposal_deltas->getOutput(0), proposals };\n    auto predictorDecodePlugin = PredictorDecodePlugin(probs_dim.d[0], INPUT_H, INPUT_W, BBOX_REG_WEIGHTS);\n    auto predictorDecodeLayer = network->addPluginV2(predictorDecodeInput.data(),\n    predictorDecodeInput.size(), predictorDecodePlugin);\n\n    // nms\n    std::vector<ITensor*> nmsInput = { predictorDecodeLayer->getOutput(0),\n    predictorDecodeLayer->getOutput(1), predictorDecodeLayer->getOutput(2) };\n    auto batchedNmsPlugin = BatchedNmsPlugin(NMS_METHOD, NMS_THRESH_TEST, DETECTIONS_PER_IMAGE);\n    auto batchedNmsLayer = network->addPluginV2(nmsInput.data(), nmsInput.size(), batchedNmsPlugin);\n\n    // instances\n    instances.push_back(batchedNmsLayer->getOutput(0));\n    instances.push_back(batchedNmsLayer->getOutput(1));\n    instances.push_back(batchedNmsLayer->getOutput(2));\n}\n\nvoid MaskHead(INetworkDefinition *network, std::map<std::string, Weights>& weightMap,\n    ITensor* features, std::vector<ITensor*>& instances, int out_channels = 256) 
{\n\n    auto mask_features = SharedRoiTransform(network, weightMap, instances[1], features, DETECTIONS_PER_IMAGE);\n\n    // mask_fcn\n    auto mask_deconv = network->addDeconvolutionNd(*mask_features, out_channels, DimsHW{ 2, 2 },\n    weightMap[\"roi_heads.mask_head.deconv.weight\"],\n    weightMap[\"roi_heads.mask_head.deconv.bias\"]);\n    mask_deconv->setStrideNd(DimsHW{ 2, 2 });\n    auto deconv_relu = network->addActivation(*mask_deconv->getOutput(0), ActivationType::kRELU);\n    assert(deconv_relu);\n    auto predictor = network->addConvolutionNd(*deconv_relu->getOutput(0), NUM_CLASSES, DimsHW{ 1, 1 },\n    weightMap[\"roi_heads.mask_head.predictor.weight\"],\n    weightMap[\"roi_heads.mask_head.predictor.bias\"]);\n    predictor->setStrideNd(DimsHW{ 1, 1 });\n\n    ITensor* masks;\n    if (NUM_CLASSES == 1) {\n        auto mask_probs_pred = network->addActivation(*predictor->getOutput(0), ActivationType::kSIGMOID);\n        masks = mask_probs_pred->getOutput(0);\n    } else {\n        std::vector<ITensor*> mask_rcnn_inference_inputs = { instances[2], predictor->getOutput(0) };\n        auto maskRcnnInferencePlugin = MaskRcnnInferencePlugin(DETECTIONS_PER_IMAGE, POOLER_RESOLUTION);\n        auto maskRcnnInferenceLayer = network->addPluginV2(mask_rcnn_inference_inputs.data(),\n        mask_rcnn_inference_inputs.size(), maskRcnnInferencePlugin);\n        masks = maskRcnnInferenceLayer->getOutput(0);\n    }\n    instances.push_back(masks);\n}\n\nstd::vector<ITensor*> ROIHeads(INetworkDefinition *network, std::map<std::string, Weights>& weightMap,\nITensor* proposals, ITensor* features) {\n    std::vector<ITensor*> instances;\n\n    // box head\n    BoxHead(network, weightMap, proposals, features, instances);\n\n    if (MASK_ON) {\n        // mask head\n        MaskHead(network, weightMap, features, instances);\n    }\n\n    return instances;\n}\n\nICudaEngine* createEngine_rcnn(unsigned int maxBatchSize,\n    const std::string& wtsfile, IBuilder* builder, 
IBuilderConfig* config, DataType dt,\n    const std::string& quantizationType) {\n    /*\n    description: after fuse bn\n    */\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape {INPUT_H, INPUT_W, 3} with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_NODE_NAME, dt, Dims3{ INPUT_H, INPUT_W, 3 });\n    assert(data);\n\n    // preprocess\n    data = DataPreprocess(network, *data);\n    std::map<std::string, Weights> weightMap;\n    loadWeights(wtsfile, weightMap);\n\n    // backbone\n    ITensor* features = BuildResNet(network, weightMap, *data, BACKBONE_RESNETTYPE, 64, 64, RES2_OUT_CHANNELS);\n\n    auto proposals = RPN(network, weightMap, *features);\n    auto results = ROIHeads(network, weightMap, proposals, features);\n\n    // build output\n    for (int i = 0; i < results.size(); i++) {\n        network->markOutput(*results[i]);\n        results[i]->setName(OUTPUT_NAMES[i].c_str());\n    }\n\n    // build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1ULL << 30);\n    if (quantizationType == \"fp32\") {\n    } else if (quantizationType == \"fp16\") {\n        config->setFlag(BuilderFlag::kFP16);\n    } else if (quantizationType == \"int8\") {\n        std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n        assert(builder->platformHasFastInt8());\n        config->setFlag(BuilderFlag::kINT8);\n        Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, INPUT_W, INPUT_H, \"./coco_calib/\",\n        \"int8calib.table\", INPUT_NODE_NAME);\n        config->setInt8Calibrator(calibrator);\n    } else {\n        throw(\"does not support model type\");\n    }\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // destroy network\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        delete[] mem.second.values;\n    }\n    return engine;\n}\n\nvoid BuildRcnnModel(unsigned int maxBatchSize, IHostMemory** modelStream, const std::string& wtsfile,\nconst std::string& quantizationType = \"fp32\") {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    ICudaEngine* engine = createEngine_rcnn(maxBatchSize,\n        wtsfile, builder, config, DataType::kFLOAT, quantizationType);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, cudaStream_t& stream, std::vector<void*>& buffers,\nstd::vector<float>& input, std::vector<float*>& output) {\n    CUDA_CHECK(cudaMemcpyAsync(buffers[0], input.data(), BATCH_SIZE * INPUT_H * INPUT_W * 3 * sizeof(float),\n    cudaMemcpyHostToDevice, stream));\n\n    context.enqueue(BATCH_SIZE, buffers.data(), stream, nullptr);\n\n    CUDA_CHECK(cudaMemcpyAsync(output[0], buffers[1], BATCH_SIZE * DETECTIONS_PER_IMAGE * sizeof(float),\n    cudaMemcpyDeviceToHost, stream));\n    CUDA_CHECK(cudaMemcpyAsync(output[1], 
buffers[2], BATCH_SIZE * DETECTIONS_PER_IMAGE * 4 * sizeof(float),\n    cudaMemcpyDeviceToHost, stream));\n    CUDA_CHECK(cudaMemcpyAsync(output[2], buffers[3], BATCH_SIZE * DETECTIONS_PER_IMAGE * sizeof(float),\n    cudaMemcpyDeviceToHost, stream));\n    if (MASK_ON)\n        CUDA_CHECK(cudaMemcpyAsync(output[3], buffers[4],\n        BATCH_SIZE * DETECTIONS_PER_IMAGE * POOLER_RESOLUTION * POOLER_RESOLUTION * sizeof(float),\n        cudaMemcpyDeviceToHost, stream));\n\n    cudaStreamSynchronize(stream);\n}\n\nvoid calculateSize() {\n    float ratio = MIN_SIZE / static_cast<float>(std::min(INPUT_H_, INPUT_W_));\n    float newh = 0, neww = 0;\n    if (INPUT_H_ < INPUT_W_) {\n        newh = MIN_SIZE;\n        neww = ratio * INPUT_W_;\n    } else {\n        newh = ratio * INPUT_H_;\n        neww = MIN_SIZE;\n    }\n    if (std::max(newh, neww) > MAX_SIZE) {\n        ratio = MAX_SIZE / static_cast<float>(std::max(newh, neww));\n        newh = newh * ratio;\n        neww = neww * ratio;\n    }\n    INPUT_H = static_cast<int>(newh + 0.5);\n    INPUT_W = static_cast<int>(neww + 0.5);\n}\n\n\nbool parse_args(int argc, char** argv, std::string& wtsFile, std::string& engineFile, std::string& imgDir) {\n    if (argc < 4) return false;\n    if (std::string(argv[1]) == \"-s\") {\n        wtsFile = std::string(argv[2]);\n        engineFile = std::string(argv[3]);\n    } else if (std::string(argv[1]) == \"-d\") {\n        engineFile = std::string(argv[2]);\n        imgDir = std::string(argv[3]);\n    } else {\n        return false;\n    }\n    if (argc >= 5 && std::string(argv[4]) == \"m\") MASK_ON = true;\n    return true;\n}\n\nint main(int argc, char** argv) {\n    \n    int flag = 0;\n    for (int &item : NMS_METHOD_VEC) {\n        if (item == NMS_METHOD) {\n            flag = 1;\n            printf(\"The nms method %d is applied.\\n\", NMS_METHOD);\n            break;\n        }\n    }\n    if (flag == 0) {\n        printf(\"[WARNING] The nms_method %d is not supported, 
please choose from [0, 1, 2].\\n\", NMS_METHOD);\n        printf(\"[WARNING] To make the nms robust, the default nms method 0 is applied.\\n\");\n        NMS_METHOD = 0;\n    }\n\n    // calculate size\n    calculateSize();\n\n    cudaSetDevice(DEVICE);\n\n    std::string wtsFile = \"\";\n    std::string engineFile = \"\";\n\n    std::string imgDir;\n    if (!parse_args(argc, argv, wtsFile, engineFile, imgDir)) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./rcnn -s [.wts] [.engine] [m] // serialize model to plan file\" << std::endl;\n        std::cerr << \"./rcnn -d [.engine] ../samples [m]  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    if (!wtsFile.empty()) {\n        IHostMemory* modelStream{ nullptr };\n        BuildRcnnModel(BATCH_SIZE, &modelStream, wtsFile, \"fp32\");\n        assert(modelStream != nullptr);\n        std::ofstream p(engineFile, std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    }\n\n    // deserialize the .engine and run inference\n    std::ifstream file(engineFile, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engineFile << \" error!\" << std::endl;\n        return -1;\n    }\n\n    std::string trtModelStream;\n    size_t modelSize{ 0 };\n    file.seekg(0, file.end);\n    modelSize = file.tellg();\n    file.seekg(0, file.beg);\n    trtModelStream.resize(modelSize);\n    assert(!trtModelStream.empty());\n    file.read(const_cast<char*>(trtModelStream.c_str()), modelSize);\n    file.close();\n\n    // build engine\n    std::cout << \"build engine\" << std::endl;\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = 
runtime->deserializeCudaEngine(trtModelStream.c_str(), modelSize);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    runtime->destroy();\n\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    // prepare input file\n    std::vector<std::string> fileList;\n    if (read_files_in_dir(imgDir.c_str(), fileList) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // prepare input data\n    std::vector<float> data(BATCH_SIZE * INPUT_H * INPUT_W * 3, 0);\n    void *data_d, *scores_d, *boxes_d, *classes_d, *masks_d;\n    CUDA_CHECK(cudaMalloc(&data_d, BATCH_SIZE * INPUT_H * INPUT_W * 3 * sizeof(float)));\n    CUDA_CHECK(cudaMalloc(&scores_d, BATCH_SIZE * DETECTIONS_PER_IMAGE * sizeof(float)));\n    CUDA_CHECK(cudaMalloc(&boxes_d, BATCH_SIZE * DETECTIONS_PER_IMAGE * 4 * sizeof(float)));\n    CUDA_CHECK(cudaMalloc(&classes_d, BATCH_SIZE * DETECTIONS_PER_IMAGE * sizeof(float)));\n\n    std::vector<float> scores_h(BATCH_SIZE * DETECTIONS_PER_IMAGE);\n    std::vector<float> boxes_h(BATCH_SIZE * DETECTIONS_PER_IMAGE * 4);\n    std::vector<float> classes_h(BATCH_SIZE * DETECTIONS_PER_IMAGE);\n    std::vector<float> masks_h;\n\n    std::vector<void*> buffers = { data_d, scores_d, boxes_d, classes_d };\n    std::vector<float*> outputs = {scores_h.data(), boxes_h.data(), classes_h.data()};\n\n    if (MASK_ON) {\n        CUDA_CHECK(cudaMalloc(&masks_d,\n        BATCH_SIZE * DETECTIONS_PER_IMAGE * POOLER_RESOLUTION * POOLER_RESOLUTION * sizeof(float)));\n        masks_h.resize(BATCH_SIZE * DETECTIONS_PER_IMAGE * POOLER_RESOLUTION * POOLER_RESOLUTION);\n        buffers.push_back(masks_d);\n        outputs.push_back(masks_h.data());\n    }\n\n    int fcount = 0;\n    int fileLen = fileList.size();\n    for (int f = 0; f < fileLen; f++) {\n        fcount++;\n        if (fcount < BATCH_SIZE && f + 1 != fileLen) continue;\n\n   
     for (int b = 0; b < fcount; b++) {\n            cv::Mat img = cv::imread(imgDir + \"/\" + fileList[f - fcount + 1 + b]);\n            if (img.empty()) continue;  // skip unreadable files before the image is used\n            h_ori = img.rows;\n            w_ori = img.cols;\n            img = preprocessImg(img, INPUT_W, INPUT_H, X_LEFT_PAD, X_RIGHT_PAD, Y_TOP_PAD, Y_BOTTOM_PAD);\n\n            if (img.empty()) continue;\n            for (int i = 0; i < INPUT_H * INPUT_W * 3; i++)\n                data[b * INPUT_H * INPUT_W * 3 + i] = static_cast<float>(*(img.data + i));\n        }\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n\n        doInference(*context, stream, buffers, data, outputs);\n\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n        float h_ratio = static_cast<float>(h_ori) / (INPUT_H - (Y_TOP_PAD + Y_BOTTOM_PAD));  // ratio of original image size to model input size\n        float w_ratio = static_cast<float>(w_ori) / (INPUT_W - (X_LEFT_PAD + X_RIGHT_PAD));\n\n        for (int b = 0; b < fcount; b++) {\n            cv::Mat img = cv::imread(imgDir + \"/\" + fileList[f - fcount + 1 + b]);\n            if (img.empty()) continue;  // nothing to draw on or write out\n            for (int i = 0; i < DETECTIONS_PER_IMAGE; i++) {\n                if (scores_h[b * DETECTIONS_PER_IMAGE + i] > SCORE_THRESH) {\n                    float x1 = (boxes_h[b * DETECTIONS_PER_IMAGE * 4 + i * 4 + 0] - X_LEFT_PAD) * w_ratio;\n                    float y1 = (boxes_h[b * DETECTIONS_PER_IMAGE * 4 + i * 4 + 1] - Y_TOP_PAD) * h_ratio;\n                    float x2 = (boxes_h[b * DETECTIONS_PER_IMAGE * 4 + i * 4 + 2] - X_LEFT_PAD) * w_ratio;\n                    float y2 = (boxes_h[b * DETECTIONS_PER_IMAGE * 4 + i * 4 + 3] - Y_TOP_PAD) * h_ratio;\n                    int label = classes_h[b * DETECTIONS_PER_IMAGE + i];\n                    float score = scores_h[b * DETECTIONS_PER_IMAGE + i];\n                    printf(\"boxes:[%.6f, %.6f, %.6f, %.6f] scores: %.4f label: 
%d \\n\", x1, y1, x2, y2, score, label);\n                    cv::Rect r(x1, y1, x2 - x1, y2 - y1);\n                    cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n                    cv::putText(img, std::to_string(label), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                    cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n\n\n                    if (MASK_ON) {\n                        cv::Mat maskPart = cv::Mat::zeros(cv::Size(POOLER_RESOLUTION, POOLER_RESOLUTION), CV_32FC1);\n                        memcpy(maskPart.data,\n                          &masks_h[b * DETECTIONS_PER_IMAGE * POOLER_RESOLUTION * POOLER_RESOLUTION +\n                          i * POOLER_RESOLUTION * POOLER_RESOLUTION],\n                          POOLER_RESOLUTION * POOLER_RESOLUTION * sizeof(float));\n\n                        cv::Rect r(cv::Point(floor(x1) - 1 < 0 ? 0 : floor(x1) - 1,\n                                             floor(y1) - 1 < 0 ? 0 : floor(y1) - 1),\n                                   cv::Point(ceil(x2) + 1 > w_ori ? w_ori : ceil(x2) + 1,\n                                             ceil(y2) + 1 > h_ori ? 
h_ori : ceil(y2) + 1));\n                        cv::resize(maskPart, maskPart, cv::Size(r.width, r.height));\n                        cv::Mat curMask = cv::Mat::zeros(cv::Size(w_ori, h_ori), CV_8UC1);\n                        cv::threshold(maskPart, maskPart, 0.5, 255, cv::THRESH_BINARY);\n                        curMask(r) += maskPart;\n                        std::vector<std::vector<cv::Point>> contours;\n                        cv::findContours(curMask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_NONE);\n                        for (int c = 0; c < contours.size(); c++)\n                            cv::drawContours(img, contours, c, cv::Scalar(0, 0, 255));\n                    }\n                }\n            }\n            cv::imwrite(\"_\" + fileList[f - fcount + 1 + b], img);\n        }\n        fcount = 0;\n    }\n\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(data_d));\n    CUDA_CHECK(cudaFree(scores_d));\n    CUDA_CHECK(cudaFree(boxes_d));\n    CUDA_CHECK(cudaFree(classes_d));\n    if (MASK_ON) CUDA_CHECK(cudaFree(masks_d));\n    context->destroy();\n    engine->destroy();\n\n    return 0;\n}\n"
  },
  {
    "path": "real-esrgan/general-x4v3/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.16)\nproject(real-esrgan)\n\nset(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} \"${CMAKE_SOURCE_DIR}/cmake/\")\n\nadd_definitions(-std=c++17)\nadd_definitions(-DAPI_EXPORTS)\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\n#set(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\n#find_package(CUDA REQUIRED)\n\nINCLUDE_DIRECTORIES(${PROJECT_SOURCE_DIR}/src/include)\n\n# cuda\nFIND_PACKAGE(CUDA REQUIRED)\n#INCLUDE_DIRECTORIES(${CUDA_INCLUDE_DIRS})\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n\n# <------------------------TensorRT Related------------------------->\ninclude_directories(YOUR_TENSORRT_INCLUDE_DIR) # TensorRT-8.6.1.6/include\nlink_directories(YOUR_TENSORRT_LIB_DIR) # TensorRT-8.6.1.6/lib\n\n# <------------------------OpenCV Related------------------------->\n# opencv\nFIND_PACKAGE(OpenCV REQUIRED)\nINCLUDE_DIRECTORIES(${OpenCV_INCLUDE_DIRS})\n\nset(CMAKE_CXX_STANDARD 17)\n\nadd_executable(${PROJECT_NAME} main.cpp)\n\ncuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/src/pixel_shuffle/pixel_shuffle.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\n\nTARGET_LINK_LIBRARIES(${PROJECT_NAME} nvinfer)\nTARGET_LINK_LIBRARIES(${PROJECT_NAME} cudart)\nTARGET_LINK_LIBRARIES(${PROJECT_NAME} ${OpenCV_LIBS})\nTARGET_LINK_LIBRARIES(${PROJECT_NAME} myplugins)\n"
  },
  {
    "path": "real-esrgan/general-x4v3/README.md",
    "content": "# Real-ESRGAN realesr-general-x4v3 model\n\n## How to Run\n0. Replace YOUR_TENSORRT_INCLUDE_DIR and YOUR_TENSORRT_LIB_DIR in CMakeLists.txt with your TensorRT include and lib directories.\n1. Generate a .wts file from the PyTorch .pth weights\n```\ngit clone https://github.com/xinntao/Real-ESRGAN.git\ncd Real-ESRGAN\n\n# Install basicsr - https://github.com/xinntao/BasicSR\n# We use BasicSR for both training and inference\npip install basicsr\n# facexlib and gfpgan are for face enhancement\npip install facexlib\npip install gfpgan\npip install -r requirements.txt\npython setup.py develop\n```\nDownload realesr-general-x4v3.pth (and realesr-general-wdn-x4v3.pth if needed) from\nhttps://github.com/xinntao/Real-ESRGAN/releases\n\n```\ncp {tensorrtx}/real-esrgan/general-x4v3/gen_wts.py {xinntao}/Real-ESRGAN\ncd {xinntao}/Real-ESRGAN\npython gen_wts.py\n# a file 'real-esrgan.wts' will be generated.\n```\n\n**Note: if you need both realesr-general-x4v3.pth and realesr-general-wdn-x4v3.pth, write a Python script that averages the weights of the two checkpoints (from {xinntao}/Real-ESRGAN), save the result as a new .pth file, and generate the .wts file from that new file.**\n\n2. Build tensorrtx/real-esrgan/general-x4v3 and run\n\n```\ncd {tensorrtx}/real-esrgan/general-x4v3/\nmkdir build\ncd build\ncp {xinntao}/Real-ESRGAN/real-esrgan.wts {tensorrtx}/real-esrgan/general-x4v3/weights/\ncmake ..\nmake\n./real-esrgan your_images_dir\n```\n"
  },
  {
    "path": "real-esrgan/general-x4v3/cmake/FindTensorRT.cmake",
    "content": "# source:\n# https://github.com/NVIDIA/tensorrt-laboratory/blob/master/cmake/FindTensorRT.cmake\n\n# This module defines the following variables:\n#\n# ::\n#\n#   TensorRT_INCLUDE_DIRS\n#   TensorRT_LIBRARIES\n#   TensorRT_FOUND\n#\n# ::\n#\n#   TensorRT_VERSION_STRING - version (x.y.z)\n#   TensorRT_VERSION_MAJOR  - major version (x)\n#   TensorRT_VERSION_MINOR  - minor version (y)\n#   TensorRT_VERSION_PATCH  - patch version (z)\n#\n# Hints\n# ^^^^^\n# A user may set ``TensorRT_DIR`` to an installation root to tell this module where to look.\n#\nset(_TensorRT_SEARCHES)\n\nif(TensorRT_DIR)\n    set(_TensorRT_SEARCH_ROOT PATHS ${TensorRT_DIR} NO_DEFAULT_PATH)\n    list(APPEND _TensorRT_SEARCHES _TensorRT_SEARCH_ROOT)\nendif()\n\n# appends some common paths\nset(_TensorRT_SEARCH_NORMAL\n        PATHS \"/usr\"\n        )\nlist(APPEND _TensorRT_SEARCHES _TensorRT_SEARCH_NORMAL)\n\n# Include dir\nforeach(search ${_TensorRT_SEARCHES})\n    find_path(TensorRT_INCLUDE_DIR NAMES NvInfer.h ${${search}} PATH_SUFFIXES include)\nendforeach()\n\nif(NOT TensorRT_LIBRARY)\n    foreach(search ${_TensorRT_SEARCHES})\n        find_library(TensorRT_LIBRARY NAMES nvinfer ${${search}} PATH_SUFFIXES lib)\n    endforeach()\nendif()\n\nif(NOT TensorRT_PARSERS_LIBRARY)\n    foreach(search ${_TensorRT_SEARCHES})\n        find_library(TensorRT_NVPARSERS_LIBRARY NAMES nvparsers ${${search}} PATH_SUFFIXES lib)\n    endforeach()\nendif()\n\nif(NOT TensorRT_NVONNXPARSER_LIBRARY)\n    foreach(search ${_TensorRT_SEARCHES})\n        find_library(TensorRT_NVONNXPARSER_LIBRARY NAMES nvonnxparser ${${search}} PATH_SUFFIXES lib)\n    endforeach()\nendif()\n\nmark_as_advanced(TensorRT_INCLUDE_DIR)\n\nif(TensorRT_INCLUDE_DIR AND EXISTS \"${TensorRT_INCLUDE_DIR}/NvInfer.h\")\n    file(STRINGS \"${TensorRT_INCLUDE_DIR}/NvInfer.h\" TensorRT_MAJOR REGEX \"^#define NV_TENSORRT_MAJOR [0-9]+.*$\")\n    file(STRINGS \"${TensorRT_INCLUDE_DIR}/NvInfer.h\" TensorRT_MINOR REGEX \"^#define 
NV_TENSORRT_MINOR [0-9]+.*$\")\n    file(STRINGS \"${TensorRT_INCLUDE_DIR}/NvInfer.h\" TensorRT_PATCH REGEX \"^#define NV_TENSORRT_PATCH [0-9]+.*$\")\n\n    string(REGEX REPLACE \"^#define NV_TENSORRT_MAJOR ([0-9]+).*$\" \"\\\\1\" TensorRT_VERSION_MAJOR \"${TensorRT_MAJOR}\")\n    string(REGEX REPLACE \"^#define NV_TENSORRT_MINOR ([0-9]+).*$\" \"\\\\1\" TensorRT_VERSION_MINOR \"${TensorRT_MINOR}\")\n    string(REGEX REPLACE \"^#define NV_TENSORRT_PATCH ([0-9]+).*$\" \"\\\\1\" TensorRT_VERSION_PATCH \"${TensorRT_PATCH}\")\n    set(TensorRT_VERSION_STRING \"${TensorRT_VERSION_MAJOR}.${TensorRT_VERSION_MINOR}.${TensorRT_VERSION_PATCH}\")\nendif()\n\ninclude(FindPackageHandleStandardArgs)\nFIND_PACKAGE_HANDLE_STANDARD_ARGS(TensorRT REQUIRED_VARS TensorRT_LIBRARY TensorRT_INCLUDE_DIR VERSION_VAR TensorRT_VERSION_STRING)\n\nif(TensorRT_FOUND)\n    set(TensorRT_INCLUDE_DIRS ${TensorRT_INCLUDE_DIR})\n\n    if(NOT TensorRT_LIBRARIES)\n        set(TensorRT_LIBRARIES ${TensorRT_LIBRARY} ${TensorRT_NVONNXPARSER_LIBRARY} ${TensorRT_NVPARSERS_LIBRARY})\n    endif()\n\n    if(NOT TARGET TensorRT::TensorRT)\n        add_library(TensorRT::TensorRT UNKNOWN IMPORTED)\n        set_target_properties(TensorRT::TensorRT PROPERTIES INTERFACE_INCLUDE_DIRECTORIES \"${TensorRT_INCLUDE_DIRS}\")\n        set_property(TARGET TensorRT::TensorRT APPEND PROPERTY IMPORTED_LOCATION \"${TensorRT_LIBRARY}\")\n    endif()\nendif()\n"
  },
  {
    "path": "real-esrgan/general-x4v3/gen_wts.py",
    "content": "import argparse\nimport os\nimport struct\nfrom realesrgan import RealESRGANer\nfrom realesrgan.archs.srvgg_arch import SRVGGNetCompact\n\nfrom basicsr.archs.rrdbnet_arch import RRDBNet\nfrom basicsr.utils.download_util import load_file_from_url\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument('-i', '--input', type=str, help='Input image or folder')\n    parser.add_argument(\n        '-n',\n        '--model_name',\n        type=str,\n        default='realesr-general-x4v3',\n        help=('RealESRGAN_x2plus Model names: '\n              'realesr-animevideov3 | realesr-general-x4v3'))\n    parser.add_argument('-o', '--output', type=str, help='Output folder')\n    parser.add_argument(\n        '-dn',\n        '--denoise_strength',\n        type=float,\n        default=0.5,\n        help=('Denoise strength. 0 for weak denoise (keep noise), 1 for strong denoise ability. '\n              'Only used for the realesr-general-x4v3 model'))\n    parser.add_argument('-s', '--outscale', type=float, default=4, help='The final upsampling scale of the image')\n    parser.add_argument(\n        '--model_path', type=str, default=None, help='[Option] Model path. Usually, you do not need to specify it')\n    parser.add_argument('--suffix', type=str, default='out', help='Suffix of the restored image')\n    parser.add_argument('-t', '--tile', type=int, default=0, help='Tile size, 0 for no tile during testing')\n    parser.add_argument('--tile_pad', type=int, default=10, help='Tile padding')\n    parser.add_argument('--pre_pad', type=int, default=0, help='Pre padding size at each border')\n    parser.add_argument('--face_enhance', action='store_true', help='Use GFPGAN to enhance face')\n    parser.add_argument(\n        '--fp32', action='store_true', help='Use fp32 precision during inference. 
Default: fp16 (half precision).')\n    parser.add_argument(\n        '--alpha_upsampler',\n        type=str,\n        default='realesrgan',\n        help='The upsampler for the alpha channels. Options: realesrgan | bicubic')\n    parser.add_argument(\n        '--ext',\n        type=str,\n        default='auto',\n        help='Image extension. Options: auto | jpg | png, auto means using the same extension as inputs')\n    parser.add_argument(\n        '-g', '--gpu-id', type=int, default=None, help='gpu device to use (default=None) can be 0,1,2 for multi-gpu')\n\n    args = parser.parse_args()\n\n    # determine models according to model names\n    args.model_name = args.model_name.split('.')[0]\n    if args.model_name == 'RealESRGAN_x4plus':  # x4 RRDBNet model\n        model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=4)\n        netscale = 4\n        file_url = ['https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth']\n    elif args.model_name == 'RealESRNet_x4plus':  # x4 RRDBNet model\n        model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=4)\n        netscale = 4\n        file_url = ['https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.1/RealESRNet_x4plus.pth']\n    elif args.model_name == 'RealESRGAN_x4plus_anime_6B':  # x4 RRDBNet model with 6 blocks\n        model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=6, num_grow_ch=32, scale=4)\n        netscale = 4\n        file_url = ['https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth']\n    elif args.model_name == 'RealESRGAN_x2plus':  # x2 RRDBNet model\n        model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=2)\n        netscale = 2\n        file_url = ['https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth']\n    elif args.model_name == 
'realesr-animevideov3':  # x4 VGG-style model (XS size)\n        model = SRVGGNetCompact(num_in_ch=3, num_out_ch=3, num_feat=64, num_conv=16, upscale=4, act_type='prelu')\n        netscale = 4\n        file_url = ['https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.5.0/realesr-animevideov3.pth']\n    elif args.model_name == 'realesr-general-x4v3':  # x4 VGG-style model (S size)\n        model = SRVGGNetCompact(num_in_ch=3, num_out_ch=3, num_feat=64, num_conv=32, upscale=4, act_type='prelu')\n        netscale = 4\n        file_url = [\n            'https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.5.0/realesr-general-wdn-x4v3.pth',\n            'https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.5.0/realesr-general-x4v3.pth'\n        ]\n\n    # determine model paths\n    if args.model_path is not None:\n        model_path = args.model_path\n    else:\n        model_path = os.path.join('weights', args.model_name + '.pth')\n        if not os.path.isfile(model_path):\n            ROOT_DIR = os.path.dirname(os.path.abspath(__file__))\n            for url in file_url:\n                # model_path will be updated\n                model_path = load_file_from_url(\n                    url=url, model_dir=os.path.join(ROOT_DIR, 'weights'), progress=True, file_name=None)\n\n    # use dni to control the denoise strength\n    dni_weight = None\n    if args.model_name == 'realesr-general-x4v3' and args.denoise_strength != 1:\n        # wdn_model_path = model_path.replace('realesr-general-x4v3', 'realesr-general-wdn-x4v3')\n        # model_path = [model_path, wdn_model_path]\n        # dni_weight = [args.denoise_strength, 1 - args.denoise_strength]\n        model_path = model_path.replace('realesr-general-x4v3', 'realesr-general-x4v3-cat')\n        dni_weight = None\n\n    # restorer\n    upsampler = RealESRGANer(\n        scale=netscale,\n        model_path=model_path,\n        dni_weight=dni_weight,\n        model=model,\n        
tile=args.tile,\n        tile_pad=args.tile_pad,\n        pre_pad=args.pre_pad,\n        half=not args.fp32,\n        gpu_id=args.gpu_id)\n\n    if os.path.isfile('real-esrgan.wts'):\n        print('real-esrgan.wts already exists.')\n    else:\n        print('making real-esrgan.wts file ...')\n        state_dict = upsampler.model.state_dict()\n        # Use a context manager so the file is flushed and closed even on error.\n        with open(\"real-esrgan.wts\", 'w') as f:\n            # First line: number of weight blobs.\n            f.write(\"{}\\n\".format(len(state_dict)))\n            for k, v in state_dict.items():\n                print('key: ', k)\n                print('value: ', v.shape)\n                vr = v.reshape(-1).cpu().numpy()\n                # One line per blob: name, element count, then the values.\n                f.write(\"{} {}\".format(k, len(vr)))\n                for vv in vr:\n                    f.write(\" \")\n                    # Each value is written as a big-endian float32 in hex.\n                    f.write(struct.pack(\">f\", float(vv)).hex())\n                f.write(\"\\n\")\n        print('Completed real-esrgan.wts file!')\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "real-esrgan/general-x4v3/main.cpp",
    "content": "#include <NvInfer.h>\n#include <dirent.h>\n#include <fstream>\n#include <iostream>\n#include <memory>\n#include <opencv4/opencv2/opencv.hpp>\n#include <vector>\n\n#include \"config/config.hpp\"\n#include \"cuda_utils.h\"\n#include \"logging/logging.h\"\n#include \"pixel_shuffle/pixel_shuffle.hpp\"\n#include \"preprocess/preprocess.hpp\"\n\nstatic Logger gLogger;\n\nusing namespace nvinfer1;\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nauto* ConvPRelu(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int conv_nb,\n                int index) {\n\n    IConvolutionLayer* conv = network->addConvolutionNd(input, conv_nb, DimsHW{3, 3},\n                                                        weightMap[\"body.\" + std::to_string(index) + \".weight\"],\n                                                        
weightMap[\"body.\" + std::to_string(index) + \".bias\"]);\n    assert(conv);\n    conv->setName((\"body.\" + std::to_string(index) + \".weight\").c_str());\n    conv->setStrideNd(DimsHW{1, 1});\n    conv->setPaddingNd(DimsHW{1, 1});\n    auto conv_res = conv->getOutput(0);\n\n    // add prelu layer\n    // slope 64 number\n\n    //auto slope = network->addConstant( {64}, weightMap[\"body.\" + std::to_string(index + 1) + \".weight\"] );\n    auto slope = network->addConstant(Dims4{1, 64, 1, 1}, weightMap[\"body.\" + std::to_string(index + 1) + \".weight\"]);\n    assert(slope);\n    slope->setName((\"body.\" + std::to_string(index + 1) + \".weight\").c_str());\n\n    auto prelu = network->addParametricReLU(*conv_res, *slope->getOutput(0));\n    assert(prelu);\n\n    return prelu;\n}\n\nvoid build_engine(DataType dt, std::string& wts_path) {\n\n    std::map<std::string, Weights> weightMap = loadWeights(wts_path);\n\n    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);\n    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();\n\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1U);\n\n    auto data = network->addInput(INPUT_BLOB_NAME, nvinfer1::DataType::kFLOAT,\n                                  nvinfer1::Dims4{BATCH_SIZE, INPUT_C, INPUT_H, INPUT_W});\n\n    // first\n    auto layer = ConvPRelu(network, weightMap, *data, 64, 0);\n\n    for (int i = 0; i < 32; ++i) {\n        layer = ConvPRelu(network, weightMap, *layer->getOutput(0), 64, 2 * i + 2);\n    }\n\n    auto conv_last = network->addConvolutionNd(*layer->getOutput(0), 48, DimsHW{3, 3}, weightMap[\"body.66.weight\"],\n                                               weightMap[\"body.66.bias\"]);\n    assert(conv_last);\n    conv_last->setName(\"body.66.weight\");\n    conv_last->setStrideNd(DimsHW{1, 1});\n    conv_last->setPaddingNd(DimsHW{1, 1});\n    auto conv_last_res = conv_last->getOutput(0);\n\n    // add pixel shuffle layer by plugin\n    
IPluginCreator* creator = getPluginRegistry()->getPluginCreator(\"PixelShufflePlugin\", \"1\");\n    const PluginFieldCollection* pluginFC = creator->getFieldNames();\n    std::vector<PluginField> pluginData;\n    int upscaleFactor = 4;\n    pluginData.emplace_back(PluginField{\"upscaleFactor\", &upscaleFactor, PluginFieldType::kINT32, 1});\n    PluginFieldCollection pluginFCWithData = {static_cast<int>(pluginData.size()), pluginData.data()};\n    auto pluginObj = creator->createPlugin(\"PixelShuffle\", &pluginFCWithData);\n\n    auto pixelShuffleLayer = network->addPluginV2(&conv_last_res, 1, *pluginObj);\n\n    // the input \"data\" interpolate 4x and add to pixelShuffleLayer->getOutput(0)\n\n    auto interpolateLayer = network->addResize(*data);\n    interpolateLayer->setResizeMode(ResizeMode::kNEAREST);\n    // Define scale factors\n    float scales[] = {1.0f, 1.0f, 1.0 * OUT_SCALE, 1.0 * OUT_SCALE};  // scale_factor=4 for height and width\n    interpolateLayer->setScales(scales, OUT_SCALE);\n\n    // Add the two tensor as output\n    auto addLayer = network->addElementWise(*interpolateLayer->getOutput(0), *pixelShuffleLayer->getOutput(0),\n                                            ElementWiseOperation::kSUM);\n\n    // output\n    addLayer->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*addLayer->getOutput(0));\n\n    // fp16\n    if (USE_FP16) {\n        config->setFlag(BuilderFlag::kFP16);\n    }\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    std::ofstream ofs(\"../weights/real-esrgan.engine\", std::ios::binary);\n\n    assert(serialized_model != nullptr);\n    
ofs.write(reinterpret_cast<const char*>(serialized_model->data()), serialized_model->size());\n\n    delete config;\n    delete serialized_model;\n    delete builder;\n}\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR* p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\nvoid doInference(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output) {\n    context.setBindingDimensions(0, Dims4(BATCH_SIZE, INPUT_C, INPUT_H, INPUT_W));\n    context.enqueueV2(buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], BATCH_SIZE * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost,\n                               stream));\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nint main(int argc, char** argv) {\n    std::string img_dir;\n\n    if (argc < 2) {\n        std::cerr << \"Usage: \" << argv[0] << \" <image_dir>\" << std::endl;\n        return -1;\n    } else {\n        img_dir = argv[1];\n    }\n\n    std::string wts_path = \"../weights/real-esrgan.wts\";\n    build_engine(DataType::kFLOAT, wts_path);\n\n    std::string engine_name = \"../weights/real-esrgan.engine\";\n    // deserialize the .engine and run inference\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        return -1;\n    }\n    char* trtModelStream = nullptr;\n    size_t 
size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    trtModelStream = new char[size];\n    assert(trtModelStream);\n    file.read(trtModelStream, size);\n    file.close();\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n    assert(engine->getNbBindings() == 2);\n    void* buffers[2];\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc(&buffers[inputIndex], BATCH_SIZE * INPUT_C * INPUT_H * INPUT_W * sizeof(float)));\n    CUDA_CHECK(cudaMalloc(&buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(float)));\n\n    std::vector<float> data;\n    std::vector<float> output;\n    //std::vector<float> res;\n\n    //data.resize(BATCH_SIZE * 3 * INPUT_H * INPUT_W);\n    data.resize(BATCH_SIZE * INPUT_C * INPUT_H * INPUT_W);\n    output.resize(BATCH_SIZE * OUTPUT_SIZE);\n\n    // Create stream\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    for (int index = 0; index < file_names.size(); ++index) {\n\n        auto img = cv::imread(img_dir + \"/\" + file_names[index]);\n        auto begin = 
std::chrono::high_resolution_clock::now();\n\n        // BATCH_SIZE = 1\n        for (int b = 0; b < BATCH_SIZE; b++) {\n            int i = 0;\n            for (int row = 0; row < INPUT_H; ++row) {\n                uchar* uc_pixel = img.data + row * img.step;\n                for (int col = 0; col < INPUT_W; ++col) {\n                    //    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\n                    // BGR2RGB and normalization\n                    data[b * 3 * INPUT_H * INPUT_W + i] = (float)uc_pixel[2] / 255.0;\n                    data[b * 3 * INPUT_H * INPUT_W + i + INPUT_H * INPUT_W] = (float)uc_pixel[1] / 255.0;\n                    data[b * 3 * INPUT_H * INPUT_W + i + 2 * INPUT_H * INPUT_W] = (float)uc_pixel[0] / 255.0;\n                    uc_pixel += 3;\n                    ++i;\n                }\n            }\n        }\n        CUDA_CHECK(cudaMemcpyAsync(buffers[inputIndex], data.data(),\n                                   BATCH_SIZE * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice,\n                                   stream));\n        doInference(*context, stream, (void**)buffers, output.data());\n        auto end = std::chrono::high_resolution_clock::now();\n        std::cout << \"Inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count()\n                  << \" ms\" << std::endl;\n\n        int OUTPUT_C = 3;\n        int OUTPUT_H = INPUT_H * OUT_SCALE;\n        int OUTPUT_W = INPUT_W * OUT_SCALE;\n\n        for (int b = 0; b < BATCH_SIZE; b++) {\n            cv::Mat img_res(OUTPUT_H, OUTPUT_W, CV_8UC3);\n            int i = 0;\n            for (int row = 0; row < OUTPUT_H; ++row) {\n                uchar* uc_pixel = img_res.data + row * img_res.step;\n                for (int col = 0; col < OUTPUT_W; ++col) {\n                    // RGB2BGR and de_normalization\n                    auto r2 = std::round(output[b * OUTPUT_C * OUTPUT_H * OUTPUT_W + i] * 255.0);\n           
         if (r2 < 0)\n                        r2 = 0;\n                    if (r2 > 255)\n                        r2 = 255;\n                    auto g2 = std::round(output[b * OUTPUT_C * OUTPUT_H * OUTPUT_W + i + 1 * OUTPUT_H * OUTPUT_W] *\n                                         255.0);\n                    if (g2 < 0)\n                        g2 = 0;\n                    if (g2 > 255)\n                        g2 = 255;\n                    auto b2 = std::round(output[b * OUTPUT_C * OUTPUT_H * OUTPUT_W + i + 2 * OUTPUT_H * OUTPUT_W] *\n                                         255.0);\n                    if (b2 < 0)\n                        b2 = 0;\n                    if (b2 > 255)\n                        b2 = 255;\n\n                    uc_pixel[0] = static_cast<uchar>(b2);  // B\n                    uc_pixel[1] = static_cast<uchar>(g2);  // G\n                    uc_pixel[2] = static_cast<uchar>(r2);  // R\n\n                    // uc_pixel[0] = static_cast<uchar>(std::round(output[b * OUTPUT_C * OUTPUT_H * OUTPUT_W + i + 2 * OUTPUT_H * OUTPUT_W] * 255.0)); // B\n                    // uc_pixel[1] = static_cast<uchar>(std::round(output[b * OUTPUT_C * OUTPUT_H * OUTPUT_W + i + 1 * OUTPUT_H * OUTPUT_W] * 255.0)); // G\n                    // uc_pixel[2] = static_cast<uchar>(std::round(output[b * OUTPUT_C * OUTPUT_H * OUTPUT_W + i] * 255.0)); // R\n                    uc_pixel += 3;\n                    ++i;\n                }\n            }\n            cv::imwrite(\"_\" + file_names[index] + \".jpg\", img_res);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(buffers[0]));\n    CUDA_CHECK(cudaFree(buffers[1]));\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n}\n"
  },
  {
    "path": "real-esrgan/general-x4v3/src/include/config/config.hpp",
    "content": "#ifndef REAL_ESRGAN_TRT_CONFIG_HPP\n#define REAL_ESRGAN_TRT_CONFIG_HPP\n\n#include <string>\n\n//std::string INPUT_BLOB_NAME = \"input\";\n//std::string OUTPUT_BLOB_NAME = \"output\";\n\nconst char* INPUT_BLOB_NAME = \"input_0\";\nconst char* OUTPUT_BLOB_NAME = \"output_0\";\n\nconst bool USE_FP16 = false;\n\nstatic const int BATCH_SIZE = 1;\nstatic const int INPUT_C = 3;\nstatic const int INPUT_H = 450;\nstatic const int INPUT_W = 300;\nstatic const int OUT_SCALE = 4;\n// Number of floats in one output image: INPUT_C x (INPUT_H * OUT_SCALE) x (INPUT_W * OUT_SCALE).\n// Callers multiply by BATCH_SIZE themselves, so the batch dimension is not included here.\nstatic const int OUTPUT_SIZE = INPUT_C * INPUT_H * OUT_SCALE * INPUT_W * OUT_SCALE;\n#endif  //REAL_ESRGAN_TRT_CONFIG_HPP\n"
  },
  {
    "path": "real-esrgan/general-x4v3/src/include/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n#include <stdint.h>\n#include <cassert>\n#include <cstdio>\n#include <iostream>\n#include <vector>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n"
  },
  {
    "path": "real-esrgan/general-x4v3/src/include/logging/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync() 
{\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  
This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) noexcept override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "real-esrgan/general-x4v3/src/include/pixel_shuffle/pixel_shuffle.hpp",
    "content": "#ifndef REAL_ESRGAN_TRT_PIXEL_SHUFFLE_HPP\n#define REAL_ESRGAN_TRT_PIXEL_SHUFFLE_HPP\n\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n\nclass PixelShufflePlugin : public nvinfer1::IPluginV2DynamicExt {\n   public:\n    PixelShufflePlugin(int upscaleFactor) : mUpscaleFactor(upscaleFactor) {}\n\n    PixelShufflePlugin(const void* data, size_t length) { memcpy(&mUpscaleFactor, data, sizeof(mUpscaleFactor)); }\n\n    const char* getPluginType() const noexcept override { return \"PixelShufflePlugin\"; }\n\n    const char* getPluginVersion() const noexcept override { return \"1\"; }\n\n    int getNbOutputs() const noexcept override { return 1; }\n\n    // nvinfer1::DimsExprs getOutputDimensions(int outputIndex, const nvinfer1::DimsExprs* inputs, int nbInputs, nvinfer1::IExprBuilder& exprBuilder) noexcept override\n    // {\n    //     assert(outputIndex == 0);\n    //     auto* in = &inputs[0];\n    //     nvinfer1::DimsExprs outputDims = *in;\n    //     int channels = in->d[0];\n    //     int height = in->d[1];\n    //     int width = in->d[2];\n    //     int upscaleFactor = mUpscaleFactor;\n    //     outputDims.d[0] = exprBuilder.constant(channels / (upscaleFactor * upscaleFactor));\n    //     outputDims.d[1] = exprBuilder.operation(nvinfer1::DimensionOperation::kPROD, {height, exprBuilder.constant(upscaleFactor)});\n    //     outputDims.d[2] = exprBuilder.operation(nvinfer1::DimensionOperation::kPROD, {width, exprBuilder.constant(upscaleFactor)});\n    //     return outputDims;\n    // }\n    nvinfer1::DimsExprs getOutputDimensions(int32_t outputIndex, nvinfer1::DimsExprs const* inputs, int32_t nbInputs,\n                                            nvinfer1::IExprBuilder& exprBuilder) noexcept override {\n        // assert(nbInputs == 1);\n        auto inDims = inputs[0];\n        // assert(inDims.nbDims == 4);\n        int c = inDims.d[1]->getConstantValue() / (mUpscaleFactor * mUpscaleFactor);\n        int h = 
inDims.d[2]->getConstantValue() * mUpscaleFactor;\n        int w = inDims.d[3]->getConstantValue() * mUpscaleFactor;\n        nvinfer1::DimsExprs outDims;\n        outDims.nbDims = 4;\n        outDims.d[0] = inDims.d[0];\n        outDims.d[1] = exprBuilder.constant(c);\n        outDims.d[2] = exprBuilder.constant(h);\n        outDims.d[3] = exprBuilder.constant(w);\n        return outDims;\n    }\n\n    bool supportsFormatCombination(int pos, const nvinfer1::PluginTensorDesc* inOut, int nbInputs,\n                                   int nbOutputs) noexcept override {\n        return inOut[pos].format == nvinfer1::TensorFormat::kLINEAR && inOut[pos].type == nvinfer1::DataType::kFLOAT;\n    }\n\n    nvinfer1::DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes,\n                                         int nbInputs) const noexcept override {\n\n        return inputTypes[0];\n    }\n\n    // bool canBroadcastInputAcrossBatch(int inputIndex) const noexcept override\n    // {\n    //     return false;\n    // }\n\n    void configurePlugin(const nvinfer1::DynamicPluginTensorDesc* inputs, int nbInputs,\n                         const nvinfer1::DynamicPluginTensorDesc* outputs, int nbOutputs) noexcept override {}\n\n    // void configurePlugin(const nvinfer1::DynamicPluginTensorDesc* in, int nbInputs, const nvinfer1::DynamicPluginTensorDesc* out, int nbOutputs) noexcept override\n    // {\n    //     // Optionally configure plugin if necessary\n    // }\n\n    size_t getWorkspaceSize(const nvinfer1::PluginTensorDesc* inputs, int nbInputs,\n                            const nvinfer1::PluginTensorDesc* outputs, int nbOutputs) const noexcept override {\n        return 0;\n    }\n\n    size_t getSerializationSize() const noexcept override { return sizeof(mUpscaleFactor); }\n\n    void serialize(void* buffer) const noexcept override { memcpy(buffer, &mUpscaleFactor, sizeof(mUpscaleFactor)); }\n\n    void destroy() noexcept override {\n        // delete 
this;\n    }\n\n    nvinfer1::IPluginV2DynamicExt* clone() const noexcept override { return new PixelShufflePlugin(mUpscaleFactor); }\n\n    void setPluginNamespace(const char* pluginNamespace) noexcept override { mNamespace = pluginNamespace; }\n\n    const char* getPluginNamespace() const noexcept override { return mNamespace.c_str(); }\n\n    int initialize() noexcept override { return 0; }\n\n    int32_t enqueue(nvinfer1::PluginTensorDesc const* inputDesc, nvinfer1::PluginTensorDesc const* outputDesc,\n                    void const* const* inputs, void* const* outputs, void* workspace,\n                    cudaStream_t stream) noexcept override;\n\n    void terminate() noexcept override {}\n\n   private:\n    int mUpscaleFactor;\n    std::string mNamespace;\n};\n\nclass PixelShufflePluginCreator : public nvinfer1::IPluginCreator {\n   public:\n    PixelShufflePluginCreator() {\n        mPluginAttributes.clear();\n        mPluginAttributes.emplace_back(\n                nvinfer1::PluginField(\"upscaleFactor\", nullptr, nvinfer1::PluginFieldType::kINT32, 1));\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    ~PixelShufflePluginCreator() override = default;\n\n    const char* getPluginName() const noexcept override { return \"PixelShufflePlugin\"; }\n\n    const char* getPluginVersion() const noexcept override { return \"1\"; }\n\n    const nvinfer1::PluginFieldCollection* getFieldNames() noexcept override { return &mFC; }\n\n    nvinfer1::IPluginV2* createPlugin(const char* name, const nvinfer1::PluginFieldCollection* fc) noexcept override {\n        int upscaleFactor = 0;\n        for (int i = 0; i < fc->nbFields; ++i) {\n            if (strcmp(fc->fields[i].name, \"upscaleFactor\") == 0) {\n                upscaleFactor = *static_cast<const int*>(fc->fields[i].data);\n            }\n        }\n        return new PixelShufflePlugin(upscaleFactor);\n    }\n\n    nvinfer1::IPluginV2* 
deserializePlugin(const char* name, const void* serialData,\n                                           size_t serialLength) noexcept override {\n        return new PixelShufflePlugin(serialData, serialLength);\n    }\n\n    void setPluginNamespace(const char* pluginNamespace) noexcept override { mNamespace = pluginNamespace; }\n\n    const char* getPluginNamespace() const noexcept override { return mNamespace.c_str(); }\n\n   private:\n    static nvinfer1::PluginFieldCollection mFC;\n    static std::vector<nvinfer1::PluginField> mPluginAttributes;\n    std::string mNamespace;\n};\n\nnvinfer1::PluginFieldCollection PixelShufflePluginCreator::mFC{};\nstd::vector<nvinfer1::PluginField> PixelShufflePluginCreator::mPluginAttributes{\n        nvinfer1::PluginField{\"upscaleFactor\", nullptr, nvinfer1::PluginFieldType::kINT32, 1}};\n\nREGISTER_TENSORRT_PLUGIN(PixelShufflePluginCreator);\n\n#endif  //REAL_ESRGAN_TRT_PIXEL_SHUFFLE_HPP\n"
  },
  {
    "path": "real-esrgan/general-x4v3/src/include/preprocess/preprocess.hpp",
    "content": "#ifndef REAL_ESRGAN_TRT_PREPROCESS_HPP\n#define REAL_ESRGAN_TRT_PREPROCESS_HPP\n\nstruct PreprocessStruct {\n    int N;\n    int C;\n    int H;\n    int W;\n};\n\n#endif  //REAL_ESRGAN_TRT_PREPROCESS_HPP\n"
  },
  {
    "path": "real-esrgan/general-x4v3/src/pixel_shuffle/pixel_shuffle.cpp",
    "content": "// PixelShufflePlugin.cpp\n//\n// #include \"pixel_shuffle/pixel_shuffle.hpp\"\n// #include <cstring>\n// #include <cassert>\n//\n// PixelShufflePlugin::PixelShufflePlugin(int upscaleFactor)\n//         : mUpscaleFactor(upscaleFactor) {\n//     // Initialize other members\n// }\n//\n// PixelShufflePlugin::PixelShufflePlugin(const void* data, size_t length) {\n//     // Deserialize data to initialize members\n//     const char* d = static_cast<const char*>(data);\n//     mUpscaleFactor = *reinterpret_cast<const int*>(d);\n//     d += sizeof(int);\n//     mInputVolume = *reinterpret_cast<const size_t*>(d);\n//     d += sizeof(size_t);\n//     mOutputVolume = *reinterpret_cast<const size_t*>(d);\n// }\n//\n// int PixelShufflePlugin::getNbOutputs() const {\n//     return 1;\n// }\n//\n// nvinfer1::Dims PixelShufflePlugin::getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) {\n//     assert(index == 0);\n//     assert(nbInputDims == 1);\n//     int c = inputs[0].d[0];\n//     int h = inputs[0].d[1];\n//     int w = inputs[0].d[2];\n//     int upscaleFactor = mUpscaleFactor;\n//\n//     assert(c % (upscaleFactor * upscaleFactor) == 0);\n//     int newC = c / (upscaleFactor * upscaleFactor);\n//     int newH = h * upscaleFactor;\n//     int newW = w * upscaleFactor;\n//\n//     return nvinfer1::Dims3(newC, newH, newW);\n// }\n//\n// int PixelShufflePlugin::initialize() {\n//     return 0;\n// }\n//\n// void PixelShufflePlugin::terminate() {\n//     // Clean up\n// }\n//\n// size_t PixelShufflePlugin::getWorkspaceSize(int maxBatchSize) const {\n//     return 0;\n// }\n//\n// int PixelShufflePlugin::enqueue(int batchSize, const void* const* inputs, void** outputs, void* workspace, cudaStream_t stream) {\n//     // Launch CUDA kernel for PixelShuffle\n//     // Assume inputs[0] and outputs[0] are pointers to device memory\n//     const float* input = static_cast<const float*>(inputs[0]);\n//     float* output = 
static_cast<float*>(outputs[0]);\n//\n//     int c = mInputVolume / (mUpscaleFactor * mUpscaleFactor);\n//     int h = mOutputVolume / (c * mUpscaleFactor);\n//     int w = h; // Assuming square input for simplicity\n//     int upscaleFactor = mUpscaleFactor;\n//\n//     // Launch CUDA kernel (to be implemented)\n//     // pixelShuffleKernel(input, output, c, h, w, upscaleFactor, stream);\n//\n//     return 0;\n// }\n//\n// size_t PixelShufflePlugin::getSerializationSize() const {\n//     return sizeof(int) + sizeof(size_t) * 2;\n// }\n//\n// void PixelShufflePlugin::serialize(void* buffer) const {\n//     char* d = static_cast<char*>(buffer);\n//     *reinterpret_cast<int*>(d) = mUpscaleFactor;\n//     d += sizeof(int);\n//     *reinterpret_cast<size_t*>(d) = mInputVolume;\n//     d += sizeof(size_t);\n//     *reinterpret_cast<size_t*>(d) = mOutputVolume;\n// }\n//\n// void PixelShufflePlugin::destroy() {\n//     delete this;\n// }\n//\n// const char* PixelShufflePlugin::getPluginType() const {\n//     return \"PixelShufflePlugin\";\n// }\n//\n// const char* PixelShufflePlugin::getPluginVersion() const {\n//     return \"1\";\n// }\n//\n// void PixelShufflePlugin::setPluginNamespace(const char* pluginNamespace) {\n//     mPluginNamespace = pluginNamespace;\n// }\n//\n// const char* PixelShufflePlugin::getPluginNamespace() const {\n//     return mPluginNamespace;\n// }\n//\n// nvinfer1::IPluginV2IOExt* PixelShufflePlugin::clone() const {\n//     return new PixelShufflePlugin(mUpscaleFactor);\n// }\n//\n// bool PixelShufflePlugin::supportsFormatCombination(int pos, const nvinfer1::PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const {\n//     return inOut[pos].format == nvinfer1::TensorFormat::kLINEAR && inOut[pos].type == nvinfer1::DataType::kFLOAT;\n// }\n//\n// void PixelShufflePlugin::configurePlugin(const nvinfer1::DynamicPluginTensorDesc* in, int nbInputs, const nvinfer1::DynamicPluginTensorDesc* out, int nbOutputs) {\n//     // Configure the plugin 
based on the input and output descriptions\n//     mInputVolume = in[0].desc.volume();\n//     mOutputVolume = out[0].desc.volume();\n// }\n//\n// nvinfer1::DataType PixelShufflePlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const {\n//     return inputTypes[0];\n// }\n//\n// bool PixelShufflePlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const {\n//     return false;\n// }\n//\n// bool PixelShufflePlugin::canBroadcastInputAcrossBatch(int inputIndex) const {\n//     return false;\n// }\n"
  },
  {
    "path": "real-esrgan/general-x4v3/src/pixel_shuffle/pixel_shuffle.cu",
    "content": "#include <cuda_runtime.h>\n#include <string>\n#include \"pixel_shuffle/pixel_shuffle.hpp\"\n\n// CUDA kernel for PixelShuffle\n__global__ void PixelShuffleKernel(const float* input, float* output, int batchSize, int channels, int height,\n                                   int width, int upscaleFactor) {\n    int outHeight = height * upscaleFactor;\n    int outWidth = width * upscaleFactor;\n    int outChannels = channels / (upscaleFactor * upscaleFactor);\n\n    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx >= batchSize * outChannels * outHeight * outWidth)\n        return;\n\n    int out_w = idx % outWidth;\n    int out_h = (idx / outWidth) % outHeight;\n    int out_c = (idx / outWidth / outHeight) % outChannels;\n    int b = idx / (outWidth * outHeight * outChannels);\n\n    int in_c =\n            out_c * upscaleFactor * upscaleFactor + (out_h % upscaleFactor) * upscaleFactor + (out_w % upscaleFactor);\n    int in_h = out_h / upscaleFactor;\n    int in_w = out_w / upscaleFactor;\n\n    output[idx] = input[((b * channels + in_c) * height + in_h) * width + in_w];\n}\n\nint32_t PixelShufflePlugin::enqueue(nvinfer1::PluginTensorDesc const* inputDesc,\n                                    nvinfer1::PluginTensorDesc const* outputDesc, void const* const* inputs,\n                                    void* const* outputs, void* workspace, cudaStream_t stream) noexcept {\n    const float* input = static_cast<const float*>(inputs[0]);\n    float* output = static_cast<float*>(outputs[0]);\n\n    int batchSize = inputDesc[0].dims.d[0];\n    int channels = inputDesc[0].dims.d[1];\n    int height = inputDesc[0].dims.d[2];\n    int width = inputDesc[0].dims.d[3];\n    int upscaleFactor = mUpscaleFactor;\n\n    int outChannels = channels / (upscaleFactor * upscaleFactor);\n    int outHeight = height * upscaleFactor;\n    int outWidth = width * upscaleFactor;\n\n    int numElements = batchSize * outChannels * outHeight * outWidth;\n\n    
PixelShuffleKernel<<<(numElements + 255) / 256, 256, 0, stream>>>(input, output, batchSize, channels, height, width,\n                                                                      upscaleFactor);\n    return cudaGetLastError() != cudaSuccess;\n}\n"
  },
  {
    "path": "real-esrgan/x4plus/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(real-esrgan)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\nif(WIN32)\nenable_language(CUDA)\nendif(WIN32)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of CUDA and TensorRT; adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -g -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\ncuda_add_library(myplugins SHARED preprocess.cu postprocess.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\ncuda_add_executable(real-esrgan real-esrgan.cpp)\n\ntarget_link_libraries(real-esrgan nvinfer)\ntarget_link_libraries(real-esrgan cudart)\ntarget_link_libraries(real-esrgan myplugins)\ntarget_link_libraries(real-esrgan ${OpenCV_LIBS})\n\nif(UNIX)\nadd_definitions(-O2 -pthread)\nendif(UNIX)\n\n\n"
  },
  {
    "path": "real-esrgan/x4plus/README.md",
    "content": "# Real-ESRGAN\nThe PyTorch implementation is [real-esrgan](https://github.com/xinntao/Real-ESRGAN).\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/40158321/170728105-0a1429e8-d117-4844-9c4b-a2d9db4a4ada.png\">\n</p>\n\n## Config\n- Input shape (**INPUT_H**, **INPUT_W**, **INPUT_C**) defined in real-esrgan.cpp\n- GPU id (**DEVICE**) can be selected by the macro in real-esrgan.cpp\n- **BATCH_SIZE** can be selected by the macro in real-esrgan.cpp\n- FP16/FP32 can be selected by **PRECISION_MODE** in real-esrgan.cpp\n- The example result can be visualized by **VISUALIZATION**.\n\n## How to Run, real-esrgan as example\n\n0. prepare test image  \n- download: [OST_009.png](https://drive.google.com/file/d/1KAyAiQ8qHc5jSBkk2Uft2LfIhzi9XSyH/view?usp=sharing)   \n\n```\ncd {tensorrtx}/real-esrgan/\nmkdir sample   \ncp ~/Download/OST_009.png {tensorrtx}/real-esrgan/sample\n```\n\n1. generate .wts from the PyTorch .pth checkpoint, or download .wts from model zoo\n\n```\ngit clone https://github.com/xinntao/Real-ESRGAN.git\ncd Real-ESRGAN\npip install basicsr\npip install facexlib\npip install gfpgan\npip install -r requirements.txt\npython setup.py develop\n\n// download https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth\ncp ~/RealESRGAN_x4plus.pth {xinntao}/Real-ESRGAN/experiments/pretrained_models\n\ncp {tensorrtx}/real-esrgan/gen_wts.py {xinntao}/Real-ESRGAN\ncd {xinntao}/Real-ESRGAN\npython gen_wts.py\n// a file 'real-esrgan.wts' will be generated.\n```\n\n2. 
build tensorrtx/real-esrgan and run\n\n```\ncd {tensorrtx}/real-esrgan/\nmkdir build\ncd build\ncp {xinntao}/Real-ESRGAN/real-esrgan.wts {tensorrtx}/real-esrgan/build\ncmake ..\nmake\nsudo ./real-esrgan -s [.wts] [.engine]   // serialize model to plan file\nsudo ./real-esrgan -d [.engine] [image folder]  // deserialize and run inference; the images in [image folder] will be processed.\n// For example\n// sudo ./real-esrgan -s ./real-esrgan.wts ./real-esrgan_f32.engine\n// sudo ./real-esrgan -d ./real-esrgan_f32.engine ../sample\n\n```\n\n3. check the generated images, for example _OST_009.png\n"
  },
  {
    "path": "real-esrgan/x4plus/common.hpp",
    "content": "#ifndef REAL_ESRGAN_COMMON_H_\n#define REAL_ESRGAN_COMMON_H_\n\n#include <fstream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n\nusing namespace nvinfer1;\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{ DataType::kFLOAT, nullptr, 0 };\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nITensor* residualDenseBlock(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor* x, std::string lname)\n{\n    IConvolutionLayer* conv_1 = network->addConvolutionNd(*x, 32, DimsHW{ 3, 3 }, weightMap[lname + \".conv1.weight\"], weightMap[lname + \".conv1.bias\"]);\n    conv_1->setStrideNd(DimsHW{ 1, 1 });\n    conv_1->setPaddingNd(DimsHW{ 1, 1 });\n    IActivationLayer* leaky_relu_1 = network->addActivation(*conv_1->getOutput(0), ActivationType::kLEAKY_RELU);\n    leaky_relu_1->setAlpha(0.2);\n    ITensor* x1 = 
leaky_relu_1->getOutput(0);\n\n    ITensor* concat_input2[] = { x, x1 };\n    IConcatenationLayer* concat2 = network->addConcatenation(concat_input2, 2);\n    concat2->setAxis(0);\n    IConvolutionLayer* conv_2 = network->addConvolutionNd(*concat2->getOutput(0), 32, DimsHW{ 3, 3 }, weightMap[lname + \".conv2.weight\"], weightMap[lname + \".conv2.bias\"]);\n    conv_2->setStrideNd(DimsHW{ 1, 1 });\n    conv_2->setPaddingNd(DimsHW{ 1, 1 });\n    IActivationLayer* leaky_relu_2 = network->addActivation(*conv_2->getOutput(0), ActivationType::kLEAKY_RELU);\n    leaky_relu_2->setAlpha(0.2);\n    ITensor* x2 = leaky_relu_2->getOutput(0);\n\n    ITensor* concat_input3[] = { x, x1, x2 };\n    IConcatenationLayer* concat3 = network->addConcatenation(concat_input3, 3);\n    concat3->setAxis(0);\n    IConvolutionLayer* conv_3 = network->addConvolutionNd(*concat3->getOutput(0), 32, DimsHW{ 3, 3 }, weightMap[lname + \".conv3.weight\"], weightMap[lname + \".conv3.bias\"]);\n    conv_3->setStrideNd(DimsHW{ 1, 1 });\n    conv_3->setPaddingNd(DimsHW{ 1, 1 });\n    IActivationLayer* leaky_relu_3 = network->addActivation(*conv_3->getOutput(0), ActivationType::kLEAKY_RELU);\n    leaky_relu_3->setAlpha(0.2);\n    ITensor* x3 = leaky_relu_3->getOutput(0);\n\n    ITensor* concat_input4[] = { x, x1, x2, x3 };\n    IConcatenationLayer* concat4 = network->addConcatenation(concat_input4, 4);\n    concat4->setAxis(0);\n    IConvolutionLayer* conv_4 = network->addConvolutionNd(*concat4->getOutput(0), 32, DimsHW{ 3, 3 }, weightMap[lname + \".conv4.weight\"], weightMap[lname + \".conv4.bias\"]);\n    conv_4->setStrideNd(DimsHW{ 1, 1 });\n    conv_4->setPaddingNd(DimsHW{ 1, 1 });\n    IActivationLayer* leaky_relu_4 = network->addActivation(*conv_4->getOutput(0), ActivationType::kLEAKY_RELU);\n    leaky_relu_4->setAlpha(0.2);\n    ITensor* x4 = leaky_relu_4->getOutput(0);\n\n    ITensor* concat_input5[] = { x, x1, x2, x3, x4 };\n    IConcatenationLayer* concat5 = 
network->addConcatenation(concat_input5, 5);\n    concat5->setAxis(0);\n    IConvolutionLayer* conv_5 = network->addConvolutionNd(*concat5->getOutput(0), 64, DimsHW{ 3, 3 }, weightMap[lname + \".conv5.weight\"], weightMap[lname + \".conv5.bias\"]);\n    conv_5->setStrideNd(DimsHW{ 1, 1 });\n    conv_5->setPaddingNd(DimsHW{ 1, 1 });\n    ITensor* x5 = conv_5->getOutput(0);\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float)));\n    *scval = 0.2;\n    Weights scale{ DataType::kFLOAT, scval, 1 };\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float)));\n    *shval = 0.0;\n    Weights shift{ DataType::kFLOAT, shval, 1 };\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float)));\n    *pval = 1.0;\n    Weights power{ DataType::kFLOAT, pval, 1 };\n\n    IScaleLayer* scaled = network->addScale(*x5, ScaleMode::kUNIFORM, shift, scale, power);\n    IElementWiseLayer* ew1 = network->addElementWise(*scaled->getOutput(0), *x, ElementWiseOperation::kSUM);\n    return ew1->getOutput(0);\n}\n\nITensor* RRDB(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor* x, std::string lname)\n{\n    ITensor* out = residualDenseBlock(network, weightMap, x, lname + \".rdb1\");\n    out = residualDenseBlock(network, weightMap, out, lname + \".rdb2\");\n    out = residualDenseBlock(network, weightMap, out, lname + \".rdb3\");\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float)));\n    *scval = 0.2;\n    Weights scale{ DataType::kFLOAT, scval, 1 };\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float)));\n    *shval = 0.0;\n    Weights shift{ DataType::kFLOAT, shval, 1 };\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float)));\n    *pval = 1.0;\n    Weights power{ DataType::kFLOAT, pval, 1 };\n\n    IScaleLayer* scaled = network->addScale(*out, ScaleMode::kUNIFORM, shift, scale, power);\n    IElementWiseLayer* ew1 = network->addElementWise(*scaled->getOutput(0), *x, 
ElementWiseOperation::kSUM);\n    return ew1->getOutput(0);\n}\n\n\n#endif"
  },
  {
    "path": "real-esrgan/x4plus/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n#include <stdint.h>\n#include <cstdio>\n#include <vector>\n#include <iostream>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)\\\n    {\\\n        cudaError_t error_code = callstr;\\\n        if (error_code != cudaSuccess) {\\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__;\\\n            assert(0);\\\n        }\\\n    } \n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n\n"
  },
  {
    "path": "real-esrgan/x4plus/gen_wts.py",
    "content": "import argparse\nimport os\nimport struct\nfrom basicsr.archs.rrdbnet_arch import RRDBNet\nfrom realesrgan import RealESRGANer\nfrom realesrgan.archs.srvgg_arch import SRVGGNetCompact\n\ndef main():\n    \"\"\"Inference demo for Real-ESRGAN.\n    \"\"\"\n    parser = argparse.ArgumentParser()\n    #parser.add_argument('-i', '--input', type=str, default='../TestData3', help='Input image or folder')\n    parser.add_argument('-i', '--input', type=str, default='inputs', help='Input image or folder')\n    parser.add_argument(\n        '-n',\n        '--model_name',\n        type=str,\n        default='RealESRGAN_x4plus',\n        help=('Model names: RealESRGAN_x4plus | RealESRNet_x4plus | RealESRGAN_x4plus_anime_6B | RealESRGAN_x2plus | '\n              'realesr-animevideov3'))\n    parser.add_argument('-o', '--output', type=str, default='results', help='Output folder')\n    parser.add_argument('-s', '--outscale', type=float, default=4, help='The final upsampling scale of the image')\n    parser.add_argument('--suffix', type=str, default='out', help='Suffix of the restored image')\n    parser.add_argument('-t', '--tile', type=int, default=0, help='Tile size, 0 for no tile during testing')\n    parser.add_argument('--tile_pad', type=int, default=10, help='Tile padding')\n    parser.add_argument('--pre_pad', type=int, default=0, help='Pre padding size at each border')\n    parser.add_argument('--face_enhance', action='store_true', help='Use GFPGAN to enhance face')\n    parser.add_argument(\n        '--fp32', action='store_true', help='Use fp32 precision during inference. Default: fp16 (half precision).')\n    parser.add_argument(\n        '--alpha_upsampler',\n        type=str,\n        default='realesrgan',\n        help='The upsampler for the alpha channels. Options: realesrgan | bicubic')\n    parser.add_argument(\n        '--ext',\n        type=str,\n        default='auto',\n        help='Image extension. 
Options: auto | jpg | png, auto means using the same extension as inputs')\n    args = parser.parse_args()\n\n    # determine models according to model names\n    args.model_name = args.model_name.split('.')[0]\n    if args.model_name in ['RealESRGAN_x4plus', 'RealESRNet_x4plus']:  # x4 RRDBNet model\n        model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=4)\n        netscale = 4\n    elif args.model_name in ['RealESRGAN_x4plus_anime_6B']:  # x4 RRDBNet model with 6 blocks\n        model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=6, num_grow_ch=32, scale=4)\n        netscale = 4\n    elif args.model_name in ['RealESRGAN_x2plus']:  # x2 RRDBNet model\n        model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=2)\n        netscale = 2\n    elif args.model_name in ['realesr-animevideov3']:  # x4 VGG-style model (XS size)\n        model = SRVGGNetCompact(num_in_ch=3, num_out_ch=3, num_feat=64, num_conv=16, upscale=4, act_type='prelu')\n        netscale = 4\n    else:\n        # without this branch an unknown name would fail later with a NameError on `model`\n        raise ValueError(f'Unknown model name: {args.model_name}')\n\n    # determine model paths\n    model_path = os.path.join('experiments/pretrained_models', args.model_name + '.pth')\n    if not os.path.isfile(model_path):\n        model_path = os.path.join('realesrgan/weights', args.model_name + '.pth')\n    if not os.path.isfile(model_path):\n        raise ValueError(f'Model {args.model_name} does not exist.')\n\n    # restorer\n    upsampler = RealESRGANer(\n        scale=netscale,\n        model_path=model_path,\n        model=model,\n        tile=args.tile,\n        tile_pad=args.tile_pad,\n        pre_pad=args.pre_pad,\n        half=args.fp32)\n\n    if os.path.isfile('real-esrgan.wts'):\n        print('real-esrgan.wts already exists.')\n    else:\n        print('Writing real-esrgan.wts ...')\n        state_dict = upsampler.model.state_dict()\n        # the context manager closes (and flushes) the file even if writing fails\n        with open(\"real-esrgan.wts\", 'w') as f:\n            f.write(\"{}\\n\".format(len(state_dict)))\n            for k, v in 
state_dict.items():\n                print('key: ', k)\n                print('value: ', v.shape)\n                vr = v.reshape(-1).cpu().numpy()\n                f.write(\"{} {}\".format(k, len(vr)))\n                for vv in vr:\n                    f.write(\" \")\n                    # big-endian IEEE 754 float32, hex-encoded\n                    f.write(struct.pack(\">f\", float(vv)).hex())\n                f.write(\"\\n\")\n        print('Completed real-esrgan.wts file!')\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "real-esrgan/x4plus/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the 
stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override \n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "real-esrgan/x4plus/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "real-esrgan/x4plus/postprocess.cu",
    "content": "#include \"cuda_utils.h\"\n\nusing namespace std;\n\n// postprocess (NCHW->NHWC, RGB->BGR, *255, ROUND, uint8)\n__global__ void postprocess_kernel(uint8_t* output, float* input,\n    const int batchSize, const int height, const int width, const int channel,\n    const int thread_count)\n{\n    int index = threadIdx.x + blockIdx.x * blockDim.x;\n    if (index >= thread_count) return;\n\n    const int c_idx = index % channel;\n    int idx = index / channel;\n    const int w_idx = idx % width;\n    idx /= width;\n    const int h_idx = idx % height;\n    const int b_idx = idx / height;\n\n    int g_idx = b_idx * height * width * channel + (2 - c_idx)* height * width + h_idx * width + w_idx;\n    float tt = input[g_idx] * 255.f;\n    if (tt > 255)\n        tt = 255;\n    output[index] = tt;\n}\n\nvoid postprocess(uint8_t* output, float*input, int batchSize, int height, int width, int channel, cudaStream_t stream)\n{\n    int thread_count = batchSize * height * width * channel;\n    int block = 512;\n    int grid = (thread_count - 1) / block + 1;\n\n    postprocess_kernel << <grid, block, 0, stream >> > (output, input, batchSize, height, width, channel, thread_count);\n}\n\n\n#include \"postprocess.hpp\"\n\nnamespace nvinfer1\n{\n    int PostprocessPluginV2::enqueue(int batchSize, const void* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) noexcept\n    {\n        float* input = (float*)inputs[0];\n        uint8_t* output = (uint8_t*)outputs[0];\n\n        const int H = mPostprocess.H;\n        const int W = mPostprocess.W;\n        const int C = mPostprocess.C;\n\n        postprocess(output, input, batchSize, H, W, C, stream);\n\n        return 0;\n    }\n}"
  },
  {
    "path": "real-esrgan/x4plus/postprocess.hpp",
    "content": "#pragma once\n#include <NvInfer.h>\n#include <fstream>\n#include \"macros.h\"\n#include <assert.h>\n\nstruct Postprocess {\n    int N;\n    int C;\n    int H;\n    int W;\n};\n\nnamespace nvinfer1\n{\n    class PostprocessPluginV2 : public IPluginV2IOExt\n    {\n    public:\n        PostprocessPluginV2(const Postprocess& arg)\n        {\n            mPostprocess = arg;\n        }\n\n        PostprocessPluginV2(const void* data, size_t length)\n        {\n            const char* d = static_cast<const char*>(data);\n            const char* const a = d;\n            mPostprocess = read<Postprocess>(d);\n            assert(d == a + length);\n        }\n        PostprocessPluginV2() = delete;\n\n        virtual ~PostprocessPluginV2() {}\n\n    public:\n        int getNbOutputs() const noexcept override\n        {\n            return 1;\n        }\n\n        Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) noexcept override\n        {\n            return Dims3(mPostprocess.H, mPostprocess.W, mPostprocess.C);\n        }\n\n        int initialize() noexcept override\n        {\n            return 0;\n        }\n\n        void terminate() noexcept override\n        {\n        }\n\n        size_t getWorkspaceSize(int maxBatchSize) const noexcept override\n        {\n            return 0;\n        }\n\n        int enqueue(int batchSize, const void* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) noexcept override;\n\n        size_t getSerializationSize() const noexcept override\n        {\n            size_t serializationSize = 0;\n            serializationSize += sizeof(mPostprocess);\n            return serializationSize;\n        }\n\n        void serialize(void* buffer) const noexcept override\n        {\n            char* d = static_cast<char*>(buffer);\n            const char* const a = d;\n            write(d, mPostprocess);\n            assert(d == a + getSerializationSize());\n        }\n\n        
void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) noexcept override\n        {\n        }\n\n        //! The combination of kLINEAR + kINT8/kHALF/kFLOAT is supported.\n        bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const noexcept override\n        {\n            assert(nbInputs == 1 && nbOutputs == 1 && pos < nbInputs + nbOutputs);\n            bool condition = inOut[pos].format == TensorFormat::kLINEAR;\n            condition &= inOut[pos].type != DataType::kINT32;\n            condition &= inOut[pos].type == inOut[0].type;\n            return condition;\n        }\n\n        DataType getOutputDataType(int index, const DataType* inputTypes, int nbInputs) const noexcept override\n        {\n            assert(inputTypes && nbInputs == 1);\n            return DataType::kFLOAT; //\n        }\n\n        const char* getPluginType() const noexcept override\n        {\n            return \"postprocess\";\n        }\n\n        const char* getPluginVersion() const noexcept override\n        {\n            return \"1\";\n        }\n\n        void destroy() noexcept override\n        {\n            delete this;\n        }\n\n        IPluginV2Ext* clone() const noexcept override\n        {\n            PostprocessPluginV2* plugin = new PostprocessPluginV2(*this);\n            return plugin;\n        }\n\n        void setPluginNamespace(const char* libNamespace) noexcept override\n        {\n            mNamespace = libNamespace;\n        }\n\n        const char* getPluginNamespace() const noexcept override\n        {\n            return mNamespace.data();\n        }\n\n        bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const noexcept override\n        {\n            return false;\n        }\n\n        bool canBroadcastInputAcrossBatch(int inputIndex) const noexcept override\n        {\n            return 
false;\n        }\n\n    private:\n        template <typename T>\n        void write(char*& buffer, const T& val) const\n        {\n            *reinterpret_cast<T*>(buffer) = val;\n            buffer += sizeof(T);\n        }\n\n        template <typename T>\n        T read(const char*& buffer) const\n        {\n            T val = *reinterpret_cast<const T*>(buffer);\n            buffer += sizeof(T);\n            return val;\n        }\n\n    private:\n        Postprocess mPostprocess;\n        std::string mNamespace;\n    };\n\n    class PostprocessPluginV2Creator : public IPluginCreator\n    {\n    public:\n        const char* getPluginName() const noexcept override\n        {\n            return \"postprocess\";\n        }\n\n        const char* getPluginVersion() const noexcept override\n        {\n            return \"1\";\n        }\n\n        const PluginFieldCollection* getFieldNames() noexcept override\n        {\n            return nullptr;\n        }\n\n        IPluginV2* createPlugin(const char* name, const PluginFieldCollection* fc) noexcept override\n        {\n            PostprocessPluginV2* plugin = new PostprocessPluginV2(*(Postprocess*)fc);\n            mPluginName = name;\n            return plugin;\n        }\n\n        IPluginV2* deserializePlugin(const char* name, const void* serialData, size_t serialLength) noexcept override\n        {\n            auto plugin = new PostprocessPluginV2(serialData, serialLength);\n            mPluginName = name;\n            return plugin;\n        }\n\n        void setPluginNamespace(const char* libNamespace) noexcept override\n        {\n            mNamespace = libNamespace;\n        }\n\n        const char* getPluginNamespace() const noexcept override\n        {\n            return mNamespace.c_str();\n        }\n\n    private:\n        std::string mNamespace;\n        std::string mPluginName;\n    };\n    REGISTER_TENSORRT_PLUGIN(PostprocessPluginV2Creator);\n};\n"
  },
  {
    "path": "real-esrgan/x4plus/preprocess.cu",
    "content": "#include \"cuda_utils.h\"\n\nusing namespace std;\n\n// preprocess (NHWC->NCHW, BGR->RGB, [0, 255]->[0, 1](Normalize))\n__global__ void preprocess_kernel(float* output, uint8_t* input,\n    const int batchSize, const int height, const int width, const int channel,\n    const int thread_count)\n{\n    int index = threadIdx.x + blockIdx.x * blockDim.x;\n    if (index >= thread_count) return;\n\n    const int w_idx = index % width;\n    int idx = index / width;\n    const int h_idx = idx % height;\n    idx /= height;\n    const int c_idx = idx % channel;\n    const int b_idx = idx / channel;\n\n    int g_idx = b_idx * height * width * channel + h_idx * width * channel + w_idx * channel + 2 - c_idx;\n\n    output[index] = input[g_idx] / 255.f;\n}\n\nvoid preprocess(float* output, uint8_t*input, int batchSize, int height, int width, int channel, cudaStream_t stream)\n{\n    int thread_count = batchSize * height * width * channel;\n    int block = 512;\n    int grid = (thread_count - 1) / block + 1;\n\n    preprocess_kernel << <grid, block, 0, stream >> > (output, input, batchSize, height, width, channel, thread_count);\n}\n\n#include \"preprocess.hpp\"\n\nnamespace nvinfer1\n{\n    int PreprocessPluginV2::enqueue(int batchSize, const void* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) noexcept\n    {\n        uint8_t* input = (uint8_t*)inputs[0];\n        float* output = (float*)outputs[0];\n\n        const int H = mPreprocess.H;\n        const int W = mPreprocess.W;\n        const int C = mPreprocess.C;\n\n        preprocess(output, input, batchSize, H, W, C, stream);\n\n        return 0;\n    }\n}"
  },
  {
    "path": "real-esrgan/x4plus/preprocess.hpp",
    "content": "#pragma once\n#include <NvInfer.h>\n#include <fstream>\n#include \"macros.h\"\n#include <assert.h>\n\nstruct Preprocess {\n    int N;\n    int C;\n    int H;\n    int W;\n};\n\nnamespace nvinfer1\n{\n    class PreprocessPluginV2 : public IPluginV2IOExt\n    {\n    public:\n        PreprocessPluginV2(const Preprocess& arg)\n        {\n            mPreprocess = arg;\n        }\n\n        PreprocessPluginV2(const void* data, size_t length)\n        {\n            const char* d = static_cast<const char*>(data);\n            const char* const a = d;\n            mPreprocess = read<Preprocess>(d);\n            assert(d == a + length);\n        }\n        PreprocessPluginV2() = delete;\n\n        virtual ~PreprocessPluginV2() {}\n\n    public:\n        int getNbOutputs() const noexcept override\n        {\n            return 1;\n        }\n\n        Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) noexcept override\n        {\n            return Dims3(mPreprocess.C, mPreprocess.H, mPreprocess.W);\n        }\n\n        int initialize() noexcept override\n        {\n            return 0;\n        }\n\n        void terminate() noexcept override\n        {\n        }\n\n        size_t getWorkspaceSize(int maxBatchSize) const noexcept override\n        {\n            return 0;\n        }\n\n        int enqueue(int batchSize, const void* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) noexcept override;\n\n        size_t getSerializationSize() const noexcept override\n        {\n            size_t serializationSize = 0;\n            serializationSize += sizeof(mPreprocess);\n            return serializationSize;\n        }\n\n        void serialize(void* buffer) const noexcept override\n        {\n            char* d = static_cast<char*>(buffer);\n            const char* const a = d;\n            write(d, mPreprocess);\n            assert(d == a + getSerializationSize());\n        }\n\n        void 
configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) noexcept override\n        {\n        }\n\n        //! The combination of kLINEAR + kINT8/kHALF/kFLOAT is supported.\n        bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const noexcept override\n        {\n            assert(nbInputs == 1 && nbOutputs == 1 && pos < nbInputs + nbOutputs);\n            bool condition = inOut[pos].format == TensorFormat::kLINEAR;\n            condition &= inOut[pos].type != DataType::kINT32;\n            condition &= inOut[pos].type == inOut[0].type;\n            return condition;\n        }\n\n        DataType getOutputDataType(int index, const DataType* inputTypes, int nbInputs) const noexcept override\n        {\n            assert(inputTypes && nbInputs == 1);\n            return DataType::kFLOAT; //\n        }\n\n        const char* getPluginType() const noexcept override\n        {\n            return \"preprocess\";\n        }\n\n        const char* getPluginVersion() const noexcept override\n        {\n            return \"1\";\n        }\n\n        void destroy() noexcept override\n        {\n            delete this;\n        }\n\n        IPluginV2Ext* clone() const noexcept override\n        {\n            PreprocessPluginV2* plugin = new PreprocessPluginV2(*this);\n            return plugin;\n        }\n\n        void setPluginNamespace(const char* libNamespace) noexcept override\n        {\n            mNamespace = libNamespace;\n        }\n\n        const char* getPluginNamespace() const noexcept override\n        {\n            return mNamespace.data();\n        }\n\n        bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const noexcept override\n        {\n            return false;\n        }\n\n        bool canBroadcastInputAcrossBatch(int inputIndex) const noexcept override\n        {\n            return false;\n     
   }\n\n    private:\n        template <typename T>\n        void write(char*& buffer, const T& val) const\n        {\n            *reinterpret_cast<T*>(buffer) = val;\n            buffer += sizeof(T);\n        }\n\n        template <typename T>\n        T read(const char*& buffer) const\n        {\n            T val = *reinterpret_cast<const T*>(buffer);\n            buffer += sizeof(T);\n            return val;\n        }\n\n    private:\n        Preprocess mPreprocess;\n        std::string mNamespace;\n    };\n\n    class PreprocessPluginV2Creator : public IPluginCreator\n    {\n    public:\n        const char* getPluginName() const noexcept override\n        {\n            return \"preprocess\";\n        }\n\n        const char* getPluginVersion() const noexcept override\n        {\n            return \"1\";\n        }\n\n        const PluginFieldCollection* getFieldNames() noexcept override\n        {\n            return nullptr;\n        }\n\n        IPluginV2* createPlugin(const char* name, const PluginFieldCollection* fc) noexcept override\n        {\n            PreprocessPluginV2* plugin = new PreprocessPluginV2(*(Preprocess*)fc);\n            mPluginName = name;\n            return plugin;\n        }\n\n        IPluginV2* deserializePlugin(const char* name, const void* serialData, size_t serialLength) noexcept override\n        {\n            auto plugin = new PreprocessPluginV2(serialData, serialLength);\n            mPluginName = name;\n            return plugin;\n        }\n\n        void setPluginNamespace(const char* libNamespace) noexcept override\n        {\n            mNamespace = libNamespace;\n        }\n\n        const char* getPluginNamespace() const noexcept override\n        {\n            return mNamespace.c_str();\n        }\n\n    private:\n        std::string mNamespace;\n        std::string mPluginName;\n    };\n    REGISTER_TENSORRT_PLUGIN(PreprocessPluginV2Creator);\n};\n"
  },
  {
    "path": "real-esrgan/x4plus/real-esrgan.cpp",
    "content": "#include \"cuda_utils.h\"\n#include \"common.hpp\"\n#include \"preprocess.hpp\"// preprocess plugin \n#include \"postprocess.hpp\"// postprocess plugin \n#include \"logging.h\"\n#include \"utils.h\"\n#include <unistd.h>//access()\n\n#define DEVICE 0 // GPU id\n#define BATCH_SIZE 1\n\n// stuff we know about the network and the input/output blobs\nstatic const int PRECISION_MODE = 32; // fp32 : 32, fp16 : 16\nstatic const bool VISUALIZATION = true;\nstatic const int INPUT_H = 640;\nstatic const int INPUT_W = 448;\nstatic const int INPUT_C = 3;\nstatic const int OUT_SCALE = 4;\nstatic const int OUTPUT_SIZE = INPUT_C * INPUT_H * OUT_SCALE * INPUT_W * OUT_SCALE;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nstatic Logger gLogger;\n\n// Creat the engine using only the API and not any parser.\nICudaEngine* build_engine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, std::string& wts_name) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape {INPUT_H, INPUT_W, INPUT_C} with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{ INPUT_H, INPUT_W, INPUT_C });\n    assert(data);\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n    // Custom preprocess (NHWC->NCHW, BGR->RGB, [0, 255]->[0, 1](Normalize))\n    Preprocess preprocess{ maxBatchSize, INPUT_C, INPUT_H, INPUT_W };\n    IPluginCreator* preprocess_creator = getPluginRegistry()->getPluginCreator(\"preprocess\", \"1\");\n    IPluginV2 *preprocess_plugin = preprocess_creator->createPlugin(\"preprocess_plugin\", (PluginFieldCollection*)&preprocess);\n    IPluginV2Layer* preprocess_layer = network->addPluginV2(&data, 1, *preprocess_plugin);\n    preprocess_layer->setName(\"preprocess_layer\");\n    ITensor* prep = preprocess_layer->getOutput(0);\n\n    // conv_first\n    IConvolutionLayer* conv_first = 
network->addConvolutionNd(*prep, 64, DimsHW{ 3, 3 }, weightMap[\"conv_first.weight\"], weightMap[\"conv_first.bias\"]);\n    conv_first->setStrideNd(DimsHW{ 1, 1 });\n    conv_first->setPaddingNd(DimsHW{ 1, 1 });\n    conv_first->setName(\"conv_first\");\n    ITensor* feat = conv_first->getOutput(0);\n\n    // conv_body\n    ITensor* body_feat = RRDB(network, weightMap, feat, \"body.0\");\n    for (int idx = 1; idx < 23; idx++) {\n        body_feat = RRDB(network, weightMap, body_feat, \"body.\" + std::to_string(idx));\n    }\n\n    IConvolutionLayer* conv_body = network->addConvolutionNd(*body_feat, 64, DimsHW{ 3, 3 }, weightMap[\"conv_body.weight\"], weightMap[\"conv_body.bias\"]);\n    conv_body->setStrideNd(DimsHW{ 1, 1 });\n    conv_body->setPaddingNd(DimsHW{ 1, 1 });\n    IElementWiseLayer* ew1 = network->addElementWise(*feat, *conv_body->getOutput(0), ElementWiseOperation::kSUM);\n    feat = ew1->getOutput(0);\n\n    //upsample\n    IResizeLayer* interpolate_nearest = network->addResize(*feat);\n    float sclaes1[] = { 1, 2, 2 };\n    interpolate_nearest->setScales(sclaes1, 3);\n    interpolate_nearest->setResizeMode(ResizeMode::kNEAREST);\n\n    IConvolutionLayer* conv_up1 = network->addConvolutionNd(*interpolate_nearest->getOutput(0), 64, DimsHW{ 3, 3 }, weightMap[\"conv_up1.weight\"], weightMap[\"conv_up1.bias\"]);\n    conv_up1->setStrideNd(DimsHW{ 1, 1 });\n    conv_up1->setPaddingNd(DimsHW{ 1, 1 });\n    IActivationLayer* leaky_relu_1 = network->addActivation(*conv_up1->getOutput(0), ActivationType::kLEAKY_RELU);\n    leaky_relu_1->setAlpha(0.2);\n\n    IResizeLayer* interpolate_nearest2 = network->addResize(*leaky_relu_1->getOutput(0));\n    float sclaes2[] = { 1, 2, 2 };\n    interpolate_nearest2->setScales(sclaes2, 3);\n    interpolate_nearest2->setResizeMode(ResizeMode::kNEAREST);\n    IConvolutionLayer* conv_up2 = network->addConvolutionNd(*interpolate_nearest2->getOutput(0), 64, DimsHW{ 3, 3 }, weightMap[\"conv_up2.weight\"], 
weightMap[\"conv_up2.bias\"]);\n    conv_up2->setStrideNd(DimsHW{ 1, 1 });\n    conv_up2->setPaddingNd(DimsHW{ 1, 1 });\n    IActivationLayer* leaky_relu_2 = network->addActivation(*conv_up2->getOutput(0), ActivationType::kLEAKY_RELU);\n    leaky_relu_2->setAlpha(0.2);\n\n    IConvolutionLayer* conv_hr = network->addConvolutionNd(*leaky_relu_2->getOutput(0), 64, DimsHW{ 3, 3 }, weightMap[\"conv_hr.weight\"], weightMap[\"conv_hr.bias\"]);\n    conv_hr->setStrideNd(DimsHW{ 1, 1 });\n    conv_hr->setPaddingNd(DimsHW{ 1, 1 });\n    IActivationLayer* leaky_relu_hr = network->addActivation(*conv_hr->getOutput(0), ActivationType::kLEAKY_RELU);\n    leaky_relu_hr->setAlpha(0.2);\n    IConvolutionLayer* conv_last = network->addConvolutionNd(*leaky_relu_hr->getOutput(0), 3, DimsHW{ 3, 3 }, weightMap[\"conv_last.weight\"], weightMap[\"conv_last.bias\"]);\n    conv_last->setStrideNd(DimsHW{ 1, 1 });\n    conv_last->setPaddingNd(DimsHW{ 1, 1 });\n    ITensor* out = conv_last->getOutput(0);\n\n    // Custom postprocess (RGB -> BGR, NCHW->NHWC, *255, ROUND, uint8)\n    Postprocess postprocess{ maxBatchSize, out->getDimensions().d[0], out->getDimensions().d[1], out->getDimensions().d[2] };\n    IPluginCreator* postprocess_creator = getPluginRegistry()->getPluginCreator(\"postprocess\", \"1\");\n    IPluginV2 *postprocess_plugin = postprocess_creator->createPlugin(\"postprocess_plugin\", (PluginFieldCollection*)&postprocess);\n    IPluginV2Layer* postprocess_layer = network->addPluginV2(&out, 1, *postprocess_plugin);\n    postprocess_layer->setName(\"postprocess_layer\");\n\n    ITensor* final_tensor = postprocess_layer->getOutput(0);\n    final_tensor->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*final_tensor);\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n\n    if (PRECISION_MODE == 16) {\n        std::cout << \"==== precision f16 ====\" << std::endl << std::endl;\n        
config->setFlag(BuilderFlag::kFP16);\n    }\n    else {\n        std::cout << \"==== precision f32 ====\" << std::endl << std::endl;\n    }\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream, std::string& wts_name) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine *engine = build_engine(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);\n\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    delete engine;\n    delete builder;\n    delete config;\n}\n\nvoid doInference(IExecutionContext& context, cudaStream_t& stream, void **buffers, uint8_t* output, int batchSize) {\n    // infer on the batch asynchronously, and DMA output back to host\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * OUTPUT_SIZE * sizeof(uint8_t), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir) {\n    if (argc < 4) return false;\n    if (std::string(argv[1]) == \"-s\" && argc == 4) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n    }\n    else if (std::string(argv[1]) == \"-d\" && argc == 4) {\n        engine = 
std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n    }\n    else {\n        return false;\n    }\n    return true;\n}\n\n// ./real-esrgan -s ./real-esrgan.wts ./real-esrgan_f32.engine\n// ./real-esrgan -d ./real-esrgan_f32.engine ../samples\n\nint main(int argc, char** argv) {\n    std::string wts_name = \"\";\n    std::string engine_name = \"\";\n    std::string img_dir;\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir)) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./real-esrgan -s [.wts] [.engine] // serialize model to plan file\" << std::endl;\n        std::cerr << \"./real-esrgan -d [.engine] ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    if (!wts_name.empty()) {\n        IHostMemory* modelStream{ nullptr };\n        APIToModel(BATCH_SIZE, &modelStream, wts_name);\n        assert(modelStream != nullptr);\n        std::ofstream p(engine_name, std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        delete modelStream;\n        return 0;\n    }\n\n    // deserialize the .engine and run inference\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        return -1;\n    }\n    char *trtModelStream = nullptr;\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    trtModelStream = new char[size];\n    assert(trtModelStream);\n    file.read(trtModelStream, size);\n    file.close();\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << 
\"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n    assert(engine->getNbBindings() == 2);\n    void* buffers[2];\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n\n    // Create GPU buffers on device\t\n    CUDA_CHECK(cudaMalloc(&buffers[inputIndex], BATCH_SIZE * INPUT_C * INPUT_H * INPUT_W * sizeof(uint8_t)));\n    CUDA_CHECK(cudaMalloc(&buffers[outputIndex], BATCH_SIZE * OUTPUT_SIZE * sizeof(uint8_t)));\n\n    std::vector<uint8_t> input(BATCH_SIZE * INPUT_H * INPUT_W * INPUT_C);\n    std::vector<uint8_t> outputs(BATCH_SIZE * OUTPUT_SIZE);\n\n    // Create stream\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    std::vector<cv::Mat> imgs_buffer(BATCH_SIZE);\n    for (int f = 0; f < (int)file_names.size(); f++) {\n\n        for (int b = 0; b < BATCH_SIZE; b++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[f]);\n            if (img.empty()) continue;\n            memcpy(input.data() + b * INPUT_H * INPUT_W * INPUT_C, img.data, INPUT_H * INPUT_W * INPUT_C);\n        }\n\n        CUDA_CHECK(cudaMemcpyAsync(buffers[inputIndex], input.data(), BATCH_SIZE * INPUT_C * INPUT_H * INPUT_W * sizeof(uint8_t), cudaMemcpyHostToDevice, stream));\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        
doInference(*context, stream, (void**)buffers, outputs.data(), BATCH_SIZE);\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    cv::Mat frame = cv::Mat(INPUT_H * OUT_SCALE, INPUT_W * OUT_SCALE, CV_8UC3, outputs.data());\n    cv::imwrite(\"../_\" + file_names[0] + \".png\", frame);\n\n    if (VISUALIZATION) {\n        cv::imshow(\"result : \" + file_names[0], frame);\n        cv::waitKey(0);\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(buffers[inputIndex]));\n    CUDA_CHECK(cudaFree(buffers[outputIndex]));\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n}"
  },
  {
    "path": "real-esrgan/x4plus/utils.h",
    "content": "#ifndef TRTX_REAL_ESRGAN_UTILS_H_\n#define TRTX_REAL_ESRGAN_UTILS_H_\n\n#include <dirent.h>\n#include <opencv2/opencv.hpp>\n\nstatic inline int read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n#endif  // TRTX_REAL_ESRGAN_UTILS_H_\n\n"
  },
  {
    "path": "refinedet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(refinedet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\n# tensorrt\ninclude_directories(/data_2/tensorrt/TensorRT-7.0.0.11/include/) #include_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/data_2/tensorrt/TensorRT-7.0.0.11/lib/) #link_directories(/usr/lib/x86_64-linux-gnu/)\n\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n\n#find_package(OpenCV)\n#include_directories(OpenCV_INCLUDE_DIRS)\n\ninclude_directories(/home/software_install/opencv3.4.6/include)\nlink_directories(/home/software_install/opencv3.4.6/lib)\n\n\nset(CMAKE_PREFIX_PATH \"/data_1/torch1.1.0\") ###torch1.1.0\nfind_package(Torch REQUIRED)\n\ninclude_directories(/data_1/torch1.1.0/include)\nlink_directories(/data_1/torch1.1.0/lib)\n\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\n\nadd_executable(refinedet ${PROJECT_SOURCE_DIR}/calibrator.cpp ${PROJECT_SOURCE_DIR}/refinedet.cpp)\ntarget_link_libraries(refinedet nvinfer)\ntarget_link_libraries(refinedet cudart)\ntarget_link_libraries(refinedet \"${TORCH_LIBRARIES}\")\ntarget_link_libraries(refinedet opencv_calib3d opencv_core opencv_dnn opencv_imgproc opencv_highgui opencv_imgcodecs caffe2)\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "refinedet/README.md",
    "content": "# RefineDet\n\nFor the Pytorch implementation, you can refer to [luuuyi/RefineDet.PyTorch](https://github.com/luuuyi/RefineDet.PyTorch)\n\n## How to run\n\n```\n1. generate wts file. from pytorch\npython gen_wts_refinedet.py\n// a file 'refinedet.wts' will be generated.\n\n2. build tensorrtx/RefineDet and run or Using clion to open a project(recommend)\nConfiguration file in configure.h\nYou need configure your own paths and modes(SERIALIZE or INFER)\nDetailed information reference configure.h\nmkdir build\ncd build\ncmake ..\nmake\n```\n\n## dependence\n\n```\nTensorRT7.0.0.11 \nOpenCV >= 3.4\nlibtorch >=1.1.0\n```\n\n## feature\n\n1.tensorrt Multi output  \n2.L2norm  \n3.Postprocessing with libtorch\n\n## More Information\n\nSee the readme in [home page.](https://github.com/wang-xinyu/tensorrtx)  \n[tensorrt tutorials](https://github.com/wang-xinyu/tensorrtx/tree/master/tutorials)  \nFor more detailed guidance, see [yhl blog](https://www.cnblogs.com/yanghailin/p/14525128.html)\n\n"
  },
  {
    "path": "refinedet/calibrator.cpp",
    "content": "#include <iostream>\n#include <iterator>\n#include <fstream>\n#include <opencv2/dnn/dnn.hpp>\n#include \"calibrator.h\"\n#include \"cuda_runtime_api.h\"\n#include \"utils.h\"\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache)\n    : batchsize_(batchsize)\n    , input_w_(input_w)\n    , input_h_(input_h)\n    , img_idx_(0)\n    , img_dir_(img_dir)\n    , calib_table_name_(calib_table_name)\n    , input_blob_name_(input_blob_name)\n    , read_cache_(read_cache)\n{\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2()\n{\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const\n{\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings)\n{\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + img_files_[i]);\n        if (temp.empty()){\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n//        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(temp);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0, cv::Size(input_w_, input_h_), cv::Scalar(123.0, 117.0, 104.0), true, false);\n//    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0), true, false);\n\n    
CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length)\n{\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good())\n    {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length)\n{\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n\n"
  },
  {
    "path": "refinedet/calibrator.h",
    "content": "#ifndef ENTROPY_CALIBRATOR_H\n#define ENTROPY_CALIBRATOR_H\n\n#include \"NvInfer.h\"\n#include <string>\n#include <vector>\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2\n{\npublic:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache = true);\n\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) override;\n    const void* readCalibrationCache(size_t& length) override;\n    void writeCalibrationCache(const void* cache, size_t length) override;\n\nprivate:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\n#endif // ENTROPY_CALIBRATOR_H\n"
  },
  {
    "path": "refinedet/configure.h",
    "content": "\n#define USE_FP32  // set USE_INT8 or USE_FP16 or USE_FP32\n\nconst int num_class = 25; //num_class + 1     //Including background class\n\n//SERIALIZE: It indicates that to generate engin by serialization, the following path needs to be set,path_wts_ and path_save_engine\n//INFER: It shows that it is a reasoning mode,the following path needs to be set,path_engine\n#define INFER    //SERIALIZE   INFER\n\nconst std::string path_engine = \"/data_2//cmake-build-debug/refinedet_0312-now.engine\";\nconst std::string path_wts = \"/data_1/refinedet/pytorch_refinedet-master/refinedet0312.wts\";\nconst std::string path_save_engine = \"./refinedet_0312-now.engine\";\n\n//Picture folder to be detected\nconst char *p_dir_name = \"/data_1/img/\";\n\nconst float TH = 0.2;  //Confidence threshold\nconst int T_show = 1; //1:Show the effect      0:Test map to generate TXT\n//The path to save the generated TXT when testing the map\nstd::string save_path_txt = \"/data_1/txt/\";\n\n#define DEVICE 0  // GPU id\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 320;\nstatic const int INPUT_W = 320;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME_arm_loc = \"arm_loc\";\nconst char* OUTPUT_BLOB_NAME_arm_conf = \"arm_conf\";\nconst char* OUTPUT_BLOB_NAME_odm_loc = \"odm_loc\";\nconst char* OUTPUT_BLOB_NAME_odm_conf = \"odm_conf\";\n\nstd::string label_map[] =\n        {\n                \"background\",\n                \"aa\",\n                \"bb\",\n                \"cc\",\n                \"dd\",\n                \"ee\",\n                \"ff\",\n                \"gg\",\n                \"hh\",\n                \"ii\",\n                \"jj\",\n                \"kk\",\n                \"ll\",\n                \"mm\",\n                \"nn\",\n                \"oo\",\n                \"pp\",\n                \"qq\",\n                \"rr\",\n                \"ss\",\n                \"tt\",\n                
\"uu\",\n                \"vv\",\n                \"ww\",\n                \"xx\"\n        };"
  },
  {
    "path": "refinedet/gen_wts_refinedet.py",
    "content": "import torch\nimport torch.nn as nn\nimport struct\nfrom models.refinedet import build_refinedet\n\n\n\nnum_classes = 25\npath_model = \"/data_2/project_2021/pytorch_refinedet/2021/20210308.pth\"\npath_save_wts = \"./refinedet0312.wts\"\ninput_size = 320\n\nnet = build_refinedet('test', input_size, num_classes)  # initialize net\nnet.load_state_dict(torch.load(path_model))\nnet.eval()\n\n\nf = open(path_save_wts, 'w')\nf.write('{}\\n'.format(len(net.state_dict().keys())))\nfor k, v in net.state_dict().items():\n    vr = v.reshape(-1).cpu().numpy()\n    f.write('{} {} '.format(k, len(vr)))\n    for vv in vr:\n        f.write(' ')\n        f.write(struct.pack('>f',float(vv)).hex())\n    f.write('\\n')\n\nprint(\"success generate wts!\")"
  },
  {
    "path": "refinedet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "refinedet/refinedet.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"utils.h\"\n#include \"logging.h\"\n#include \"calibrator.h\"\n#include \"configure.h\"\n\n#include <torch/script.h> // One-stop header.\n#include \"torch/torch.h\"\n#include \"torch/jit.h\"\n\nusing namespace nvinfer1;\nstatic Logger gLogger;\n\n//Correct the rectangle area to prevent the image from crossing the boundary\nvoid RoiCorrect(const cv::Mat &m, cv::Rect &r)\n{\n    if (r.x < 0) r.x = 0;\n    if (r.y < 0) r.y = 0;\n\n    if(r.x >= m.cols-1) r.x=0;\n    if(r.y >= m.rows-1) r.y=0;\n\n    if(r.width <= 0) r.width = 1;\n    if(r.height <= 0) r.height = 1;\n\n    if(r.x + r.width > m.cols - 1) r.width = m.cols - 1 - r.x;\n    if(r.y + r.height > m.rows - 1) r.height = m.rows - 1 - r.y;\n}\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        
weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\n//convBnLeaky(network, weightMap, *data, 32, 3, 1, 1, 0);\nILayer* convRelu(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,  int outch, int ksize, int s, int p,\\\n        int linx, const std::string pre_name = \"vgg.\", bool b_dilate = false) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    if (weightMap.count(pre_name + std::to_string(linx) + \".weight\") == 0)\n        std::cout << \"no key: \" <<pre_name + std::to_string(linx) + \".weight\" << std::endl;\n\n  
  if (weightMap.count(pre_name + std::to_string(linx) + \".bias\") == 0)\n        std::cout << \"no key: \" <<pre_name + std::to_string(linx) + \".bias\" << std::endl;\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[pre_name + std::to_string(linx) + \".weight\"], weightMap[pre_name + std::to_string(linx) + \".bias\"]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n    if(true == b_dilate)\n    {\n       conv1->setDilation(DimsHW{3, 3});\n    }\n\n    auto lr = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n\n    return lr;\n}\n\n//convBnLeaky(network, weightMap, *data, 32, 3, 1, 1, 0);\nILayer* convRelu_extras(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,  int outch, int ksize, int s, int p, const std::string weight_name, const std::string bias_name){\n\n    if (weightMap.count(weight_name) == 0)\n        std::cout << \"no key: \" <<weight_name << std::endl;\n\n    if (weightMap.count(bias_name) == 0)\n        std::cout << \"no key: \" <<bias_name << std::endl;\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[weight_name], weightMap[bias_name]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    auto lr = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n\n    return lr;\n}\n\n//convBnLeaky(network, weightMap, *data, 32, 3, 1, 1, 0);\nIConvolutionLayer* convReluconv_tcb0(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,  int outch, int ksize, int s, int p, int indx_0, int indx_1){\n\n    std::string name_w0 = \"tcb0.\" + (std::string)std::to_string(indx_0) + \".weight\";\n    std::string name_b0 = \"tcb0.\" + (std::string)std::to_string(indx_0) + \".bias\";\n\n    std::string name_w1 = \"tcb0.\" + 
(std::string)std::to_string(indx_1) + \".weight\";\n    std::string name_b1 = \"tcb0.\" + (std::string)std::to_string(indx_1) + \".bias\";\n\n    if (weightMap.count(name_w0) == 0)\n        std::cout << \"no key: \" <<name_w0 << std::endl;\n    if (weightMap.count(name_b0) == 0)\n        std::cout << \"no key: \" <<name_b0 << std::endl;\n    if (weightMap.count(name_w1) == 0)\n        std::cout << \"no key: \" <<name_w1 << std::endl;\n    if (weightMap.count(name_b1) == 0)\n        std::cout << \"no key: \" <<name_b1 << std::endl;\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[name_w0], weightMap[name_b0]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    auto lr = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*lr->getOutput(0), 256, DimsHW{3, 3}, weightMap[name_w1], weightMap[name_b1]);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{1, 1});\n    conv2->setPaddingNd(DimsHW{1, 1});\n\n    return conv2;\n}\n\nILayer* ReluconvRelu_tcb2(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,  int outch, int ksize, int s, int p, int indx_0){\n    auto lr = network->addActivation(input, ActivationType::kRELU);\n\n    std::string name_w0 = \"tcb2.\" + (std::string)std::to_string(indx_0) + \".weight\";\n    std::string name_b0 = \"tcb2.\" + (std::string)std::to_string(indx_0) + \".bias\";\n\n    if (weightMap.count(name_w0) == 0)\n        std::cout << \"no key: \" <<name_w0 << std::endl;\n\n    if (weightMap.count(name_b0) == 0)\n        std::cout << \"no key: \" <<name_b0 << std::endl;\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*lr->getOutput(0), outch, DimsHW{ksize, ksize}, weightMap[name_w0], weightMap[name_b0]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    auto 
lr1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    return lr1;\n}\n\nILayer* conv_permutation(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,  int outch, int ksize, int s, int p, const std::string weight_name, const std::string bias_name)\n{\n    if (weightMap.count(weight_name) == 0)\n        std::cout << \"no key: \" <<weight_name << std::endl;\n    if (weightMap.count(bias_name) == 0)\n        std::cout << \"no key: \" <<bias_name << std::endl;\n    IConvolutionLayer* a0 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[weight_name], weightMap[bias_name]);\n    assert(a0);\n    a0->setStrideNd(DimsHW{s, s});\n    a0->setPaddingNd(DimsHW{p, p});\n\n    auto sfl = network->addShuffle(*a0->getOutput(0));\n    sfl->setFirstTranspose(Permutation{1, 2, 0});\n\n    return sfl;\n}\n\nILayer* cat_4_tensor(INetworkDefinition *network, ILayer*tensor_0, ILayer*tensor_1, ILayer*tensor_2, ILayer*tensor_3)\n{\n    Dims dim_;\n    dim_.nbDims=1;\n    dim_.d[0]=-1;\n    //40 40 12 --->>40*40*12\n    auto arm_loc_00 = network->addShuffle(*tensor_0->getOutput(0));\n    assert(arm_loc_00);\n    arm_loc_00->setReshapeDimensions(dim_);\n\n    //20 20 12 --->>20*20*12\n    auto arm_loc_11 = network->addShuffle(*tensor_1->getOutput(0));\n    assert(arm_loc_11);\n    arm_loc_11->setReshapeDimensions(dim_);  //Dims2(-1, 1)\n\n    //10 10 12 --->>10*10*12\n    auto arm_loc_22 = network->addShuffle(*tensor_2->getOutput(0));\n    assert(arm_loc_22);\n    arm_loc_22->setReshapeDimensions(dim_);\n\n    //5 5 12 --->>5*5*12\n    auto arm_loc_33 = network->addShuffle(*tensor_3->getOutput(0));\n    assert(arm_loc_33);\n    arm_loc_33->setReshapeDimensions(dim_);\n\n//\n//    Dims dim0 = arm_loc_00->getOutput(0)->getDimensions();\n//    std::cout <<\"debug  arm_loc_0 dim==\" << dim0.d[0] << \" \" << dim0.d[1] << \" \" << dim0.d[2] << \" \" << dim0.d[3] << std::endl;\n//    Dims dim1 = 
arm_loc_11->getOutput(0)->getDimensions();\n//    std::cout <<\"debug  arm_loc_1 dim==\" << dim1.d[0] << \" \" << dim1.d[1] << \" \" << dim1.d[2] << \" \" << dim1.d[3] << std::endl;\n//    Dims dim2 = arm_loc_22->getOutput(0)->getDimensions();\n//    std::cout <<\"debug  arm_loc_2 dim==\" << dim2.d[0] << \" \" << dim2.d[1] << \" \" << dim2.d[2] << \" \" << dim2.d[3] << std::endl;\n//    Dims dim3 = arm_loc_33->getOutput(0)->getDimensions();\n//    std::cout <<\"debug  arm_loc_3 dim==\" << dim3.d[0] << \" \" << dim3.d[1] << \" \" << dim3.d[2] << \" \" << dim3.d[3] << std::endl;\n\n    ITensor* arm_loc_t[] = {arm_loc_00->getOutput(0), arm_loc_11->getOutput(0), arm_loc_22->getOutput(0), arm_loc_33->getOutput(0)};\n    auto arm_loc = network->addConcatenation(arm_loc_t, 4);\n    //[25500]\n    return arm_loc;\n}\n\n\nILayer* reshapeSoftmax(INetworkDefinition *network, ITensor& input, int ch) {\n    //The input is one-dimensional[12750]\n    //reshape[XX,ch]\n    auto re1 = network->addShuffle(input);\n    assert(re1);\n    re1->setReshapeDimensions(Dims3(1, -1, ch)); //[1,6375,2];\n//     re1->setReshapeDimensions(Dims2(-1, ch)); //[6375,2];\n\n    // The reshaped tensor has three dimensions, so only d[0]..d[2] are meaningful\n    Dims dim0 = re1->getOutput(0)->getDimensions();\n    std::cout <<\"debug  re1 dim==\" << dim0.d[0] << \" \" << dim0.d[1] << \" \" << dim0.d[2] << std::endl;\n\n//    return re1;/////////////////////////////////////////\n\n    auto sm = network->addSoftMax(*re1->getOutput(0));\n    // Check the layer pointer before dereferencing it\n    assert(sm);\n    sm->setAxes(1<<2);\n    //And then reshape one-dimensional again, and it's the same shape as it came in\n    Dims dim_;\n    dim_.nbDims=1;\n    dim_.d[0]=-1;\n    auto re2 = network->addShuffle(*sm->getOutput(0));\n    assert(re2);\n    re2->setReshapeDimensions(dim_);\n\n    return re2;\n}\n\nIScaleLayer* L2norm(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, const std::string pre_name = \"conv4_3_L2Norm.weight\")\n{\n    //aa = x.pow(2)  ## [1,512,40,40]\n    const 
static float pval1[3]{0.0, 1.0, 2.0};\n    Weights wshift1{DataType::kFLOAT, pval1, 1};\n    Weights wscale1{DataType::kFLOAT, pval1+1, 1};\n    Weights wpower1{DataType::kFLOAT, pval1+2, 1};\n    IScaleLayer* scale1 = network->addScale(\n            input,\n            ScaleMode::kUNIFORM,\n            wshift1,\n            wscale1,\n            wpower1);\n    assert(scale1);\n\n   //bb =  x.pow(2).sum(dim=1, keepdim=True)  ## [1,1,40,40]\n    IReduceLayer* reduce1 = network->addReduce(*scale1->getOutput(0),\n                                               ReduceOperation::kSUM,\n                                               1,\n                                               true);\n    assert(reduce1);\n\n    //norm = x.pow(2).sum(dim=1, keepdim=True).sqrt()+self.eps  # [1,1,40,40]\n    const static float pval2[3]{0.0, 1.0, 0.5};\n    Weights wshift2{DataType::kFLOAT, pval2, 1};\n    Weights wscale2{DataType::kFLOAT, pval2+1, 1};\n    Weights wpower2{DataType::kFLOAT, pval2+2, 1};\n    IScaleLayer* scale2 = network->addScale(\n            *reduce1->getOutput(0),\n            ScaleMode::kUNIFORM,\n            wshift2,\n            wscale2,\n            wpower2);\n    assert(scale2);\n\n    // x = torch.div(x,norm)\n    IElementWiseLayer* ew2 = network->addElementWise(input,\n                                                     *scale2->getOutput(0),\n                                                     ElementWiseOperation::kDIV);\n    assert(ew2);\n\n    //out = self.weight.unsqueeze(0).unsqueeze(2).unsqueeze(3).expand_as(x) * x\n    int len = weightMap[pre_name].count;\n    float* pval3 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    std::fill_n(pval3, len, 1.0);\n    Weights wpower3{DataType::kFLOAT, pval3, len};\n    weightMap[pre_name + \".power3\"] = wpower3;\n\n    float* pval4 = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    std::fill_n(pval4, len, 0.0);\n    Weights wpower4{DataType::kFLOAT, pval4, len};\n    weightMap[pre_name 
+ \".power4\"] = wpower4;\n\n    IScaleLayer* scale3 = network->addScale(\n            *ew2->getOutput(0),\n            ScaleMode::kCHANNEL,\n            wpower4,\n            weightMap[pre_name],\n            wpower3);\n    assert(scale3);\n    return scale3;\n}\n\n\n//convBnLeaky(network, weightMap, *data, 32, 3, 1, 1, 0);\nILayer* convBnLeaky(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,  int outch, int ksize, int s, int p, int linx) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[\"module_list.\" + std::to_string(linx) + \".Conv2d.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"module_list.\" + std::to_string(linx) + \".BatchNorm2d\", 1e-5);\n\n    auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kLEAKY_RELU);\n    lr->setAlpha(0.1);\n\n    return lr;\n}\n\n// Create the engine using only the API, without any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(path_wts);\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    DimsHW maxpool_hw = DimsHW(2,2);\n\n    auto lr0 = convRelu(network, weightMap, *data, 64, 3, 1, 1, 0);\n    auto lr1 = convRelu(network, weightMap, *lr0->getOutput(0), 64, 3, 1, 1, 2);\n    IPoolingLayer* pool1 = network->addPoolingNd(*lr1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    auto lr2 = convRelu(network, weightMap, *pool1->getOutput(0), 
128, 3, 1, 1, 5);\n    auto lr3 = convRelu(network, weightMap, *lr2->getOutput(0), 128, 3, 1, 1, 7);\n    IPoolingLayer* pool2 = network->addPoolingNd(*lr3->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{2, 2});\n\n    auto lr4 = convRelu(network, weightMap, *pool2->getOutput(0), 256, 3, 1, 1, 10);\n    auto lr5 = convRelu(network, weightMap, *lr4->getOutput(0), 256, 3, 1, 1, 12);\n    auto lr6 = convRelu(network, weightMap, *lr5->getOutput(0), 256, 3, 1, 1, 14);\n    IPoolingLayer* pool3 = network->addPoolingNd(*lr6->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool3);\n    pool3->setStrideNd(DimsHW{2, 2});\n\n    auto lr7 = convRelu(network, weightMap, *pool3->getOutput(0), 512, 3, 1, 1, 17);\n    auto lr8 = convRelu(network, weightMap, *lr7->getOutput(0), 512, 3, 1, 1, 19);\n    auto lr9 = convRelu(network, weightMap, *lr8->getOutput(0), 512, 3, 1, 1, 21);\n    IPoolingLayer* pool4 = network->addPoolingNd(*lr9->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool4);\n    pool4->setStrideNd(DimsHW{2, 2});\n\n    auto lr24 = convRelu(network, weightMap, *pool4->getOutput(0), 512, 3, 1, 1, 24);\n    auto lr26 = convRelu(network, weightMap, *lr24->getOutput(0), 512, 3, 1, 1, 26);\n    auto lr28 = convRelu(network, weightMap, *lr26->getOutput(0), 512, 3, 1, 1, 28);\n    IPoolingLayer* pool5 = network->addPoolingNd(*lr28->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool5);\n    pool5->setStrideNd(DimsHW{2, 2});\n\n    auto lr31 = convRelu(network, weightMap, *pool5->getOutput(0), 1024, 3, 1, 3, 31,\"vgg.\",true);\n\n    //s_0\n    auto out_conv4_3_L2Norm = L2norm(network, weightMap, *lr9->getOutput(0),\"conv4_3_L2Norm.weight\");\n    //s_1\n    auto out_conv5_3_L2Norm = L2norm(network, weightMap, *lr28->getOutput(0),\"conv5_3_L2Norm.weight\");\n\n    //s_2\n    auto lr33 = convRelu(network, weightMap, *lr31->getOutput(0), 1024, 1, 1, 0, 33);\n\n    auto extras0 = 
convRelu_extras(network, weightMap, *lr33->getOutput(0), 256, 1, 1, 0, \"extras.0.weight\", \"extras.0.bias\");\n    //s_3\n    auto extras1 = convRelu_extras(network, weightMap, *extras0->getOutput(0), 512, 3, 2, 1, \"extras.1.weight\", \"extras.1.bias\");\n\n    auto arm_loc_0 = conv_permutation(network, weightMap, *out_conv4_3_L2Norm->getOutput(0), 12, 3, 1, 1, \"arm_loc.0.weight\", \"arm_loc.0.bias\");\n    auto arm_loc_1 = conv_permutation(network, weightMap, *out_conv5_3_L2Norm->getOutput(0), 12, 3, 1, 1, \"arm_loc.1.weight\", \"arm_loc.1.bias\");\n    auto arm_loc_2 = conv_permutation(network, weightMap, *lr33->getOutput(0), 12, 3, 1, 1, \"arm_loc.2.weight\", \"arm_loc.2.bias\");\n    auto arm_loc_3 = conv_permutation(network, weightMap, *extras1->getOutput(0), 12, 3, 1, 1, \"arm_loc.3.weight\", \"arm_loc.3.bias\");\n\n    auto arm_conf_0 = conv_permutation(network, weightMap, *out_conv4_3_L2Norm->getOutput(0), 6, 3, 1, 1, \"arm_conf.0.weight\", \"arm_conf.0.bias\");\n    auto arm_conf_1 = conv_permutation(network, weightMap, *out_conv5_3_L2Norm->getOutput(0), 6, 3, 1, 1, \"arm_conf.1.weight\", \"arm_conf.1.bias\");\n    auto arm_conf_2 = conv_permutation(network, weightMap, *lr33->getOutput(0), 6, 3, 1, 1, \"arm_conf.2.weight\", \"arm_conf.2.bias\");\n    auto arm_conf_3 = conv_permutation(network, weightMap, *extras1->getOutput(0), 6, 3, 1, 1, \"arm_conf.3.weight\", \"arm_conf.3.bias\");\n\n    auto arm_loc = cat_4_tensor(network, arm_loc_0, arm_loc_1, arm_loc_2, arm_loc_3);\n    auto arm_conf = cat_4_tensor(network, arm_conf_0, arm_conf_1, arm_conf_2, arm_conf_3);\n\n    auto ss_0 = convReluconv_tcb0(network, weightMap, *extras1->getOutput(0),  256, 3, 1, 1, 9, 11);\n    auto ss_00 = ReluconvRelu_tcb2(network, weightMap, *ss_0->getOutput(0),  256, 3, 1, 1, 10);\n    auto ss_1 = convReluconv_tcb0(network, weightMap, *lr33->getOutput(0),  256, 3, 1, 1, 6, 8);\n\n    IDeconvolutionLayer* tcb1_2 = network->addDeconvolutionNd(*ss_00->getOutput(0), 256, 
DimsHW{2, 2}, weightMap[\"tcb1.2.weight\"], weightMap[\"tcb1.2.bias\"]);  //nn.ConvTranspose2d(256, 256, 2, 2)\n    tcb1_2->setStrideNd(DimsHW{2, 2});\n    assert(tcb1_2);\n    auto ss_1_add = network->addElementWise(*ss_1->getOutput(0), *tcb1_2->getOutput(0), ElementWiseOperation::kSUM);\n    auto ss_11 = ReluconvRelu_tcb2(network, weightMap, *ss_1_add->getOutput(0),  256, 3, 1, 1, 7);\n\n    auto ss_2 = convReluconv_tcb0(network, weightMap, *out_conv5_3_L2Norm->getOutput(0),  256, 3, 1, 1, 3, 5);\n    IDeconvolutionLayer* tcb1_1 = network->addDeconvolutionNd(*ss_11->getOutput(0), 256, DimsHW{2, 2}, weightMap[\"tcb1.1.weight\"], weightMap[\"tcb1.1.bias\"]);  //nn.ConvTranspose2d(256, 256, 2, 2)\n    tcb1_1->setStrideNd(DimsHW{2, 2});\n    assert(tcb1_1);\n    auto ss_2_add = network->addElementWise(*ss_2->getOutput(0), *tcb1_1->getOutput(0), ElementWiseOperation::kSUM);\n    auto ss_22 = ReluconvRelu_tcb2(network, weightMap, *ss_2_add->getOutput(0),  256, 3, 1, 1, 4);\n\n    auto ss_3 = convReluconv_tcb0(network, weightMap, *out_conv4_3_L2Norm->getOutput(0),  256, 3, 1, 1, 0, 2);\n    IDeconvolutionLayer* tcb1_0 = network->addDeconvolutionNd(*ss_22->getOutput(0), 256, DimsHW{2, 2}, weightMap[\"tcb1.0.weight\"], weightMap[\"tcb1.0.bias\"]);  //nn.ConvTranspose2d(256, 256, 2, 2)\n    tcb1_0->setStrideNd(DimsHW{2, 2});\n    assert(tcb1_0);\n    auto ss_3_add = network->addElementWise(*ss_3->getOutput(0), *tcb1_0->getOutput(0), ElementWiseOperation::kSUM);\n    auto ss_33 = ReluconvRelu_tcb2(network, weightMap, *ss_3_add->getOutput(0),  256, 3, 1, 1, 1);\n\n    auto odm_loc_0 = conv_permutation(network, weightMap, *ss_33->getOutput(0), 12, 3, 1, 1, \"odm_loc.0.weight\", \"odm_loc.0.bias\");\n    auto odm_loc_1 = conv_permutation(network, weightMap, *ss_22->getOutput(0), 12, 3, 1, 1, \"odm_loc.1.weight\", \"odm_loc.1.bias\");\n    auto odm_loc_2 = conv_permutation(network, weightMap, *ss_11->getOutput(0), 12, 3, 1, 1, \"odm_loc.2.weight\", \"odm_loc.2.bias\");\n    
auto odm_loc_3 = conv_permutation(network, weightMap, *ss_00->getOutput(0), 12, 3, 1, 1, \"odm_loc.3.weight\", \"odm_loc.3.bias\");\n\n    auto odm_conf_0 = conv_permutation(network, weightMap, *ss_33->getOutput(0), 3 * num_class, 3, 1, 1, \"odm_conf.0.weight\", \"odm_conf.0.bias\");\n    auto odm_conf_1 = conv_permutation(network, weightMap, *ss_22->getOutput(0), 3 * num_class, 3, 1, 1, \"odm_conf.1.weight\", \"odm_conf.1.bias\");\n    auto odm_conf_2 = conv_permutation(network, weightMap, *ss_11->getOutput(0), 3 * num_class, 3, 1, 1, \"odm_conf.2.weight\", \"odm_conf.2.bias\");\n    auto odm_conf_3 = conv_permutation(network, weightMap, *ss_00->getOutput(0), 3 * num_class, 3, 1, 1, \"odm_conf.3.weight\", \"odm_conf.3.bias\");\n\n    auto odm_loc = cat_4_tensor(network, odm_loc_0, odm_loc_1, odm_loc_2, odm_loc_3);\n    auto odm_conf = cat_4_tensor(network, odm_conf_0, odm_conf_1, odm_conf_2, odm_conf_3);\n\n    //25500\n    Dims dim = arm_loc->getOutput(0)->getDimensions();\n    std::cout <<\"debug  arm_loc dim==\" << dim.d[0] << \" \" << dim.d[1] << \" \" << dim.d[2] << \" \" << dim.d[3] << std::endl;\n    arm_loc->getOutput(0)->setName(OUTPUT_BLOB_NAME_arm_loc);\n    network->markOutput(*arm_loc->getOutput(0));\n\n    auto arm_conf_111 = reshapeSoftmax(network, *arm_conf->getOutput(0), 2);\n    //12750\n    Dims dim2 = arm_conf_111->getOutput(0)->getDimensions();\n    std::cout <<\"debug  arm_conf dim==\" << dim2.d[0] << \" \" << dim2.d[1] << \" \" << dim2.d[2] << \" \" << dim2.d[3] << std::endl;\n    arm_conf_111->getOutput(0)->setName(OUTPUT_BLOB_NAME_arm_conf);\n    network->markOutput(*arm_conf_111->getOutput(0));\n\n    //25500\n    Dims dim3 = odm_loc->getOutput(0)->getDimensions();\n    std::cout <<\"debug  odm_loc dim==\" << dim3.d[0] << \" \" << dim3.d[1] << \" \" << dim3.d[2] << \" \" << dim3.d[3] << std::endl;\n    odm_loc->getOutput(0)->setName(OUTPUT_BLOB_NAME_odm_loc);\n    network->markOutput(*odm_loc->getOutput(0));\n\n    //159375\n    Dims dim4 
= odm_conf->getOutput(0)->getDimensions();\n    odm_conf = reshapeSoftmax(network, *odm_conf->getOutput(0), 25);\n    std::cout <<\"debug  odm_conf dim==\" << dim4.d[0] << \" \" << dim4.d[1] << \" \" << dim4.d[2] << \" \" << dim4.d[3] << std::endl;\n    odm_conf->getOutput(0)->setName(OUTPUT_BLOB_NAME_odm_conf);\n    network->markOutput(*odm_conf->getOutput(0));\n\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2 *calibrator = new Int8EntropyCalibrator2(1, INPUT_W, INPUT_H, \"./coco_calib/\", \"int8calib.table\", INPUT_BLOB_NAME);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    
builder->destroy();\n}\n\ntorch::Tensor PriorBox()\n{\n    std::vector<float> mean;\n    std::vector<int> feature_maps = {40,20,10,5};\n    int image_size = 320;\n    std::vector<int> steps = {8,16,32,64};\n    std::vector<int> min_sizes = {32,64,128,256};\n    std::vector<int> aspect_ratios = {2,2,2,2};\n    for(int k=0;k<feature_maps.size();k++)\n    {\n        int f = feature_maps[k];\n        for(int i=0;i<f;i++)\n        {\n            for(int j=0;j<f;j++)\n            {\n                float f_k = image_size * 1.0 / steps[k];\n                float cx = (j + 0.5) / f_k;\n                float cy = (i + 0.5) / f_k;\n                float s_k = min_sizes[k] * 1.0 / image_size;\n                mean.push_back(cx);\n                mean.push_back(cy);\n                mean.push_back(s_k);\n                mean.push_back(s_k);\n\n                float ar = aspect_ratios[k];\n                mean.push_back(cx);\n                mean.push_back(cy);\n                mean.push_back(s_k * 1.0 * sqrt(ar));\n                mean.push_back(s_k * 1.0 / sqrt(ar));\n\n                mean.push_back(cx);\n                mean.push_back(cy);\n                mean.push_back(s_k * 1.0 / sqrt(ar));\n                mean.push_back(s_k * 1.0 * sqrt(ar));\n            }\n        }\n    }\n\n    torch::Tensor m_prior;\n    int m_prior_size = 6375;\n    m_prior = torch::from_blob(mean.data(),{m_prior_size,4}).cuda();\n    m_prior = m_prior.clamp(0,1);\n    //    std::cout<<m_prior<<std::endl;\n    return m_prior.toType(torch::kFloat64);\n}\n\n\ntorch::Tensor decode(const torch::Tensor _loc,torch::Tensor _prior,bool b_form_pt = false)\n{\n    std::vector<float> variance({0.1,0.2});\n    torch::Tensor top_2 = torch::tensor({0,1}).cuda().to(torch::kLong);\n    torch::Tensor bottom_2 = torch::tensor({2,3}).cuda().to(torch::kLong);\n\n    auto c1 = _prior.index_select(1,top_2)+_loc.index_select(1,top_2).mul(variance[0])*_prior.index_select(1,bottom_2);\n    auto c2 = 
_prior.index_select(1,bottom_2)*torch::exp(_loc.index_select(1,bottom_2)*variance[1]);\n    auto _retv = torch::cat({c1,c2},1);\n    if(b_form_pt)\n    {\n        auto c3 = _retv.index_select(1,top_2)-_retv.index_select(1,bottom_2).div(2);\n        auto c4 = c3 + _retv.index_select(1,bottom_2);\n        return torch::cat({c3,c4},1);\n    } else\n    {\n        return _retv;\n    }\n\n}\n\ntorch::Tensor center(torch::Tensor retv)\n{\n    auto c1 = retv.select(1,0).unsqueeze(1);\n    auto c2 = retv.select(1,1).unsqueeze(1);\n    auto c3 = retv.select(1,2).unsqueeze(1);\n    auto c4 = retv.select(1,3).unsqueeze(1);\n\n    auto _retv = torch::cat({(c1+c3).div(2),(c2+c4).div(2),c3-c1,c4-c2},1);\n    return _retv;\n}\n\nbool nms(const torch::Tensor& boxes, const torch::Tensor& scores, torch::Tensor &keep, int &count,float overlap, int top_k)\n{\n    count =0;\n    keep = torch::zeros({scores.size(0)}).to(torch::kLong).to(scores.device());\n    if(0 == boxes.numel())\n    {\n        return false;\n    }\n\n    torch::Tensor x1 = boxes.select(1,0).clone();\n    torch::Tensor y1 = boxes.select(1,1).clone();\n    torch::Tensor x2 = boxes.select(1,2).clone();\n    torch::Tensor y2 = boxes.select(1,3).clone();\n    torch::Tensor area = (x2-x1)*(y2-y1);\n    //    std::cout<<area<<std::endl;\n\n    std::tuple<torch::Tensor,torch::Tensor> sort_ret = torch::sort(scores.unsqueeze(1), 0, 0);\n    torch::Tensor v = std::get<0>(sort_ret).squeeze(1).to(scores.device());\n    torch::Tensor idx = std::get<1>(sort_ret).squeeze(1).to(scores.device());\n\n    int num_ = idx.size(0);\n    if(num_ > top_k) //python:idx = idx[-top_k:]\n    {\n        idx = idx.slice(0,num_-top_k,num_).clone();\n    }\n    torch::Tensor xx1,yy1,xx2,yy2,w,h;\n    while(idx.numel() > 0)\n    {\n        auto i = idx[-1];\n        keep[count] = i;\n        count += 1;\n        if(1 == idx.size(0))\n        {\n            break;\n        }\n        idx = idx.slice(0,0,idx.size(0)-1).clone();\n\n        xx1 = 
x1.index_select(0,idx);\n        yy1 = y1.index_select(0,idx);\n        xx2 = x2.index_select(0,idx);\n        yy2 = y2.index_select(0,idx);\n\n        xx1 = xx1.clamp(x1[i].item().toFloat(),INT_MAX*1.0);\n        yy1 = yy1.clamp(y1[i].item().toFloat(),INT_MAX*1.0);\n        xx2 = xx2.clamp(INT_MIN*1.0,x2[i].item().toFloat());\n        yy2 = yy2.clamp(INT_MIN*1.0,y2[i].item().toFloat());\n\n        w = xx2 - xx1;\n        h = yy2 - yy1;\n\n        w = w.clamp(0,INT_MAX);\n        h = h.clamp(0,INT_MAX);\n\n        torch::Tensor inter = w * h;\n        torch::Tensor rem_areas = area.index_select(0,idx);\n\n        torch::Tensor union_ = (rem_areas - inter) + area[i];\n        torch::Tensor Iou = inter * 1.0 / union_;\n        torch::Tensor index_small = Iou < overlap;\n        auto mask_idx = torch::nonzero(index_small).squeeze();\n        idx = idx.index_select(0,mask_idx);//pthon: idx = idx[IoU.le(overlap)]\n    }\n    return true;\n}\n\nvoid doInference(IExecutionContext& context, void* buffers[], cudaStream_t &stream, float* input, std::vector<std::vector<float>> &detections) {\n    auto start_infer = std::chrono::system_clock::now();\n    detections.clear();\n    int batchSize = 1;\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n//    std::cout<<\"engine.getNbBindings()===\"<<engine.getNbBindings()<<std::endl;\n    assert(engine.getNbBindings() == 5);\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex_arm_loc = engine.getBindingIndex(OUTPUT_BLOB_NAME_arm_loc);\n    const int outputIndex_arm_conf = engine.getBindingIndex(OUTPUT_BLOB_NAME_arm_conf);\n    const int outputIndex_odm_loc = 
engine.getBindingIndex(OUTPUT_BLOB_NAME_odm_loc);\n    const int outputIndex_odm_conf = engine.getBindingIndex(OUTPUT_BLOB_NAME_odm_conf);\n//    const int outputIndex2 = engine.getBindingIndex(\"prob2\");\n//    printf(\"inputIndex=%d\\n\",inputIndex);\n//    printf(\"outputIndex_arm_loc=%d\\n\",outputIndex_arm_loc);\n//    printf(\"outputIndex_arm_conf=%d\\n\",outputIndex_arm_conf);\n//    printf(\"outputIndex_odm_loc=%d\\n\",outputIndex_odm_loc);\n//    printf(\"outputIndex_odm_conf=%d\\n\",outputIndex_odm_conf);\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CUDA_CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    cudaDeviceSynchronize();\n    auto end_infer = std::chrono::system_clock::now();\n    double during_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_infer - start_infer).count();\n    std::cout <<\"time consume context.enqueue===\" <<  during_time << \"ms\" << std::endl;\n\n    auto start_houchuli = std::chrono::system_clock::now();\n    int m_prior_size = 6375;\n    torch::Tensor m_prior = PriorBox();\n    torch::Tensor arm_loc = torch::from_blob(buffers[outputIndex_arm_loc],{m_prior_size,4}).cuda().toType(torch::kFloat64).unsqueeze(0);\n    torch::Tensor arm_conf = torch::from_blob(buffers[outputIndex_arm_conf],{m_prior_size,2}).cuda().toType(torch::kFloat64).unsqueeze(0);\n    torch::Tensor odm_loc = torch::from_blob(buffers[outputIndex_odm_loc],{m_prior_size,4}).cuda().toType(torch::kFloat64).unsqueeze(0);\n    torch::Tensor odm_conf = torch::from_blob(buffers[outputIndex_odm_conf],{m_prior_size,25}).cuda().toType(torch::kFloat64).unsqueeze(0);\n\n    float obj_threshed = 0.01;\n    torch::Tensor arm_object_conf = arm_conf.squeeze(0).select(1,1);\n    torch::Tensor object_index = arm_object_conf > obj_threshed;\n    
object_index=object_index.unsqueeze(1);\n\n    torch::Tensor object_index_1 = object_index.expand_as(odm_conf.squeeze(0)).toType(torch::kFloat64);\n    auto filter_odm_conf = odm_conf.squeeze(0).toType(torch::kFloat64) * object_index_1;\n    torch::Tensor conf_preds_ = filter_odm_conf.clone().toType(torch::kFloat64);\n    torch::Tensor conf_preds = conf_preds_.transpose(1,0).toType(torch::kFloat64);\n    torch::Tensor default_m = decode(arm_loc[0],m_prior);\n//    default_m = center(default_m);\n    bool b_form_pt = true;\n    torch::Tensor decode_boxes_m = decode(odm_loc[0],default_m,b_form_pt);//6375,4\n\n    float conf_thresh = 0.01;\n    float mask_thresh = 0.01;\n\n    torch::Tensor result_out;\n    for(int i=1;i<25;i++)\n    {\n        torch::Tensor c_mask_m = conf_preds[i] > mask_thresh;\n        torch::Tensor nonzero_index = torch::nonzero(c_mask_m);\n        torch::Tensor  score_m = torch::index_select(conf_preds[i],0,nonzero_index.squeeze(1));\n        torch::Tensor  boxes_m = torch::index_select(decode_boxes_m,0,nonzero_index.squeeze(1));\n\n        torch::Tensor keep;\n        int count = 0;\n        float overlap = 0.45;\n        int top_k=1000;\n        nms(boxes_m, score_m, keep, count, overlap, top_k);\n        if(0 == count) { continue; }\n\n        keep = keep.slice(0,0,count).clone();\n        torch::Tensor score_my = score_m.index_select(0,keep);\n        torch::Tensor boxes_my = boxes_m.index_select(0,keep);\n\n        if(score_my[0].item().toFloat() < conf_thresh)\n        {\n            continue;\n        }\n//        boxes_my.select(1,0).mul_(width);\n//        boxes_my.select(1,1).mul_(height);\n//        boxes_my.select(1,2).mul_(width);\n//        boxes_my.select(1,3).mul_(height);\n        torch::Tensor label_tensor = torch::full_like(score_my.unsqueeze(1),i);\n        torch::Tensor result_ = torch::cat({boxes_my.toType(torch::kFloat64),score_my.unsqueeze(1).toType(torch::kFloat64),label_tensor.toType(torch::kFloat64)},1);\n        if(0 
== result_out.numel())\n        {\n            result_out = result_.clone();\n        }else\n        {\n            result_out = torch::cat({result_out,result_},0);// concatenate along rows\n        }\n    }\n    if(0 == result_out.numel()) { std::cout<<\"libtorch refinedet obj_small: nothing detected!\"<<std::endl; return ;}\n    result_out =result_out.cpu();\n\n    // x1,y1,x2,y2,score,id\n    auto result_data = result_out.accessor<double, 2>();\n    for(int i=0;i<result_data.size(0);i++)\n    {\n        float score = result_data[i][4];\n        float x1 = result_data[i][0];\n        float y1 = result_data[i][1];\n        float x2 = result_data[i][2];\n        float y2 = result_data[i][3];\n        int id_label = result_data[i][5];\n\n        std::vector<float> v_detections;\n        v_detections.push_back(0); //image_id\n        v_detections.push_back(id_label); //label\n        v_detections.push_back(score); //score\n        v_detections.push_back(x1); //xmin\n        v_detections.push_back(y1); //ymin\n        v_detections.push_back(x2); //xmax\n        v_detections.push_back(y2); //ymax\n        detections.push_back(v_detections);\n    }\n    cudaDeviceSynchronize();\n    auto end_houchuli = std::chrono::system_clock::now();\n    double during_time_houchuli = std::chrono::duration_cast<std::chrono::milliseconds>(end_houchuli - start_houchuli).count();\n    std::cout <<\"time consume houchuli===\" <<  during_time_houchuli << \"ms\" << std::endl;\n}\n\nvoid base_transform(const cv::Mat &m_src,float *data)\n{\n    cv::Mat image;\n    cv::resize(m_src,image,cv::Size(INPUT_W,INPUT_H));\n    if(1 == image.channels()) { cv::cvtColor(image,image,CV_GRAY2BGR); }\n\n    for(int i=0;i<INPUT_H;i++)\n    {\n        uchar* img_data = image.ptr<uchar>(i); //Get the first address of the row pointer\n        for(int j=0;j<INPUT_W;j++)\n        {\n            int offset = i * INPUT_W + j; // row-major index: row i, column j in a plane of width INPUT_W\n            data[offset] = (float)(img_data[j*3 + 2] * 1.0 - 123.0);\n            data[offset + 
INPUT_H * INPUT_W] = (float)(img_data[j*3 + 1] * 1.0 - 117.0);\n            data[offset + 2 * INPUT_H * INPUT_W] = (float)(img_data[j*3 + 0] * 1.0 - 104.0);\n        }\n    }\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n#ifdef SERIALIZE\n    IHostMemory* modelStream{nullptr};\n    APIToModel(1, &modelStream);\n    assert(modelStream != nullptr);\n    std::ofstream p(path_save_engine, std::ios::binary);\n    if (!p) {\n        std::cerr << \"could not open plan output file\" << std::endl;\n        return -1;\n    }\n    p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n    modelStream->destroy();\n    return 0;\n\n#elif defined  INFER\n    std::ifstream file(path_engine, std::ios::binary);\n    if (file.good()) {\n        file.seekg(0, file.end);\n        size = file.tellg();\n        file.seekg(0, file.beg);\n        trtModelStream = new char[size];\n        assert(trtModelStream);\n        file.read(trtModelStream, size);\n        file.close();\n    }\n\n#else\n    std::cerr << \"arguments not right!\" << std::endl;\n    std::cerr << \"configure.h should define SERIALIZE or INFER\" << std::endl;\n    std::cerr << \"please check!\" << std::endl;\n    return -1;\n#endif\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(p_dir_name, file_names) < 0) {\n        std::cout << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // prepare input data ---------------------------\n    float data[3 * INPUT_H * INPUT_W];\n\n    IRuntime* runtime = createInferRuntime(gLogger);     //400M\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size); //777M\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();  //971M\n    assert(context != nullptr);\n  
  delete[] trtModelStream;\n\n    const int batchSize = 1;\n    const int inputIndex=0;\n    const int outputIndex_arm_loc=1;\n    const int outputIndex_arm_conf=3;\n    const int outputIndex_odm_loc=2;\n    const int outputIndex_odm_conf=4;\n\n    //Initialize cuda  memory:  input and 4 output memory\n    void* buffers[5];\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc(&buffers[0], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n\n    const int OUTPUT_SIZE_arm_loc = 25500; //40*40*12 + 20*20*12 + 10*10*12 + 5*5*12 = 25500   (Fixed value)\n    CUDA_CHECK(cudaMalloc(&buffers[outputIndex_arm_loc], batchSize * OUTPUT_SIZE_arm_loc * sizeof(float)));\n\n    const int OUTPUT_SIZE_arm_conf = 12750; //40*40*6 + 20*20*6 + 10*10*6 + 5*5*6 = 12750 (Fixed value)\n    CUDA_CHECK(cudaMalloc(&buffers[outputIndex_arm_conf], batchSize * OUTPUT_SIZE_arm_conf * sizeof(float)));\n\n    const int OUTPUT_SIZE_odm_loc = 25500; //40*40*12 + 20*20*12 + 10*10*12 + 5*5*12 = 25500   (Fixed value)\n    CUDA_CHECK(cudaMalloc(&buffers[outputIndex_odm_loc], batchSize * OUTPUT_SIZE_odm_loc * sizeof(float)));\n\n    const int OUTPUT_SIZE_odm_conf = 159375; //40*40*(num_class*3) + 20*20**(num_class*3) + 10*10**(num_class*3) + 5*5**(num_class*3) //here num_class=25// =159375\n    CUDA_CHECK(cudaMalloc(&buffers[outputIndex_odm_conf], batchSize * OUTPUT_SIZE_odm_conf * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n\n    int fcount = 0;\n    auto t_0 = std::chrono::steady_clock::now();\n    for (auto f: file_names) {\n        fcount++;\n        std::cout << \"\\n\" << fcount << \"  \" << f << std::endl;\n        std::cout << std::string(p_dir_name) + \"/\" + f << std::endl;\n\n        auto start_read = std::chrono::system_clock::now();\n        cv::Mat img = cv::imread(std::string(p_dir_name) + \"/\" + f);\n        cudaDeviceSynchronize();\n        auto end_read = std::chrono::system_clock::now();\n        double 
during_time_read = std::chrono::duration_cast<std::chrono::milliseconds>(end_read - start_read).count();\n        std::cout <<\"time consume during_time_read===\" <<  during_time_read << \"ms\" << std::endl;\n\n        if (img.empty()) continue;\n\n        auto start_yuchuli = std::chrono::system_clock::now();\n        base_transform(img,data);\n        cudaDeviceSynchronize();\n        auto end_yuchuli = std::chrono::system_clock::now();\n        double during_time_yuchuli = std::chrono::duration_cast<std::chrono::milliseconds>(end_yuchuli - start_yuchuli).count();\n        std::cout <<\"time consume base_transform===\" <<  during_time_yuchuli << \"ms\" << std::endl;\n\n        auto start_doInfer = std::chrono::system_clock::now();\n        std::vector<std::vector<float>> detections;\n        doInference(*context, buffers, stream, data, detections);\n        cudaDeviceSynchronize();\n        auto end_doInfer = std::chrono::system_clock::now();\n        double during_doinfer = std::chrono::duration_cast<std::chrono::milliseconds>(end_doInfer - start_doInfer).count();\n        std::cout <<\"time consume doInference===\" <<  during_doinfer << \"ms\" << std::endl;\n\n        /* Print the detection results. 
*/\n        for (size_t i = 0; i < detections.size(); ++i)\n        {\n            const std::vector<float> &d = detections[i];\n\n            CHECK_EQ(d.size(), 7);\n            const float score = d[2];\n\n            int label = int(d[1]);\n            if (label >= num_class || label < 0)\n            {\n                std::cout << \"label_Error!\" << std::endl;\n                continue;\n            }\n            if(score < TH)\n            {\n                continue;\n            }\n            cv::Rect r;\n            r.x = d[3] * img.cols;\n            r.y = d[4] * img.rows;\n            r.width = d[5] * img.cols - r.x;\n            r.height = d[6] * img.rows - r.y;\n\n            RoiCorrect(img, r);\n            if(T_show)\n            {\n                cv::rectangle(img,r,cv::Scalar(255,0,0),2);\n            }\n            if (T_show == 0)\n            {\n                std::string name_1 = f.substr(0,f.size()-4);\n                std::string path_txt = save_path_txt + name_1 + \".txt\";\n                std::ofstream fout(path_txt);\n                fout << label_map[label] << \" \" << score << \" \" << r.x << \" \" << r.y << \" \" << r.x + r.width\n                     << \" \" << r.y + r.height << std::endl; // use your own labels\n            }\n        }\n        if(T_show)\n        {\n            cv::namedWindow(\"show\",0);\n            cv::imshow(\"show\",img);\n            cv::waitKey(0);\n        }\n    }\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(buffers[inputIndex]));\n    CUDA_CHECK(cudaFree(buffers[outputIndex_arm_loc]));\n    CUDA_CHECK(cudaFree(buffers[outputIndex_arm_conf]));\n\n    CUDA_CHECK(cudaFree(buffers[outputIndex_odm_loc]));\n    CUDA_CHECK(cudaFree(buffers[outputIndex_odm_conf]));\n\n    cudaDeviceSynchronize();\n    auto ttt = std::chrono::duration_cast<std::chrono::milliseconds>\n            (std::chrono::steady_clock::now() - t_0).count();\n    
std::cout << \"all consume time=\"<<ttt <<\"ms\"<<std::endl;\n    std::cout << \"-----------end-----------------------\"<<std::endl;\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n    return 0;\n}\n"
  },
  {
    "path": "refinedet/utils.h",
    "content": "#ifndef __TRT_UTILS_H_\n#define __TRT_UTILS_H_\n\n#include <iostream>\n#include <vector>\n#include <string>\n#include <cassert>\n#include <cstring>\n#include <algorithm>\n#include <cudnn.h>\n#include <dirent.h>\n#include <opencv2/opencv.hpp>\n\n#ifndef CUDA_CHECK\n\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n\n#endif\n\nnamespace Tn\n{\n    template<typename T> \n    void write(char*& buffer, const T& val)\n    {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> \n    void read(const char*& buffer, T& val)\n    {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n}\n\nstatic inline int read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 
0;\n}\n\n#endif\n"
  },
  {
    "path": "repvgg/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(repvgg)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nadd_executable(repvgg ${PROJECT_SOURCE_DIR}/repvgg.cpp)\ntarget_link_libraries(repvgg nvinfer)\ntarget_link_libraries(repvgg cudart)\n\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "repvgg/README.md",
    "content": "# RepVGG\n\nRepVGG models from\n\"RepVGG: Making VGG-style ConvNets Great Again\" <https://arxiv.org/pdf/2101.03697.pdf>\n\nFor the PyTorch implementation, you can refer to [DingXiaoH/RepVGG](https://github.com/DingXiaoH/RepVGG)\n\n# How to run\n\n1. Generate the .wts file.\n\n```\ngit clone https://github.com/DingXiaoH/RepVGG.git\ncd RepVGG\n```\n\nYou may convert a trained model into the inference-time structure with\n\n```\npython convert.py [weights file of the training-time model to load] [path to save] -a [model name]\n```\n\nFor example,\n\n```\npython convert.py RepVGG-B2-train.pth RepVGG-B2-deploy.pth -a RepVGG-B2\n```\n\nThen copy `gen_wts.py` into the `RepVGG` directory and generate the .wts file, for example\n\n```\npython gen_wts.py -w RepVGG-B2-deploy.pth -s RepVGG-B2.wts\n```\n\n2. Build and run.\n\n```\ncd tensorrtx/repvgg\n\nmkdir build\n\ncd build\n\ncmake ..\n\nmake\n\nsudo ./repvgg -s RepVGG-B2  // serialize model to plan file i.e. 'RepVGG-B2.engine'\nsudo ./repvgg -d RepVGG-B2  // deserialize plan file and run inference\n```\n\n"
  },
  {
    "path": "repvgg/gen_wts.py",
    "content": "import argparse\nimport struct\n\nimport torch\n\n\ndef main(args):\n    # Load model\n    state_dict = torch.load(args.weight)\n    with open(args.save_path, \"w\") as f:\n        f.write(\"{}\\n\".format(len(state_dict.keys())))\n        for k, v in state_dict.items():\n            vr = v.reshape(-1).cpu().numpy()\n            f.write(\"{} {} \".format(k, len(vr)))\n            for vv in vr:\n                f.write(\" \")\n                f.write(struct.pack(\">f\", float(vv)).hex())\n            f.write(\"\\n\")\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"-w\",\n        \"--weight\",\n        type=str,\n        required=True,\n        help=\"RepVGG model weight path\",\n    )\n    parser.add_argument(\n        \"-s\",\n        \"--save_path\",\n        type=str,\n        required=True,\n        help=\"generated wts path\",\n    )\n    args = parser.parse_args()\n    main(args)"
  },
  {
    "path": "repvgg/logging.h",
    "content": "#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <iostream>\n\n// Logger for TensorRT info/warning/errors\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger() : Logger(Severity::kINFO) {}\n\n    Logger(Severity severity) : reportableSeverity(severity) {}\n\n    void log(Severity severity, const char *msg) override\n    {\n        // suppress messages with severity enum value greater than the reportable\n        if (severity > reportableSeverity)\n            return;\n\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR:\n            std::cerr << \"INTERNAL_ERROR: \";\n            break;\n        case Severity::kERROR:\n            std::cerr << \"ERROR: \";\n            break;\n        case Severity::kWARNING:\n            std::cerr << \"WARNING: \";\n            break;\n        case Severity::kINFO:\n            std::cerr << \"INFO: \";\n            break;\n        default:\n            std::cerr << \"UNKNOWN: \";\n            break;\n        }\n        std::cerr << msg << std::endl;\n    }\n\n    Severity reportableSeverity{Severity::kWARNING};\n};\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "repvgg/repvgg.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <cmath>\n#include <algorithm>\n\n#define CHECK(status)                                          \\\n    do                                                         \\\n    {                                                          \\\n        auto ret = (status);                                   \\\n        if (ret != 0)                                          \\\n        {                                                      \\\n            std::cerr << \"Cuda failure: \" << ret << std::endl; \\\n            abort();                                           \\\n        }                                                      \\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\n#define MAX_BATCH_SIZE 1\nconst std::vector<int> groupwise_layers{2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26};\nconst std::map<std::string, int> groupwise_counts = {\n    {\"RepVGG-A0\", 1},\n    {\"RepVGG-A1\", 1},\n    {\"RepVGG-A2\", 1},\n    {\"RepVGG-B0\", 1},\n    {\"RepVGG-B1\", 1},\n    {\"RepVGG-B1g2\", 2},\n    {\"RepVGG-B1g4\", 4},\n    {\"RepVGG-B2\", 1},\n    {\"RepVGG-B2g2\", 2},\n    {\"RepVGG-B2g4\", 4},\n    {\"RepVGG-B3\", 1},\n    {\"RepVGG-B3g2\", 2},\n    {\"RepVGG-B3g4\", 4}};\nconst std::map<std::string, std::vector<int>> num_blocks = {\n    {\"RepVGG-A0\", {2, 4, 14, 1}},\n    {\"RepVGG-A1\", {2, 4, 14, 1}},\n    {\"RepVGG-A2\", {2, 4, 14, 1}},\n    {\"RepVGG-B0\", {4, 6, 16, 1}},\n    {\"RepVGG-B1\", {4, 6, 16, 1}},\n    {\"RepVGG-B1g2\", {4, 6, 16, 1}},\n    {\"RepVGG-B1g4\", {4, 6, 16, 1}},\n    {\"RepVGG-B2\", {4, 6, 16, 1}},\n    {\"RepVGG-B2g2\", {4, 6, 16, 1}},\n    {\"RepVGG-B2g4\", {4, 6, 16, 1}},\n    {\"RepVGG-B3\", {4, 6, 16, 1}},\n    {\"RepVGG-B3g2\", {4, 6, 16, 1}},\n    {\"RepVGG-B3g4\", {4, 6, 16, 
1}}};\nconst std::map<std::string, std::vector<float>> width_multiplier = {\n    {\"RepVGG-A0\", {0.75, 0.75, 0.75, 2.5}},\n    {\"RepVGG-A1\", {1, 1, 1, 2.5}},\n    {\"RepVGG-A2\", {1.5, 1.5, 1.5, 2.75}},\n    {\"RepVGG-B0\", {1, 1, 1, 2.5}},\n    {\"RepVGG-B1\", {2, 2, 2, 4}},\n    {\"RepVGG-B1g2\", {2, 2, 2, 4}},\n    {\"RepVGG-B1g4\", {2, 2, 2, 4}},\n    {\"RepVGG-B2\", {2.5, 2.5, 2.5, 5}},\n    {\"RepVGG-B2g2\", {2.5, 2.5, 2.5, 5}},\n    {\"RepVGG-B2g4\", {2.5, 2.5, 2.5, 5}},\n    {\"RepVGG-B3\", {3, 3, 3, 5}},\n    {\"RepVGG-B3g2\", {3, 3, 3, 5}},\n    {\"RepVGG-B3g4\", {3, 3, 3, 5}}};\n\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char *INPUT_BLOB_NAME = \"data\";\nconst char *OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [name] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob: each value is the hex encoding of a 32-bit float, so allocate one uint32_t per value\n        uint32_t *val = reinterpret_cast<uint32_t *>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        
weightMap[name] = wt;\n    }\n    std::cout << \"Finished loading weights: \" << file << std::endl;\n    return weightMap;\n}\n\n// A deploy-time RepVGG block: a single re-parameterized 3x3 convolution followed by ReLU.\nIActivationLayer *RepVGGBlock(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, ITensor &input, int inch, int outch, int stride, int groups, std::string lname)\n{\n    IConvolutionLayer *conv = network->addConvolutionNd(input, outch, DimsHW{3, 3}, weightMap[lname + \"rbr_reparam.weight\"], weightMap[lname + \"rbr_reparam.bias\"]);\n    // Check the layer before configuring it\n    assert(conv);\n    conv->setStrideNd(DimsHW{stride, stride});\n    conv->setPaddingNd(DimsHW{1, 1});\n    conv->setNbGroups(groups);\n    IActivationLayer *relu = network->addActivation(*conv->getOutput(0), ActivationType::kRELU);\n    assert(relu);\n    return relu;\n}\n\n// Stack `blocks` RepVGG blocks; the first block uses `stride`, the remaining blocks use stride 1.\nIActivationLayer *makeStage(INetworkDefinition *network, std::map<std::string, Weights> &weightMap, int &layer_idx, const int group_count, ITensor &input, int inch, int outch, int stride, int blocks, std::string lname)\n{\n    IActivationLayer *layer = nullptr;\n    for (int i = 0; i < blocks; ++i)\n    {\n        // Only the layers listed in groupwise_layers use grouped convolution\n        int group = 1;\n        if (std::find(groupwise_layers.begin(), groupwise_layers.end(), layer_idx) != groupwise_layers.end())\n            group = group_count;\n        if (i == 0)\n            layer = RepVGGBlock(network, weightMap, input, inch, outch, stride, group, lname + std::to_string(i) + \".\");\n        else\n            layer = RepVGGBlock(network, weightMap, *layer->getOutput(0), inch, outch, 1, group, lname + std::to_string(i) + \".\");\n        layer_idx += 1;\n    }\n    return layer;\n}\n// Create the engine using only the API and not any parser.\nICudaEngine *createEngine(std::string netName, unsigned int maxBatchSize, IBuilder *builder, IBuilderConfig *config, DataType dt)\n{\n    const std::vector<int> blocks = num_blocks.at(netName);\n    const std::vector<float> widths = width_multiplier.at(netName);\n    const int group_count = groupwise_counts.at(netName);\n    int layer_idx = 1;\n\n    
std::map<std::string, Weights> weightMap = loadWeights(\"../\" + netName + \".wts\");\n\n    INetworkDefinition *network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor *data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    int in_planes = std::min(64, int(64 * widths[0]));\n    auto stage0 = RepVGGBlock(network, weightMap, *data, 3, in_planes, 2, 1, \"stage0.\");\n    assert(stage0);\n\n    auto stage1 = makeStage(network, weightMap, layer_idx, group_count, *stage0->getOutput(0), in_planes, int(64 * widths[0]), 2, blocks[0], \"stage1.\");\n    assert(stage1);\n    auto stage2 = makeStage(network, weightMap, layer_idx, group_count, *stage1->getOutput(0), int(64 * widths[0]), int(128 * widths[1]), 2, blocks[1], \"stage2.\");\n    assert(stage2);\n    auto stage3 = makeStage(network, weightMap, layer_idx, group_count, *stage2->getOutput(0), int(128 * widths[1]), int(256 * widths[2]), 2, blocks[2], \"stage3.\");\n    assert(stage3);\n    auto stage4 = makeStage(network, weightMap, layer_idx, group_count, *stage3->getOutput(0), int(256 * widths[2]), int(512 * widths[3]), 2, blocks[3], \"stage4.\");\n    assert(stage4);\n\n    IPoolingLayer *pool = network->addPoolingNd(*stage4->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n    pool->setStrideNd(DimsHW{7, 7});\n    pool->setPaddingNd(DimsHW{0, 0});\n    assert(pool);\n\n    IFullyConnectedLayer *linear = network->addFullyConnected(*pool->getOutput(0), 1000, weightMap[\"linear.weight\"], weightMap[\"linear.bias\"]);\n    assert(linear);\n\n    linear->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*linear->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);\n    
std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto &mem : weightMap)\n    {\n        free((void *)(mem.second.values));\n    }\n    return engine;\n}\n\nvoid APIToModel(std::string netName, unsigned int maxBatchSize, IHostMemory **modelStream)\n{\n    // Create builder\n    IBuilder *builder = createInferBuilder(gLogger);\n    IBuilderConfig *config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine *engine = createEngine(netName, maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid doInference(IExecutionContext &context, float *input, float *output, int batchSize)\n{\n    const ICudaEngine &engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void *buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    CHECK(cudaStreamSynchronize(stream));\n\n    // Release stream and buffers\n    CHECK(cudaStreamDestroy(stream));\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char **argv)\n{\n    if (argc != 3)\n    {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./repvgg -s  RepVGG-B1g2 // serialize model to plan file\" << std::endl;\n        std::cerr << \"./repvgg -d  RepVGG-B1g2 // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\")\n    {\n        std::string netName = std::string(argv[2]);\n        IHostMemory *modelStream{nullptr};\n        APIToModel(netName, MAX_BATCH_SIZE, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(netName + \".engine\", std::ios::binary);\n        if (!p)\n        {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char *>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        // Serialization succeeded, so report success to the shell\n        return 0;\n    }\n    else if (std::string(argv[1]) == \"-d\")\n    {\n        std::string netName = std::string(argv[2]);\n        std::ifstream file(netName + \".engine\", std::ios::binary);\n        if (file.good())\n        {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n          
  assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n        else\n        {\n            // Without a readable plan file there is nothing to deserialize\n            std::cerr << \"could not read plan file \" << netName << \".engine\" << std::endl;\n            return -1;\n        }\n    }\n    else\n    {\n        return -1;\n    }\n\n    // Dummy input: every element is set to 1.0\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n\n    IRuntime *runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext *context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference 100 times and print the latency of each run\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 100; i++)\n    {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print the first 10 and the last 10 output values\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[i] << \", \";\n    }\n    std::cout << std::endl;\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[OUTPUT_SIZE - 10 + i] << \", \";\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "resnet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(resnet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nadd_executable(resnet18 ${PROJECT_SOURCE_DIR}/resnet18.cpp)\ntarget_link_libraries(resnet18 nvinfer)\ntarget_link_libraries(resnet18 cudart)\n\nadd_executable(resnet34 ${PROJECT_SOURCE_DIR}/resnet34.cpp)\ntarget_link_libraries(resnet34 nvinfer)\ntarget_link_libraries(resnet34 cudart)\n\nadd_executable(resnet50 ${PROJECT_SOURCE_DIR}/resnet50.cpp)\ntarget_link_libraries(resnet50 nvinfer)\ntarget_link_libraries(resnet50 cudart)\n\nadd_executable(resnext50 ${PROJECT_SOURCE_DIR}/resnext50_32x4d.cpp)\ntarget_link_libraries(resnext50 nvinfer)\ntarget_link_libraries(resnext50 cudart)\n\nadd_executable(wideresnet50 ${PROJECT_SOURCE_DIR}/wideresnet50.cpp)\ntarget_link_libraries(wideresnet50 nvinfer)\ntarget_link_libraries(wideresnet50 cudart)\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "resnet/README.md",
    "content": "# resnet\n\nResNet-18, ResNet-34 and ResNet-50 models from \"Deep Residual Learning for Image Recognition\" <https://arxiv.org/pdf/1512.03385.pdf>\n\nFor the PyTorch implementation, you can refer to [pytorchx/resnet](https://github.com/wang-xinyu/pytorchx/tree/master/resnet)\n\nWide ResNet-50 model from \"Wide Residual Networks\" <https://arxiv.org/pdf/1605.07146.pdf>. For the PyTorch implementation, you can refer to [BlueMirrors/torchtrtz](https://github.com/BlueMirrors/torchtrtz)\n\nNothing special is used in this resnet beyond residual connections and batchnorm:\n\n- Batchnorm layer, implemented with scale layer.\n\n## TensorRT C++ API\n\n```\n// 1a. generate resnet18.wts, resnet34.wts or resnet50.wts from [pytorchx/resnet](https://github.com/wang-xinyu/pytorchx/tree/master/resnet)\n\n// 1b. generate wide_resnet50.wts from [BlueMirrors/torchtrtz](https://github.com/BlueMirrors/torchtrtz)\n\n// 2. put resnet18.wts, resnet34.wts or resnet50.wts into tensorrtx/resnet\n\n// 3. build and run\n\ncd tensorrtx/resnet\n\nmkdir build\n\ncd build\n\ncmake ..\n\nmake\n\nsudo ./resnet18 -s   // serialize model to plan file i.e. 'resnet18.engine'\nsudo ./resnet18 -d   // deserialize plan file and run inference\n\nor\n\nsudo ./resnet34 -s   // serialize model to plan file i.e. 'resnet34.engine'\nsudo ./resnet34 -d   // deserialize plan file and run inference\n\nor\n\nsudo ./resnet50 -s   // serialize model to plan file i.e. 'resnet50.engine'\nsudo ./resnet50 -d   // deserialize plan file and run inference\n\nor\n\nsudo ./resnext50 -s   // serialize model to plan file i.e. 'resnext50.engine'\nsudo ./resnext50 -d   // deserialize plan file and run inference\n\nor\n\n// the CMake target for Wide ResNet-50 is named 'wideresnet50'\nsudo ./wideresnet50 -s   // serialize model to plan file i.e. 'wide_resnet50.engine'\nsudo ./wideresnet50 -d   // deserialize plan file and run inference\n\n\n// 4. 
check that the output is the same as\n- [pytorchx/resnet](https://github.com/wang-xinyu/pytorchx/tree/master/resnet) - for resnet18, resnet34, resnet50, resnext50\n- [BlueMirrors/torchtrtz](https://github.com/BlueMirrors/torchtrtz) - for wide_resnet50\n```\n\n## TensorRT Python API\n\n```\n# 1a. generate resnet50.wts from [pytorchx/resnet](https://github.com/wang-xinyu/pytorchx/tree/master/resnet)\n# 1b. generate wide_resnet50.wts from [BlueMirrors/torchtrtz](https://github.com/BlueMirrors/torchtrtz)\n\n# 2. put resnet50.wts or wide_resnet50.wts into tensorrtx/resnet\n\n# 3. install Python dependencies (tensorrt/pycuda/numpy)\n\ncd tensorrtx/resnet\n\npython resnet50.py -s   # serialize model to plan file i.e. 'resnet50.engine'\npython resnet50.py -d   # deserialize plan file and run inference\n\nor\n\npython wide_resnet50.py -s   # serialize model to plan file i.e. 'wide_resnet50.engine'\npython wide_resnet50.py -d   # deserialize plan file and run inference\n\n# 4. check that the output is the same as\n- pytorchx/resnet - for resnet50\n- BlueMirrors/torchtrtz - for wide_resnet50\n```\n"
  },
  {
    "path": "resnet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "resnet/resnet18.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <cmath>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* 
addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    std::cout << \"len \" << len << std::endl;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIActivationLayer* basicBlock(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{3, 3}, weightMap[lname + \"conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{stride, stride});\n    conv1->setPaddingNd(DimsHW{1, 1});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, 
*conv1->getOutput(0), lname + \"bn1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{3, 3}, weightMap[lname + \"conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setPaddingNd(DimsHW{1, 1});\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    if (inch != outch) {\n        IConvolutionLayer* conv3 = network->addConvolutionNd(input, outch, DimsHW{1, 1}, weightMap[lname + \"downsample.0.weight\"], emptywts);\n        assert(conv3);\n        conv3->setStrideNd(DimsHW{stride, stride});\n        IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn3->getOutput(0), *bn2->getOutput(0), ElementWiseOperation::kSUM);\n    } else {\n        ew1 = network->addElementWise(input, *bn2->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer* relu2 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n    return relu2;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)\n{\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../resnet18.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{7, 7}, weightMap[\"conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{2, 
2});\n    conv1->setPaddingNd(DimsHW{3, 3});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    pool1->setPaddingNd(DimsHW{1, 1});\n\n    IActivationLayer* relu2 = basicBlock(network, weightMap, *pool1->getOutput(0), 64, 64, 1, \"layer1.0.\");\n    IActivationLayer* relu3 = basicBlock(network, weightMap, *relu2->getOutput(0), 64, 64, 1, \"layer1.1.\");\n\n    IActivationLayer* relu4 = basicBlock(network, weightMap, *relu3->getOutput(0), 64, 128, 2, \"layer2.0.\");\n    IActivationLayer* relu5 = basicBlock(network, weightMap, *relu4->getOutput(0), 128, 128, 1, \"layer2.1.\");\n\n    IActivationLayer* relu6 = basicBlock(network, weightMap, *relu5->getOutput(0), 128, 256, 2, \"layer3.0.\");\n    IActivationLayer* relu7 = basicBlock(network, weightMap, *relu6->getOutput(0), 256, 256, 1, \"layer3.1.\");\n\n    IActivationLayer* relu8 = basicBlock(network, weightMap, *relu7->getOutput(0), 256, 512, 2, \"layer4.0.\");\n    IActivationLayer* relu9 = basicBlock(network, weightMap, *relu8->getOutput(0), 512, 512, 1, \"layer4.1.\");\n\n    IPoolingLayer* pool2 = network->addPoolingNd(*relu9->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{1, 1});\n    \n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 1000, weightMap[\"fc.weight\"], weightMap[\"fc.bias\"]);\n    assert(fc1);\n\n    fc1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*fc1->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = 
builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)\n{\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize)\n{\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and 
DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./resnet18 -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./resnet18 -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"resnet18.engine\", std::ios::binary);\n        if (!p)\n        {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 1;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"resnet18.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        return -1;\n    
}\n\n\n    // Subtract mean from image\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 100; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print histogram of the output distribution\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[i] << \", \";\n    }\n    std::cout << std::endl;\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[OUTPUT_SIZE - 10 + i] << \", \";\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "resnet/resnet34.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <cmath>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if(ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while(0)\n\n// stuff we know about the network and the input/output blobs \nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weigths files have a simple space delimited format:\n// [tpyt] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights>weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalis weight map file\");\n\n    while (count--)\n    {\n        Weights wt{ DataType::kFLOAT, nullptr,0 };\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val)* size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n\n}\n\nIScaleLayer* 
addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    std::cout << \"len \" << len << std::endl;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{ DataType::kFLOAT,scval,len };\n\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{ DataType::kFLOAT, shval, len };\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{ DataType::kFLOAT, pval, len };\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIActivationLayer* basicBlock(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ 3,3 }, weightMap[lname + \"conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ stride,stride });\n    conv1->setPaddingNd(DimsHW{ 1,1 });\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, 
*conv1->getOutput(0), lname + \"bn1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{ 3,3 }, weightMap[lname + \"conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setPaddingNd(DimsHW{ 1,1 });\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    if (inch != outch) {\n        IConvolutionLayer* conv3 = network->addConvolutionNd(input, outch, DimsHW{ 1,1 }, weightMap[lname + \"downsample.0.weight\"], emptywts);\n        assert(conv3);\n        conv3->setStrideNd(DimsHW{ stride, stride });\n        IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn3->getOutput(0), *bn2->getOutput(0), ElementWiseOperation::kSUM);\n\n    }else {\n        ew1 = network->addElementWise(input, *bn2->getOutput(0),\n            ElementWiseOperation::kSUM);\n\n    }\n    IActivationLayer* relu2 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n    return relu2;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{ 3,INPUT_H,INPUT_W });\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../resnet34.wts\");\n    Weights emptywts{ DataType::kFLOAT,nullptr,0 };\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{ 7,7 }, weightMap[\"conv1.weight\"], emptywts);\n    assert(conv1);\n    
conv1->setStrideNd(DimsHW{ 2,2 });\n    conv1->setPaddingNd(DimsHW{ 3,3 });\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{ 3,3 });\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{ 2,2 });\n    pool1->setPaddingNd(DimsHW{ 1,1 });\n\n    IActivationLayer* relu2 = basicBlock(network, weightMap, *pool1->getOutput(0), 64, 64, 1, \"layer1.0.\");\n    IActivationLayer* relu3 = basicBlock(network, weightMap, *relu2->getOutput(0), 64, 64, 1, \"layer1.1.\");\n    IActivationLayer* relu4 = basicBlock(network, weightMap, *relu3->getOutput(0), 64, 64, 1, \"layer1.2.\");\n    IActivationLayer* relu5 = basicBlock(network, weightMap, *relu4->getOutput(0), 64, 128, 2, \"layer2.0.\");\n    IActivationLayer* relu6 = basicBlock(network, weightMap, *relu5->getOutput(0), 128, 128, 1, \"layer2.1.\");\n    IActivationLayer* relu7 = basicBlock(network, weightMap, *relu6->getOutput(0), 128, 128, 1, \"layer2.2.\");\n    IActivationLayer* relu8 = basicBlock(network, weightMap, *relu7->getOutput(0), 128, 128, 1, \"layer2.3.\");\n    IActivationLayer* relu9 = basicBlock(network, weightMap, *relu8->getOutput(0), 128, 256, 2, \"layer3.0.\");\n    IActivationLayer* relu10 = basicBlock(network, weightMap, *relu9->getOutput(0), 256, 256, 1, \"layer3.1.\");\n    IActivationLayer* relu11 = basicBlock(network, weightMap, *relu10->getOutput(0), 256, 256, 1, \"layer3.2.\");\n    IActivationLayer* relu12 = basicBlock(network, weightMap, *relu11->getOutput(0), 256, 256, 1, \"layer3.3.\");\n    IActivationLayer* relu13 = basicBlock(network, weightMap, *relu12->getOutput(0), 256, 256, 1, \"layer3.4.\");\n    IActivationLayer* relu14 = basicBlock(network, weightMap, *relu13->getOutput(0), 256, 256, 1, \"layer3.5.\");\n    IActivationLayer* 
relu15 = basicBlock(network, weightMap, *relu14->getOutput(0), 256, 512, 2, \"layer4.0.\");\n    IActivationLayer* relu16 = basicBlock(network, weightMap, *relu15->getOutput(0), 512, 512, 1, \"layer4.1.\");\n    IActivationLayer* relu17 = basicBlock(network, weightMap, *relu16->getOutput(0), 512, 512, 1, \"layer4.2.\");\n    IPoolingLayer* pool2 = network->addPoolingNd(*relu17->getOutput(0), PoolingType::kAVERAGE, DimsHW{ 7,7 });\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{ 1,1 });\n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 1000, weightMap[\"fc.weight\"], weightMap[\"fc.bias\"]);\n    assert(fc1);\n\n    fc1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*fc1->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)\n{\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize)\n{\n    const ICudaEngine& engine = context.getEngine();\n\n  
  // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H* INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./resnet34 -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./resnet34 -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a stream\n    char *trtModelStream{ nullptr };\n    size_t size(0);\n\n    if (std::string(argv[1]) == \"-s\") {\n      
  IHostMemory* modelStream{ nullptr };\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"resnet34.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 1;\n    }else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"resnet34.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    }else {\n        return -1;\n\n    }\n\n    // Fill the input with dummy data (all ones); no real image is loaded\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++) {\n        data[i] = 1.0;\n    }\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 100; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print histogram of the output distribution\n    std::cout << 
\"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[i] << \",\";\n    }\n    std::cout << std::endl;\n    for (unsigned int i = 0; i < 10; i++)\n    {\n    std::cout << prob[OUTPUT_SIZE - 10 + i] << \",\";\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "resnet/resnet50.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <cmath>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* 
addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    std::cout << \"len \" << len << std::endl;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIActivationLayer* bottleneck(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{1, 1}, weightMap[lname + \"conv1.weight\"], emptywts);\n    assert(conv1);\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"bn1\", 1e-5);\n\n    IActivationLayer* relu1 = 
network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{3, 3}, weightMap[lname + \"conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{stride, stride});\n    conv2->setPaddingNd(DimsHW{1, 1});\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n    IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    IConvolutionLayer* conv3 = network->addConvolutionNd(*relu2->getOutput(0), outch * 4, DimsHW{1, 1}, weightMap[lname + \"conv3.weight\"], emptywts);\n    assert(conv3);\n\n    IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"bn3\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    if (stride != 1 || inch != outch * 4) {\n        IConvolutionLayer* conv4 = network->addConvolutionNd(input, outch * 4, DimsHW{1, 1}, weightMap[lname + \"downsample.0.weight\"], emptywts);\n        assert(conv4);\n        conv4->setStrideNd(DimsHW{stride, stride});\n\n        IScaleLayer* bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \"downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn4->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    } else {\n        ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer* relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)\n{\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor* data = 
network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../resnet50.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{7, 7}, weightMap[\"conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{2, 2});\n    conv1->setPaddingNd(DimsHW{3, 3});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\n\n    // Add activation layer using the ReLU algorithm.\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    // Add max pooling layer with kernel size of 3x3, stride of 2x2, and padding of 1x1.\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    pool1->setPaddingNd(DimsHW{1, 1});\n\n    IActivationLayer* x = bottleneck(network, weightMap, *pool1->getOutput(0), 64, 64, 1, \"layer1.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 64, 1, \"layer1.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 64, 1, \"layer1.2.\");\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 128, 2, \"layer2.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.2.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.3.\");\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 256, 2, \"layer3.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.2.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.3.\");\n    x = 
bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.4.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.5.\");\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 512, 2, \"layer4.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"layer4.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"layer4.2.\");\n\n    IPoolingLayer* pool2 = network->addPoolingNd(*x->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{1, 1});\n    \n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 1000, weightMap[\"fc.weight\"], weightMap[\"fc.bias\"]);\n    assert(fc1);\n\n    fc1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*fc1->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)\n{\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid 
doInference(IExecutionContext& context, float* input, float* output, int batchSize)\n{\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./resnet -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./resnet -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly 
and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"resnet50.engine\", std::ios::binary);\n        if (!p)\n        {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 1;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"resnet50.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        return -1;\n    }\n\n\n    // Fill the input with a constant dummy value (no real image is loaded)\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 100; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    
context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print the first 10 and last 10 output values\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[i] << \", \";\n    }\n    std::cout << std::endl;\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[OUTPUT_SIZE - 10 + i] << \", \";\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "resnet/resnet50.py",
    "content": "import argparse\nimport os\nimport struct\nimport sys\n\nimport numpy as np\nimport pycuda.autoinit  # noqa\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nBATCH_SIZE = 1\nINPUT_H = 224\nINPUT_W = 224\nOUTPUT_SIZE = 1000\nINPUT_BLOB_NAME = \"data\"\nOUTPUT_BLOB_NAME = \"prob\"\nEPS = 1e-5\n\nWEIGHT_PATH = \"./resnet50.wts\"\nENGINE_PATH = \"./resnet50.engine\"\n\nTRT_LOGGER = trt.Logger(trt.Logger.INFO)\n\n\ndef load_weights(file):\n    print(f\"Loading weights: {file}\")\n\n    assert os.path.exists(file), 'Unable to load weight file.'\n\n    weight_map = {}\n    with open(file, \"r\") as f:\n        lines = [line.strip() for line in f]\n    count = int(lines[0])\n    assert count == len(lines) - 1\n    for i in range(1, count + 1):\n        splits = lines[i].split(\" \")\n        name = splits[0]\n        cur_count = int(splits[1])\n        assert cur_count + 2 == len(splits)\n        values = []\n        for j in range(2, len(splits)):\n            # hex string to bytes to float\n            values.append(struct.unpack(\">f\", bytes.fromhex(splits[j])))\n        weight_map[name] = np.array(values, dtype=np.float32)\n\n    return weight_map\n\n\ndef addBatchNorm2d(network, weight_map, input, layer_name, eps):\n    gamma = weight_map[layer_name + \".weight\"]\n    beta = weight_map[layer_name + \".bias\"]\n    mean = weight_map[layer_name + \".running_mean\"]\n    var = weight_map[layer_name + \".running_var\"]\n    var = np.sqrt(var + eps)\n\n    scale = gamma / var\n    shift = -mean / var * gamma + beta\n    return network.add_scale(input=input,\n                             mode=trt.ScaleMode.CHANNEL,\n                             shift=shift,\n                             scale=scale)\n\n\ndef bottleneck(network, weight_map, input, in_channels, out_channels, stride,\n               layer_name):\n\n    conv1 = network.add_convolution(input=input,\n                                    num_output_maps=out_channels,\n                       
             kernel_shape=(1, 1),\n                                    kernel=weight_map[layer_name +\n                                                      \"conv1.weight\"],\n                                    bias=trt.Weights())\n    assert conv1\n\n    bn1 = addBatchNorm2d(network, weight_map, conv1.get_output(0),\n                         layer_name + \"bn1\", EPS)\n    assert bn1\n\n    relu1 = network.add_activation(bn1.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu1\n\n    conv2 = network.add_convolution(input=relu1.get_output(0),\n                                    num_output_maps=out_channels,\n                                    kernel_shape=(3, 3),\n                                    kernel=weight_map[layer_name +\n                                                      \"conv2.weight\"],\n                                    bias=trt.Weights())\n    assert conv2\n    conv2.stride = (stride, stride)\n    conv2.padding = (1, 1)\n\n    bn2 = addBatchNorm2d(network, weight_map, conv2.get_output(0),\n                         layer_name + \"bn2\", EPS)\n    assert bn2\n\n    relu2 = network.add_activation(bn2.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu2\n\n    conv3 = network.add_convolution(input=relu2.get_output(0),\n                                    num_output_maps=out_channels * 4,\n                                    kernel_shape=(1, 1),\n                                    kernel=weight_map[layer_name +\n                                                      \"conv3.weight\"],\n                                    bias=trt.Weights())\n    assert conv3\n\n    bn3 = addBatchNorm2d(network, weight_map, conv3.get_output(0),\n                         layer_name + \"bn3\", EPS)\n    assert bn3\n\n    if stride != 1 or in_channels != 4 * out_channels:\n        conv4 = network.add_convolution(\n            input=input,\n            
num_output_maps=out_channels * 4,\n            kernel_shape=(1, 1),\n            kernel=weight_map[layer_name + \"downsample.0.weight\"],\n            bias=trt.Weights())\n        assert conv4\n        conv4.stride = (stride, stride)\n\n        bn4 = addBatchNorm2d(network, weight_map, conv4.get_output(0),\n                             layer_name + \"downsample.1\", EPS)\n        assert bn4\n\n        ew1 = network.add_elementwise(bn4.get_output(0), bn3.get_output(0),\n                                      trt.ElementWiseOperation.SUM)\n    else:\n        ew1 = network.add_elementwise(input, bn3.get_output(0),\n                                      trt.ElementWiseOperation.SUM)\n    assert ew1\n\n    relu3 = network.add_activation(ew1.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu3\n\n    return relu3\n\n\ndef create_engine(maxBatchSize, builder, config, dt):\n    weight_map = load_weights(WEIGHT_PATH)\n    network = builder.create_network()\n\n    data = network.add_input(INPUT_BLOB_NAME, dt, (3, INPUT_H, INPUT_W))\n    assert data\n\n    conv1 = network.add_convolution(input=data,\n                                    num_output_maps=64,\n                                    kernel_shape=(7, 7),\n                                    kernel=weight_map[\"conv1.weight\"],\n                                    bias=trt.Weights())\n    assert conv1\n    conv1.stride = (2, 2)\n    conv1.padding = (3, 3)\n\n    bn1 = addBatchNorm2d(network, weight_map, conv1.get_output(0), \"bn1\", EPS)\n    assert bn1\n\n    relu1 = network.add_activation(bn1.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu1\n\n    pool1 = network.add_pooling(input=relu1.get_output(0),\n                                window_size=trt.DimsHW(3, 3),\n                                type=trt.PoolingType.MAX)\n    assert pool1\n    pool1.stride = (2, 2)\n    pool1.padding = (1, 1)\n\n    x = bottleneck(network, 
weight_map, pool1.get_output(0), 64, 64, 1,\n                   \"layer1.0.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 256, 64, 1,\n                   \"layer1.1.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 256, 64, 1,\n                   \"layer1.2.\")\n\n    x = bottleneck(network, weight_map, x.get_output(0), 256, 128, 2,\n                   \"layer2.0.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 128, 1,\n                   \"layer2.1.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 128, 1,\n                   \"layer2.2.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 128, 1,\n                   \"layer2.3.\")\n\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 256, 2,\n                   \"layer3.0.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 256, 1,\n                   \"layer3.1.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 256, 1,\n                   \"layer3.2.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 256, 1,\n                   \"layer3.3.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 256, 1,\n                   \"layer3.4.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 256, 1,\n                   \"layer3.5.\")\n\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 512, 2,\n                   \"layer4.0.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 2048, 512, 1,\n                   \"layer4.1.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 2048, 512, 1,\n                   \"layer4.2.\")\n\n    pool2 = network.add_pooling(x.get_output(0),\n                                window_size=trt.DimsHW(7, 7),\n                                type=trt.PoolingType.AVERAGE)\n    assert pool2\n    pool2.stride = (1, 1)\n\n    fc1 = network.add_fully_connected(input=pool2.get_output(0),\n                           
           num_outputs=OUTPUT_SIZE,\n                                      kernel=weight_map['fc.weight'],\n                                      bias=weight_map['fc.bias'])\n    assert fc1\n\n    fc1.get_output(0).name = OUTPUT_BLOB_NAME\n    network.mark_output(fc1.get_output(0))\n\n    # Build engine\n    builder.max_batch_size = maxBatchSize\n    builder.max_workspace_size = 1 << 20\n    engine = builder.build_engine(network, config)\n\n    del network\n    del weight_map\n\n    return engine\n\n\ndef APIToModel(maxBatchSize):\n    builder = trt.Builder(TRT_LOGGER)\n    config = builder.create_builder_config()\n    engine = create_engine(maxBatchSize, builder, config, trt.float32)\n    assert engine\n    with open(ENGINE_PATH, \"wb\") as f:\n        f.write(engine.serialize())\n\n    del engine\n    del builder\n\n\ndef doInference(context, host_in, host_out, batchSize):\n    engine = context.engine\n    assert engine.num_bindings == 2\n\n    device_in = cuda.mem_alloc(host_in.nbytes)\n    device_out = cuda.mem_alloc(host_out.nbytes)\n    bindings = [int(device_in), int(device_out)]\n    stream = cuda.Stream()\n\n    cuda.memcpy_htod_async(device_in, host_in, stream)\n    context.execute_async(bindings=bindings, stream_handle=stream.handle)\n    cuda.memcpy_dtoh_async(host_out, device_out, stream)\n    stream.synchronize()\n\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"-s\", action='store_true')\n    parser.add_argument(\"-d\", action='store_true')\n    args = parser.parse_args()\n\n    if not (args.s ^ args.d):\n        print(\n            \"arguments not right!\\n\"\n            \"python resnet50.py -s   # serialize model to plan file\\n\"\n            \"python resnet50.py -d   # deserialize plan file and run inference\"\n        )\n        sys.exit()\n\n    if args.s:\n        APIToModel(BATCH_SIZE)\n    else:\n        runtime = trt.Runtime(TRT_LOGGER)\n        assert runtime\n\n        with 
open(ENGINE_PATH, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        assert engine\n\n        context = engine.create_execution_context()\n        assert context\n\n        data = np.ones((BATCH_SIZE * 3 * INPUT_H * INPUT_W), dtype=np.float32)\n        host_in = cuda.pagelocked_empty(BATCH_SIZE * 3 * INPUT_H * INPUT_W,\n                                        dtype=np.float32)\n        np.copyto(host_in, data.ravel())\n        host_out = cuda.pagelocked_empty(OUTPUT_SIZE, dtype=np.float32)\n\n        doInference(context, host_in, host_out, BATCH_SIZE)\n\n        print(f'Output: \\n{host_out[:10]}\\n{host_out[-10:]}')\n"
  },
  {
    "path": "resnet/resnext50_32x4d.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <cmath>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* 
addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    std::cout << \"len \" << len << std::endl;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIActivationLayer* bottleneck(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    int groups = 32;\n    int width = outch * 4 / 64 * 32;\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, width, DimsHW{1, 1}, weightMap[lname + \"conv1.weight\"], emptywts);\n    assert(conv1);\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"bn1\", 
1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), width, DimsHW{3, 3}, weightMap[lname + \"conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{stride, stride});\n    conv2->setPaddingNd(DimsHW{1, 1});\n    conv2->setNbGroups(groups);\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n    IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    IConvolutionLayer* conv3 = network->addConvolutionNd(*relu2->getOutput(0), outch * 4, DimsHW{1, 1}, weightMap[lname + \"conv3.weight\"], emptywts);\n    assert(conv3);\n\n    IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"bn3\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    if (stride != 1 || inch != outch * 4) {\n        IConvolutionLayer* conv4 = network->addConvolutionNd(input, outch * 4, DimsHW{1, 1}, weightMap[lname + \"downsample.0.weight\"], emptywts);\n        assert(conv4);\n        conv4->setStrideNd(DimsHW{stride, stride});\n\n        IScaleLayer* bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \"downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn4->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    } else {\n        ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer* relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)\n{\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, 
INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../resnext50.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{7, 7}, weightMap[\"conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{2, 2});\n    conv1->setPaddingNd(DimsHW{3, 3});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\n\n    // Add activation layer using the ReLU algorithm.\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    pool1->setPaddingNd(DimsHW{1, 1});\n\n    IActivationLayer* x = bottleneck(network, weightMap, *pool1->getOutput(0), 64, 64, 1, \"layer1.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 64, 1, \"layer1.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 64, 1, \"layer1.2.\");\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 128, 2, \"layer2.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.2.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.3.\");\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 256, 2, \"layer3.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.2.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.3.\");\n    x = 
bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.4.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.5.\");\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 512, 2, \"layer4.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"layer4.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"layer4.2.\");\n\n    IPoolingLayer* pool2 = network->addPoolingNd(*x->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{1, 1});\n    \n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 1000, weightMap[\"fc.weight\"], weightMap[\"fc.bias\"]);\n    assert(fc1);\n\n    fc1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*fc1->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)\n{\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid 
doInference(IExecutionContext& context, float* input, float* output, int batchSize)\n{\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./resnext -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./resnext -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API 
directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"resnext50.engine\", std::ios::binary);\n        if (!p)\n        {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        // Serialization succeeded, so report success to the caller\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"resnext50.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        } else {\n            std::cerr << \"could not read plan file resnext50.engine\" << std::endl;\n            return -1;\n        }\n    } else {\n        return -1;\n    }\n\n\n    // Fill the input with ones as dummy data (no image is loaded or preprocessed here)\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference 100 times and print the latency of each run\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 100; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    
context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print histogram of the output distribution\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[i] << \", \";\n    }\n    std::cout << std::endl;\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[OUTPUT_SIZE - 10 + i] << \", \";\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "resnet/wide_resnet50.py",
    "content": "import os\nimport sys\nimport struct\nimport argparse\n\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nBATCH_SIZE = 1\nINPUT_H = 224\nINPUT_W = 224\nOUTPUT_SIZE = 1000\nBS = 1\nINPUT_BLOB_NAME = \"data\"\nOUTPUT_BLOB_NAME = \"prob\"\nEPS = 1e-5\n\nWEIGHT_PATH = \"./wide_resnet50.wts\"\nENGINE_PATH = \"./wide_resnet50.engine\"\n\nTRT_LOGGER = trt.Logger(trt.Logger.INFO)\n\n\ndef load_weights(file):\n    print(f\"Loading weights: {file}\")\n\n    assert os.path.exists(file), 'Unable to load weight file.'\n\n    weight_map = {}\n    with open(file, \"r\") as f:\n        lines = [line.strip() for line in f]\n    count = int(lines[0])\n    assert count == len(lines) - 1\n    for i in range(1, count + 1):\n        splits = lines[i].split(\" \")\n        name = splits[0]\n        cur_count = int(splits[1])\n        assert cur_count + 2 == len(splits)\n        values = []\n        for j in range(2, len(splits)):\n            # hex string to bytes to float\n            values.append(struct.unpack(\">f\", bytes.fromhex(splits[j])))\n        weight_map[name] = np.array(values, dtype=np.float32)\n\n    return weight_map\n\n\ndef addBatchNorm2d(network, weight_map, inputs, layer_name, eps):\n    gamma = weight_map[layer_name + \".weight\"]\n    beta = weight_map[layer_name + \".bias\"]\n    mean = weight_map[layer_name + \".running_mean\"]\n    var = weight_map[layer_name + \".running_var\"]\n    print(layer_name + \" \" +  str(len(weight_map[layer_name + \".running_var\"])))\n    var = np.sqrt(var + eps)\n\n    scale = gamma / var\n    shift = -mean / var * gamma + beta\n    return network.add_scale(input=inputs,\n                             mode=trt.ScaleMode.CHANNEL,\n                             shift=shift,\n                             scale=scale)\n\n\ndef bottleneck(network, weight_map, input, in_channels, out_channels, stride, layer_name):\n    # empty weights for bias\n    emptywts = 
trt.Weights()\n\n    conv1 = network.add_convolution(input=input,\n                                    num_output_maps=out_channels,\n                                    kernel_shape=(1, 1),\n                                    kernel=weight_map[layer_name + \"conv1.weight\"],\n                                    bias=emptywts)\n    assert conv1\n\n    bn1 = addBatchNorm2d(network, weight_map, conv1.get_output(0), layer_name + \"bn1\", EPS)\n    assert bn1\n\n    relu1 = network.add_activation(bn1.get_output(0), type=trt.ActivationType.RELU)\n    assert relu1\n\n    conv2 = network.add_convolution(input=relu1.get_output(0),\n                                    num_output_maps=out_channels,\n                                    kernel_shape=(3, 3),\n                                    kernel=weight_map[layer_name + \"conv2.weight\"],\n                                    bias=emptywts)\n    assert conv2\n    conv2.stride = (stride, stride)\n    conv2.padding = (1, 1)\n\n    bn2 = addBatchNorm2d(network, weight_map, conv2.get_output(0),\n                         layer_name + \"bn2\", EPS)\n    assert bn2\n\n    relu2 = network.add_activation(bn2.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu2\n\n    conv3 = network.add_convolution(input=relu2.get_output(0),\n                                    num_output_maps=out_channels * 2,\n                                    kernel_shape=(1, 1),\n                                    kernel=weight_map[layer_name + \"conv3.weight\"],\n                                    bias=emptywts)\n    assert conv3\n\n    bn3 = addBatchNorm2d(network, weight_map, conv3.get_output(0), layer_name + \"bn3\", EPS)\n    assert bn3\n\n    if stride != 1 or in_channels != 2 * out_channels:\n        conv4 = network.add_convolution(\n            input=input,\n            num_output_maps=out_channels * 2,\n            kernel_shape=(1, 1),\n            kernel=weight_map[layer_name + 
\"downsample.0.weight\"],\n            bias=emptywts)\n        assert conv4\n        conv4.stride = (stride, stride)\n\n        bn4 = addBatchNorm2d(network, weight_map, conv4.get_output(0), layer_name + \"downsample.1\", EPS)\n        assert bn4\n\n        ew1 = network.add_elementwise(bn4.get_output(0), bn3.get_output(0),\n                                      trt.ElementWiseOperation.SUM)\n    else:\n        ew1 = network.add_elementwise(input, bn3.get_output(0), trt.ElementWiseOperation.SUM)\n    assert ew1\n\n    relu3 = network.add_activation(ew1.get_output(0), type=trt.ActivationType.RELU)\n    assert relu3\n\n    return relu3\n\n\ndef create_engine(maxBatchSize, builder, config, dt):\n    weight_map = load_weights(WEIGHT_PATH)\n    network = builder.create_network()\n\n    data = network.add_input(INPUT_BLOB_NAME, dt, (3, INPUT_H, INPUT_W))\n    assert data\n\n    # empty weights for bias\n    emptywts = trt.Weights()\n\n    conv1 = network.add_convolution(input=data,\n                                    num_output_maps=64,\n                                    kernel_shape=(7, 7),\n                                    kernel=weight_map[\"conv1.weight\"],\n                                    bias=emptywts)\n    assert conv1\n    conv1.stride = (2, 2)\n    conv1.padding = (3, 3)\n\n    bn1 = addBatchNorm2d(network, weight_map, conv1.get_output(0), \"bn1\", EPS)\n    assert bn1\n\n    relu1 = network.add_activation(bn1.get_output(0), type=trt.ActivationType.RELU)\n    assert relu1\n\n    pool1 = network.add_pooling(input=relu1.get_output(0),\n                                window_size=trt.DimsHW(3, 3),\n                                type=trt.PoolingType.MAX)\n    assert pool1\n    pool1.stride = (2, 2)\n    pool1.padding = (1, 1)\n\n    x = bottleneck(network, weight_map, pool1.get_output(0), 64, 128, 1, \"layer1.0.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 256, 128, 1, \"layer1.1.\")\n    x = bottleneck(network, weight_map, 
x.get_output(0), 256, 128, 1, \"layer1.2.\")\n\n    x = bottleneck(network, weight_map, x.get_output(0), 256, 256, 2, \"layer2.0.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 256, 1, \"layer2.1.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 256, 1, \"layer2.2.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 256, 1, \"layer2.3.\")\n\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 512, 2, \"layer3.0.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 512, 1, \"layer3.1.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 512, 1, \"layer3.2.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 512, 1, \"layer3.3.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 512, 1, \"layer3.4.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 512, 1, \"layer3.5.\")\n\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 1024, 2, \"layer4.0.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 2048, 1024, 1, \"layer4.1.\")\n    x = bottleneck(network, weight_map, x.get_output(0), 2048, 1024, 1, \"layer4.2.\")\n\n    pool2 = network.add_pooling(x.get_output(0),\n                                window_size=trt.DimsHW(7, 7),\n                                type=trt.PoolingType.AVERAGE)\n    assert pool2\n    pool2.stride = (1, 1)\n\n    fc1 = network.add_fully_connected(input=pool2.get_output(0),\n                                      num_outputs=OUTPUT_SIZE,\n                                      kernel=weight_map['fc.weight'],\n                                      bias=weight_map['fc.bias'])\n    assert fc1\n\n    fc1.get_output(0).name = OUTPUT_BLOB_NAME\n    network.mark_output(fc1.get_output(0))\n\n    # Build engine\n    builder.max_batch_size = maxBatchSize\n    builder.max_workspace_size = 1 << 20\n    engine = builder.build_engine(network, config)\n    print(\"build out\")\n    del network\n    del 
weight_map\n\n    return engine\n\n\ndef APIToModel(maxBatchSize):\n    builder = trt.Builder(TRT_LOGGER)\n    config = builder.create_builder_config()\n    engine = create_engine(maxBatchSize, builder, config, trt.float32)\n    assert engine\n    with open(ENGINE_PATH, \"wb\") as f:\n        f.write(engine.serialize())\n\n    del engine\n    del builder\n\n\ndef doInference(context, host_in, host_out, batchSize):\n    engine = context.engine\n    assert engine.num_bindings == 2\n\n    # Allocate device buffers matching the host input and output buffers\n    device_in = cuda.mem_alloc(host_in.nbytes)\n    device_out = cuda.mem_alloc(host_out.nbytes)\n    bindings = [int(device_in), int(device_out)]\n    stream = cuda.Stream()\n\n    # Copy input to the device, run inference, and copy the output back\n    cuda.memcpy_htod_async(device_in, host_in, stream)\n    context.execute_async(bindings=bindings, stream_handle=stream.handle)\n    cuda.memcpy_dtoh_async(host_out, device_out, stream)\n    stream.synchronize()\n\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"-s\", action='store_true')\n    parser.add_argument(\"-d\", action='store_true')\n    args = parser.parse_args()\n\n    if not (args.s ^ args.d):\n        print(\n            \"arguments not right!\\n\"\n            \"python wide_resnet50.py -s   # serialize model to plan file\\n\"\n            \"python wide_resnet50.py -d   # deserialize plan file and run inference\"\n        )\n        sys.exit()\n\n    if args.s:\n        APIToModel(BATCH_SIZE)\n    else:\n        runtime = trt.Runtime(TRT_LOGGER)\n        assert runtime\n\n        with open(ENGINE_PATH, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        assert engine\n\n        context = engine.create_execution_context()\n        assert context\n\n        data = np.ones((BATCH_SIZE * 3 * INPUT_H * INPUT_W), dtype=np.float32)\n        host_in = cuda.pagelocked_empty(BATCH_SIZE * 3 * INPUT_H * INPUT_W,\n                                        dtype=np.float32)\n        np.copyto(host_in, data.ravel())\n        host_out 
= cuda.pagelocked_empty(OUTPUT_SIZE, dtype=np.float32)\n\n        doInference(context, host_in, host_out, BATCH_SIZE)\n\n        print(f'Output: \\n{host_out[:10]}\\n{host_out[-10:]}')\n"
  },
  {
    "path": "resnet/wideresnet50.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <cmath>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* 
addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    std::cout << \"len \" << len << std::endl;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIActivationLayer* bottleneck(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{1, 1}, weightMap[lname + \"conv1.weight\"], emptywts);\n    assert(conv1);\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"bn1\", 1e-5);\n\n    IActivationLayer* relu1 = 
network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{3, 3}, weightMap[lname + \"conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{stride, stride});\n    conv2->setPaddingNd(DimsHW{1, 1});\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n    IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    IConvolutionLayer* conv3 = network->addConvolutionNd(*relu2->getOutput(0), outch * 2, DimsHW{1, 1}, weightMap[lname + \"conv3.weight\"], emptywts);\n    assert(conv3);\n\n    IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"bn3\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    if (stride != 1 || inch != outch * 2) {\n        IConvolutionLayer* conv4 = network->addConvolutionNd(input, outch * 2, DimsHW{1, 1}, weightMap[lname + \"downsample.0.weight\"], emptywts);\n        assert(conv4);\n        conv4->setStrideNd(DimsHW{stride, stride});\n\n        IScaleLayer* bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \"downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn4->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    } else {\n        ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer* relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor* data = 
network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../wideresnet50.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{7, 7}, weightMap[\"conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{2, 2});\n    conv1->setPaddingNd(DimsHW{3, 3});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\n\n    // Add a ReLU activation layer.\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    // Add a max pooling layer with a 3x3 kernel, 2x2 stride and 1x1 padding.\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    pool1->setPaddingNd(DimsHW{1, 1});\n\n    IActivationLayer* x = bottleneck(network, weightMap, *pool1->getOutput(0), 64, 128, 1, \"layer1.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 128, 1, \"layer1.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 128, 1, \"layer1.2.\");\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 256, 2, \"layer2.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 256, 1, \"layer2.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 256, 1, \"layer2.2.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 256, 1, \"layer2.3.\");\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 512, 2, \"layer3.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 512, 1, \"layer3.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 512, 1, \"layer3.2.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 512, 1, \"layer3.3.\");\n    x 
= bottleneck(network, weightMap, *x->getOutput(0), 1024, 512, 1, \"layer3.4.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 512, 1, \"layer3.5.\");\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 1024, 2, \"layer4.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 1024, 1, \"layer4.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 1024, 1, \"layer4.2.\");\n\n    IPoolingLayer* pool2 = network->addPoolingNd(*x->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{1, 1});\n\n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 1000, weightMap[\"fc.weight\"], weightMap[\"fc.bias\"]);\n    assert(fc1);\n\n    fc1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*fc1->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)\n{\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid 
doInference(IExecutionContext& context, float* input, float* output, int batchSize)\n{\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./wideresnet -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./wideresnet -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API 
directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"wideresnet50.engine\", std::ios::binary);\n        if (!p)\n        {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        // Serialization succeeded, so report success to the caller\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"wideresnet50.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        } else {\n            std::cerr << \"could not read plan file wideresnet50.engine\" << std::endl;\n            return -1;\n        }\n    } else {\n        return -1;\n    }\n\n\n    // Fill the input with ones as dummy data (no image is loaded or preprocessed here)\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference 100 times and print the latency of each run\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 100; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the 
engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print histogram of the output distribution\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[i] << \", \";\n    }\n    std::cout << std::endl;\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[OUTPUT_SIZE - 10 + i] << \", \";\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "retinaface/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(retinaface)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\nif (CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n    message(\"embed_platform on\")\n    include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n    link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n    message(\"embed_platform off\")\n    include_directories(/usr/local/cuda/include)\n    link_directories(/usr/local/cuda/lib64)\nendif()\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\ncuda_add_library(decodeplugin SHARED ${PROJECT_SOURCE_DIR}/decode.cu)\ntarget_link_libraries(decodeplugin nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(retina_r50 ${PROJECT_SOURCE_DIR}/calibrator.cpp ${PROJECT_SOURCE_DIR}/retina_r50.cpp)\ntarget_link_libraries(retina_r50 nvinfer)\ntarget_link_libraries(retina_r50 cudart)\ntarget_link_libraries(retina_r50 decodeplugin)\ntarget_link_libraries(retina_r50 ${OpenCV_LIBRARIES})\n\nadd_executable(retina_mnet ${PROJECT_SOURCE_DIR}/calibrator.cpp ${PROJECT_SOURCE_DIR}/retina_mnet.cpp)\ntarget_link_libraries(retina_mnet nvinfer)\ntarget_link_libraries(retina_mnet cudart)\ntarget_link_libraries(retina_mnet decodeplugin)\ntarget_link_libraries(retina_mnet ${OpenCV_LIBRARIES})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "retinaface/README.md",
    "content": "# RetinaFace\n\n The pytorch implementation is [biubug6/Pytorch_Retinaface](https://github.com/biubug6/Pytorch_Retinaface), I forked it into \n[wang-xinyu/Pytorch_Retinaface](https://github.com/wang-xinyu/Pytorch_Retinaface) and add genwts.py\n\nThis branch is using TensorRT 7 API, branch [trt4->retinaface](https://github.com/wang-xinyu/tensorrtx/tree/trt4/retinaface) is using TensorRT 4.\n\n## Config\n\n- Input shape `INPUT_H`, `INPUT_W` defined in `decode.h`\n- INT8/FP16/FP32 can be selected by the macro `USE_FP16` or `USE_INT8` or `USE_FP32` in `retina_r50.cpp`\n- GPU id can be selected by the macro `DEVICE` in `retina_r50.cpp`\n- Batchsize can be selected by the macro `BATCHSIZE` in `retina_r50.cpp`\n\n## Run\n\nThe following described how to run `retina_r50`. While `retina_mnet` is nearly the same, just generate `retinaface.wts` with `mobilenet0.25_Final.pth` and run `retina_mnet`.\n\n1. generate retinaface.wts from pytorch implementation https://github.com/wang-xinyu/Pytorch_Retinaface\n\n```\ngit clone https://github.com/wang-xinyu/Pytorch_Retinaface.git\n// download its weights 'Resnet50_Final.pth', put it in Pytorch_Retinaface/weights\ncd Pytorch_Retinaface\npython detect.py --save_model\npython genwts.py\n// a file 'retinaface.wts' will be generated.\n```\n\n2. put retinaface.wts into tensorrtx/retinaface, build and run\n\n```\ngit clone https://github.com/wang-xinyu/tensorrtx.git\ncd tensorrtx/retinaface\n// put retinaface.wts here\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./retina_r50 -s  // build and serialize model to file i.e. 'retina_r50.engine'\nwget https://github.com/Tencent/FaceDetection-DSFD/raw/master/data/worlds-largest-selfie.jpg\nsudo ./retina_r50 -d  // deserialize model file and run inference.\n```\n\n3. check the images generated, as follows. 0_result.jpg\n\n4. 
A Python wrapper is also provided\n\n```\n// install python-tensorrt, pycuda, etc.\n// ensure that retina_r50.engine and libdecodeplugin.so have been built\npython retinaface_trt.py\n```\n\n## INT8 Quantization\n\n1. Prepare calibration images. You can randomly select about 1000 images from your training set. For widerface, you can also download my calibration images `widerface_calib` from [GoogleDrive](https://drive.google.com/drive/folders/1s7jE9DtOngZMzJC1uL307J2MiaGwdRSI?usp=sharing) or [BaiduPan](https://pan.baidu.com/s/1GOm_-JobpyLMAqZWCDUhKg) pwd: a9wh\n\n2. Unzip the archive in retinaface/build\n\n3. Set the macro `USE_INT8` in retina_r50.cpp and run make\n\n4. Serialize the model and test it\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/78901890-9077fb80-7aab-11ea-94f1-237f51fcc347.jpg\">\n</p>\n\n## More Information\n\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx).\n\n"
  },
  {
    "path": "retinaface/calibrator.cpp",
    "content": "#include <iostream>\n#include <iterator>\n#include <fstream>\n#include <opencv2/dnn/dnn.hpp>\n#include \"calibrator.h\"\n#include \"cuda_runtime_api.h\"\n#include \"common.hpp\"\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache)\n    : batchsize_(batchsize)\n    , input_w_(input_w)\n    , input_h_(input_h)\n    , img_idx_(0)\n    , img_dir_(img_dir)\n    , calib_table_name_(calib_table_name)\n    , input_blob_name_(input_blob_name)\n    , read_cache_(read_cache)\n{\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2()\n{\n    CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT\n{\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT\n{\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + img_files_[i]);\n        if (temp.empty()){\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(pr_img);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0, cv::Size(input_w_, input_h_), cv::Scalar(104, 117, 123), false, false);\n\n    CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], 
input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT\n{\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good())\n    {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT\n{\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n\n"
  },
  {
    "path": "retinaface/calibrator.h",
    "content": "#ifndef ENTROPY_CALIBRATOR_H\n#define ENTROPY_CALIBRATOR_H\n\n#include \"NvInfer.h\"\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2\n{\npublic:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache = true);\n\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\nprivate:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\n#endif // ENTROPY_CALIBRATOR_H\n"
  },
  {
    "path": "retinaface/common.hpp",
    "content": "#ifndef RETINAFACE_COMMON_H_\n#define RETINAFACE_COMMON_H_\n#include <opencv2/opencv.hpp>\n#include <dirent.h>\n#include \"NvInfer.h\"\n#include \"decode.h\"\n\nusing namespace nvinfer1;\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols*1.0);\n    float r_h = input_h / (img.rows*1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nstatic inline int read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\nstatic inline cv::Rect get_rect_adapt_landmark(cv::Mat& img, int input_w, int input_h, float bbox[4], float lmk[10]) {\n    int l, r, t, b;\n    float r_w = input_w / (img.cols * 
1.0);\n    float r_h = input_h / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] / r_w;\n        r = bbox[2] / r_w;\n        t = (bbox[1] - (input_h - r_w * img.rows) / 2) / r_w;\n        b = (bbox[3] - (input_h - r_w * img.rows) / 2) / r_w;\n        for (int i = 0; i < 10; i += 2) {\n            lmk[i] /= r_w;\n            lmk[i + 1] = (lmk[i + 1] - (input_h - r_w * img.rows) / 2) / r_w;\n        }\n    } else {\n        l = (bbox[0] - (input_w - r_h * img.cols) / 2) / r_h;\n        r = (bbox[2] - (input_w - r_h * img.cols) / 2) / r_h;\n        t = bbox[1] / r_h;\n        b = bbox[3] / r_h;\n        for (int i = 0; i < 10; i += 2) {\n            lmk[i] = (lmk[i] - (input_w - r_h * img.cols) / 2) / r_h;\n            lmk[i + 1] /= r_h;\n        }\n    }\n    return cv::Rect(l, t, r-l, b-t);\n}\n\nstatic float iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n        std::max(lbox[0], rbox[0]), //left\n        std::min(lbox[2], rbox[2]), //right\n        std::max(lbox[1], rbox[1]), //top\n        std::min(lbox[3], rbox[3]), //bottom\n    };\n\n    if(interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS = (interBox[1] - interBox[0]) * (interBox[3] - interBox[2]);\n    return interBoxS / ((lbox[2] - lbox[0]) * (lbox[3] - lbox[1]) + (rbox[2] - rbox[0]) * (rbox[3] - rbox[1]) -interBoxS + 0.000001f);\n}\n\nstatic bool cmp(const decodeplugin::Detection& a, const decodeplugin::Detection& b) {\n    return a.class_confidence > b.class_confidence;\n}\n\nstatic inline void nms(std::vector<decodeplugin::Detection>& res, float *output, float nms_thresh = 0.4) {\n    std::vector<decodeplugin::Detection> dets;\n    for (int i = 0; i < output[0]; i++) {\n        if (output[15 * i + 1 + 4] <= 0.1) continue;\n        decodeplugin::Detection det;\n        memcpy(&det, &output[15 * i + 1], sizeof(decodeplugin::Detection));\n        dets.push_back(det);\n    }\n    std::sort(dets.begin(), dets.end(), cmp);\n    for 
(size_t m = 0; m < dets.size(); ++m) {\n        auto& item = dets[m];\n        res.push_back(item);\n        //std::cout << item.class_confidence << \" bbox \" << item.bbox[0] << \", \" << item.bbox[1] << \", \" << item.bbox[2] << \", \" << item.bbox[3] << std::endl;\n        for (size_t n = m + 1; n < dets.size(); ++n) {\n            if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                dets.erase(dets.begin()+n);\n                --n;\n            }\n        }\n    }\n}\n\n// Load weights from files\n// TensorRT weight files have a simple space delimited format:\n// [name] [size] <data x size in hex>\nstatic inline std::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob: one 32-bit word per value (sizeof(uint32_t), not the size of the pointer)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nstatic inline Weights getWeights(std::map<std::string, Weights>& weightMap, std::string key) {\n    if (weightMap.count(key) != 1) {\n        std::cerr << key << \" not existed in weight map, fatal error!!!\" << std::endl;\n        exit(-1);\n    }\n    return weightMap[key];\n}\n\nstatic inline IScaleLayer* 
addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\n#endif\n\n"
  },
  {
    "path": "retinaface/decode.cu",
    "content": "#include \"decode.h\"\n#include \"stdio.h\"\n\nnamespace nvinfer1\n{\n    DecodePlugin::DecodePlugin()\n    {\n    }\n\n    DecodePlugin::~DecodePlugin()\n    {\n    }\n\n    // create the plugin at runtime from a byte stream\n    DecodePlugin::DecodePlugin(const void* data, size_t length)\n    {\n    }\n\n    void DecodePlugin::serialize(void* buffer) const TRT_NOEXCEPT\n    {\n    }\n\n    size_t DecodePlugin::getSerializationSize() const TRT_NOEXCEPT\n    {\n        return 0;\n    }\n\n    int DecodePlugin::initialize() TRT_NOEXCEPT\n    { \n        return 0;\n    }\n\n    Dims DecodePlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT\n    {\n        //output the result to channel\n        int totalCount = 0;\n        totalCount += decodeplugin::INPUT_H / 8 * decodeplugin::INPUT_W / 8 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);\n        totalCount += decodeplugin::INPUT_H / 16 * decodeplugin::INPUT_W / 16 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);\n        totalCount += decodeplugin::INPUT_H / 32 * decodeplugin::INPUT_W / 32 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);\n\n        return Dims3(totalCount + 1, 1, 1);\n    }\n\n    // Set plugin namespace\n    void DecodePlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT\n    {\n        mPluginNamespace = pluginNamespace;\n    }\n\n    const char* DecodePlugin::getPluginNamespace() const TRT_NOEXCEPT\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType DecodePlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT\n    {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool DecodePlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT\n    {\n        
return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool DecodePlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    void DecodePlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT\n    {\n    }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void DecodePlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void DecodePlugin::detachFromContext() TRT_NOEXCEPT {}\n\n    const char* DecodePlugin::getPluginType() const TRT_NOEXCEPT\n    {\n        return \"Decode_TRT\";\n    }\n\n    const char* DecodePlugin::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    void DecodePlugin::destroy() TRT_NOEXCEPT\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    IPluginV2IOExt* DecodePlugin::clone() const TRT_NOEXCEPT\n    {\n        DecodePlugin *p = new DecodePlugin();\n        p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float Logist(float data){ return 1./(1. 
+ expf(-data)); };\n\n    __global__ void CalDetection(const float *input, float *output, int num_elem, int step, int anchor, int output_elem) {\n\n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= num_elem) return;\n\n        int h = decodeplugin::INPUT_H / step;\n        int w = decodeplugin::INPUT_W / step;\n        int total_grid = h * w;\n        int bn_idx = idx / total_grid;\n        idx = idx - bn_idx * total_grid;\n        int y = idx / w;\n        int x = idx % w;\n        const float* cur_input = input + bn_idx * (4 + 2 + 10) * 2 * total_grid;\n        const float *bbox_reg = &cur_input[0];\n        const float *cls_reg = &cur_input[2 * 4 * total_grid];\n        const float *lmk_reg = &cur_input[2 * 4 * total_grid + 2 * 2 * total_grid];\n\n        for (int k = 0; k < 2; ++k) {\n            float conf1 = cls_reg[idx + k * total_grid * 2];\n            float conf2 = cls_reg[idx + k * total_grid * 2 + total_grid];\n            conf2 = expf(conf2) / (expf(conf1) + expf(conf2));\n            if (conf2 <= 0.02) continue;\n\n            float *res_count = output + bn_idx * output_elem;\n            int count = (int)atomicAdd(res_count, 1);\n            char* data = (char *)res_count + sizeof(float) + count * sizeof(decodeplugin::Detection);\n            decodeplugin::Detection* det = (decodeplugin::Detection*)(data);\n\n            float prior[4];\n            prior[0] = ((float)x + 0.5) / w;\n            prior[1] = ((float)y + 0.5) / h;\n            prior[2] = (float)anchor * (k + 1) / decodeplugin::INPUT_W;\n            prior[3] = (float)anchor * (k + 1) / decodeplugin::INPUT_H;\n\n            //Location\n            det->bbox[0] = prior[0] + bbox_reg[idx + k * total_grid * 4] * 0.1 * prior[2];\n            det->bbox[1] = prior[1] + bbox_reg[idx + k * total_grid * 4 + total_grid] * 0.1 * prior[3];\n            det->bbox[2] = prior[2] * expf(bbox_reg[idx + k * total_grid * 4 + total_grid * 2] * 0.2);\n            det->bbox[3] = 
prior[3] * expf(bbox_reg[idx + k * total_grid * 4 + total_grid * 3] * 0.2);\n            det->bbox[0] -= det->bbox[2] / 2;\n            det->bbox[1] -= det->bbox[3] / 2;\n            det->bbox[2] += det->bbox[0];\n            det->bbox[3] += det->bbox[1];\n            det->bbox[0] *= decodeplugin::INPUT_W;\n            det->bbox[1] *= decodeplugin::INPUT_H;\n            det->bbox[2] *= decodeplugin::INPUT_W;\n            det->bbox[3] *= decodeplugin::INPUT_H;\n            det->class_confidence = conf2;\n            for (int i = 0; i < 10; i += 2) {\n                det->landmark[i] = prior[0] + lmk_reg[idx + k * total_grid * 10 + total_grid * i] * 0.1 * prior[2];\n                det->landmark[i+1] = prior[1] + lmk_reg[idx + k * total_grid * 10 + total_grid * (i + 1)] * 0.1 * prior[3];\n                det->landmark[i] *= decodeplugin::INPUT_W;\n                det->landmark[i+1] *= decodeplugin::INPUT_H;\n            }\n        }\n    }\n\n    void DecodePlugin::forwardGpu(const float *const * inputs, float * output, cudaStream_t stream, int batchSize)\n    {\n        int num_elem = 0;\n        int base_step = 8;\n        int base_anchor = 16;\n        int thread_count;\n\n        int totalCount = 1;\n        totalCount += decodeplugin::INPUT_H / 8 * decodeplugin::INPUT_W / 8 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);\n        totalCount += decodeplugin::INPUT_H / 16 * decodeplugin::INPUT_W / 16 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);\n        totalCount += decodeplugin::INPUT_H / 32 * decodeplugin::INPUT_W / 32 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);\n        for(int idx = 0 ; idx < batchSize; ++idx) {\n            cudaMemsetAsync(output + idx * totalCount, 0, sizeof(float), stream);\n        }\n\n        for (unsigned int i = 0; i < 3; ++i)\n        {\n            num_elem = batchSize * decodeplugin::INPUT_H / base_step * decodeplugin::INPUT_W / base_step;\n            thread_count = (num_elem < thread_count_) ? 
num_elem : thread_count_;\n            CalDetection<<< (num_elem + thread_count - 1) / thread_count, thread_count, 0, stream>>>\n                (inputs[i], output, num_elem, base_step, base_anchor, totalCount);\n            base_step *= 2;\n            base_anchor *= 4;\n        }\n    }\n\n    int DecodePlugin::enqueue(int batchSize, const void*const * inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT\n    {\n        //GPU\n        //CUDA_CHECK(cudaStreamSynchronize(stream));\n        forwardGpu((const float *const *)inputs, (float *)outputs[0], stream, batchSize);\n        return 0;\n    }\n\n    PluginFieldCollection DecodePluginCreator::mFC{};\n    std::vector<PluginField> DecodePluginCreator::mPluginAttributes;\n\n    DecodePluginCreator::DecodePluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* DecodePluginCreator::getPluginName() const TRT_NOEXCEPT\n    {\n        return \"Decode_TRT\";\n    }\n\n    const char* DecodePluginCreator::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    const PluginFieldCollection* DecodePluginCreator::getFieldNames() TRT_NOEXCEPT\n    {\n        return &mFC;\n    }\n\n    IPluginV2IOExt* DecodePluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT\n    {\n        DecodePlugin* obj = new DecodePlugin();\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* DecodePluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call DecodePlugin::destroy()\n        DecodePlugin* obj = new DecodePlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return 
obj;\n    }\n\n}\n"
  },
  {
    "path": "retinaface/decode.h",
    "content": "#ifndef _DECODE_CU_H\n#define _DECODE_CU_H\n\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\nnamespace decodeplugin\n{\n    struct alignas(float) Detection{\n        float bbox[4];  //x1 y1 x2 y2\n        float class_confidence;\n        float landmark[10];\n    };\n    static const int INPUT_H = 480;\n    static const int INPUT_W = 640;\n}\n\nnamespace nvinfer1\n{\n    class DecodePlugin: public IPluginV2IOExt\n    {\n        public:\n            DecodePlugin();\n            DecodePlugin(const void* data, size_t length);\n\n            ~DecodePlugin();\n\n            int getNbOutputs() const TRT_NOEXCEPT override\n            {\n                return 1;\n            }\n\n            Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n            int initialize() TRT_NOEXCEPT override;\n\n            virtual void terminate() TRT_NOEXCEPT override {};\n\n            virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0;}\n\n            virtual int enqueue(int batchSize, const void*const * inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;\n\n            virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n            virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {\n                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n            }\n\n            const char* getPluginType() const TRT_NOEXCEPT override;\n\n            const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n            void destroy() TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n            void setPluginNamespace(const char* 
pluginNamespace) TRT_NOEXCEPT override;\n\n            const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override;\n\n            bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;\n\n            bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n            void attachToContext(\n                    cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n            void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT override;\n\n            void detachFromContext() TRT_NOEXCEPT override;\n\n            int input_size_;\n        private:\n            void forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize = 1);\n            int thread_count_ = 256;\n            const char* mPluginNamespace;\n    };\n\n    class DecodePluginCreator : public IPluginCreator\n    {\n        public:\n            DecodePluginCreator();\n\n            ~DecodePluginCreator() override = default;\n\n            const char* getPluginName() const TRT_NOEXCEPT override;\n\n            const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n            const PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT override;\n\n            void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override\n            {\n                mNamespace = libNamespace;\n            }\n\n            const char* getPluginNamespace() const TRT_NOEXCEPT 
override\n            {\n                return mNamespace.c_str();\n            }\n\n        private:\n            std::string mNamespace;\n            static PluginFieldCollection mFC;\n            static std::vector<PluginField> mPluginAttributes;\n    };\n    REGISTER_TENSORRT_PLUGIN(DecodePluginCreator);\n};\n\n#endif \n"
  },
  {
    "path": "retinaface/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into 
the stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "retinaface/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "retinaface/retina_mnet.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include \"common.hpp\"\n#include \"calibrator.h\"\n\n#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32\n#define DEVICE 0  // GPU id\n#define BATCH_SIZE 1\n#define CONF_THRESH 0.75\n#define IOU_THRESH 0.4\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = decodeplugin::INPUT_H;  // H, W must be able to  be divided by 32.\nstatic const int INPUT_W = decodeplugin::INPUT_W;;\nstatic const int OUTPUT_SIZE = (INPUT_H / 8 * INPUT_W / 8 + INPUT_H / 16 * INPUT_W / 16 + INPUT_H / 32 * INPUT_W / 32) * 2  * 15 + 1;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nstatic Logger gLogger;\n\nILayer* conv_bn(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, int oup, int s = 1, float leaky = 0.1) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, oup, DimsHW{3, 3}, getWeights(weightMap, lname + \".0.weight\"), emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{1, 1});\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".1\", 1e-5);\n    auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kLEAKY_RELU);\n    lr->setAlpha(leaky);\n    assert(lr);\n    return lr;\n}\n\nILayer* conv_bn_no_relu(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, int oup, int s = 1) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, oup, DimsHW{3, 3}, getWeights(weightMap, lname + \".0.weight\"), emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    
conv1->setPaddingNd(DimsHW{1, 1});\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".1\", 1e-5);\n    return bn1;\n}\n\nILayer* conv_bn1X1(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, int oup, int s = 1, float leaky = 0.1) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, oup, DimsHW{1, 1}, getWeights(weightMap, lname + \".0.weight\"), emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{0, 0});\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".1\", 1e-5);\n    auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kLEAKY_RELU);\n    lr->setAlpha(leaky);\n    assert(lr);\n    return lr;\n}\n\nILayer* conv_dw(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, int inp, int oup, int s = 1, float leaky = 0.1) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, inp, DimsHW{3, 3}, getWeights(weightMap, lname + \".0.weight\"), emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{1, 1});\n    conv1->setNbGroups(inp);\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".1\", 1e-5);\n    auto lr1 = network->addActivation(*bn1->getOutput(0), ActivationType::kLEAKY_RELU);\n    lr1->setAlpha(leaky);\n    assert(lr1);\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*lr1->getOutput(0), oup, DimsHW{1, 1}, getWeights(weightMap, lname + \".3.weight\"), emptywts);\n    assert(conv2);\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".4\", 1e-5);\n    auto lr2 = network->addActivation(*bn2->getOutput(0), ActivationType::kLEAKY_RELU);\n    
lr2->setAlpha(leaky);\n    assert(lr2);\n    return lr2;\n}\n\nIActivationLayer* ssh(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, int oup) {\n    auto conv3x3 = conv_bn_no_relu(network, weightMap, input, lname + \".conv3X3\", oup / 2);\n    auto conv5x5_1 = conv_bn(network, weightMap, input, lname + \".conv5X5_1\", oup / 4);\n    auto conv5x5 = conv_bn_no_relu(network, weightMap, *conv5x5_1->getOutput(0), lname + \".conv5X5_2\", oup / 4);\n    auto conv7x7 = conv_bn(network, weightMap, *conv5x5_1->getOutput(0), lname + \".conv7X7_2\", oup / 4);\n    conv7x7 = conv_bn_no_relu(network, weightMap, *conv7x7->getOutput(0), lname + \".conv7x7_3\", oup / 4);\n    ITensor* inputTensors[] = {conv3x3->getOutput(0), conv5x5->getOutput(0), conv7x7->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 3);\n    IActivationLayer* relu1 = network->addActivation(*cat->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    return relu1;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../retinaface.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    // ------------- backbone mobilenet0.25  ---------------\n    // stage 1\n    auto x = conv_bn(network, weightMap, *data, \"body.stage1.0\", 8, 2);\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage1.1\", 8, 16);\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage1.2\", 16, 32, 2);\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage1.3\", 32, 32);\n    x = conv_dw(network, 
weightMap, *x->getOutput(0), \"body.stage1.4\", 32, 64, 2);\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage1.5\", 64, 64);\n    auto stage1 = x;\n\n    // stage 2\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage2.0\", 64, 128, 2);\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage2.1\", 128, 128);\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage2.2\", 128, 128);\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage2.3\", 128, 128);\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage2.4\", 128, 128);\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage2.5\", 128, 128);\n    auto stage2 = x;\n\n    // stage 3\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage3.0\", 128, 256, 2);\n    x = conv_dw(network, weightMap, *x->getOutput(0), \"body.stage3.1\", 256, 256);\n    auto stage3 = x;\n\n    //Dims d1 = stage1->getOutput(0)->getDimensions();\n    //std::cout << d1.d[0] << \" \" << d1.d[1] << \" \" << d1.d[2] << std::endl;\n    // ------------- FPN ---------------\n    auto output1 = conv_bn1X1(network, weightMap, *stage1->getOutput(0), \"fpn.output1\", 64);\n    auto output2 = conv_bn1X1(network, weightMap, *stage2->getOutput(0), \"fpn.output2\", 64);\n    auto output3 = conv_bn1X1(network, weightMap, *stage3->getOutput(0), \"fpn.output3\", 64);\n\n    float *deval = reinterpret_cast<float*>(malloc(sizeof(float) * 64 * 2 * 2));\n    for (int i = 0; i < 64 * 2 * 2; i++) {\n        deval[i] = 1.0;\n    }\n    Weights deconvwts{DataType::kFLOAT, deval, 64 * 2 * 2};\n    IDeconvolutionLayer* up3 = network->addDeconvolutionNd(*output3->getOutput(0), 64, DimsHW{2, 2}, deconvwts, emptywts);\n    assert(up3);\n    up3->setStrideNd(DimsHW{2, 2});\n    up3->setNbGroups(64);\n    weightMap[\"up3\"] = deconvwts;\n\n    output2 = network->addElementWise(*output2->getOutput(0), *up3->getOutput(0), ElementWiseOperation::kSUM);\n    output2 = 
conv_bn(network, weightMap, *output2->getOutput(0), \"fpn.merge2\", 64);\n\n    IDeconvolutionLayer* up2 = network->addDeconvolutionNd(*output2->getOutput(0), 64, DimsHW{2, 2}, deconvwts, emptywts);\n    assert(up2);\n    up2->setStrideNd(DimsHW{2, 2});\n    up2->setNbGroups(64);\n    output1 = network->addElementWise(*output1->getOutput(0), *up2->getOutput(0), ElementWiseOperation::kSUM);\n    output1 = conv_bn(network, weightMap, *output1->getOutput(0), \"fpn.merge1\", 64);\n\n    // ------------- SSH ---------------\n    auto ssh1 = ssh(network, weightMap, *output1->getOutput(0), \"ssh1\", 64);\n    auto ssh2 = ssh(network, weightMap, *output2->getOutput(0), \"ssh2\", 64);\n    auto ssh3 = ssh(network, weightMap, *output3->getOutput(0), \"ssh3\", 64);\n\n    //// ------------- Head ---------------\n    auto bbox_head1 = network->addConvolutionNd(*ssh1->getOutput(0), 2 * 4, DimsHW{1, 1}, weightMap[\"BboxHead.0.conv1x1.weight\"], weightMap[\"BboxHead.0.conv1x1.bias\"]);\n    auto bbox_head2 = network->addConvolutionNd(*ssh2->getOutput(0), 2 * 4, DimsHW{1, 1}, weightMap[\"BboxHead.1.conv1x1.weight\"], weightMap[\"BboxHead.1.conv1x1.bias\"]);\n    auto bbox_head3 = network->addConvolutionNd(*ssh3->getOutput(0), 2 * 4, DimsHW{1, 1}, weightMap[\"BboxHead.2.conv1x1.weight\"], weightMap[\"BboxHead.2.conv1x1.bias\"]);\n\n    auto cls_head1 = network->addConvolutionNd(*ssh1->getOutput(0), 2 * 2, DimsHW{1, 1}, weightMap[\"ClassHead.0.conv1x1.weight\"], weightMap[\"ClassHead.0.conv1x1.bias\"]);\n    auto cls_head2 = network->addConvolutionNd(*ssh2->getOutput(0), 2 * 2, DimsHW{1, 1}, weightMap[\"ClassHead.1.conv1x1.weight\"], weightMap[\"ClassHead.1.conv1x1.bias\"]);\n    auto cls_head3 = network->addConvolutionNd(*ssh3->getOutput(0), 2 * 2, DimsHW{1, 1}, weightMap[\"ClassHead.2.conv1x1.weight\"], weightMap[\"ClassHead.2.conv1x1.bias\"]);\n\n    auto lmk_head1 = network->addConvolutionNd(*ssh1->getOutput(0), 2 * 10, DimsHW{1, 1}, weightMap[\"LandmarkHead.0.conv1x1.weight\"], 
weightMap[\"LandmarkHead.0.conv1x1.bias\"]);\n    auto lmk_head2 = network->addConvolutionNd(*ssh2->getOutput(0), 2 * 10, DimsHW{1, 1}, weightMap[\"LandmarkHead.1.conv1x1.weight\"], weightMap[\"LandmarkHead.1.conv1x1.bias\"]);\n    auto lmk_head3 = network->addConvolutionNd(*ssh3->getOutput(0), 2 * 10, DimsHW{1, 1}, weightMap[\"LandmarkHead.2.conv1x1.weight\"], weightMap[\"LandmarkHead.2.conv1x1.bias\"]);\n\n    //// ------------- Decode bbox, conf, landmark ---------------\n    ITensor* inputTensors1[] = {bbox_head1->getOutput(0), cls_head1->getOutput(0), lmk_head1->getOutput(0)};\n    auto cat1 = network->addConcatenation(inputTensors1, 3);\n    ITensor* inputTensors2[] = {bbox_head2->getOutput(0), cls_head2->getOutput(0), lmk_head2->getOutput(0)};\n    auto cat2 = network->addConcatenation(inputTensors2, 3);\n    ITensor* inputTensors3[] = {bbox_head3->getOutput(0), cls_head3->getOutput(0), lmk_head3->getOutput(0)};\n    auto cat3 = network->addConcatenation(inputTensors3, 3);\n\n    auto creator = getPluginRegistry()->getPluginCreator(\"Decode_TRT\", \"1\");\n    PluginFieldCollection pfc;\n    IPluginV2 *pluginObj = creator->createPlugin(\"decode\", &pfc);\n    ITensor* inputTensors[] = {cat1->getOutput(0), cat2->getOutput(0), cat3->getOutput(0)};\n    auto decodelayer = network->addPluginV2(inputTensors, 3, *pluginObj);\n    assert(decodelayer);\n\n    decodelayer->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*decodelayer->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << builder->platformHasFastInt8() << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2 *calibrator = new Int8EntropyCalibrator2(1, INPUT_W, INPUT_H, \"./widerface_calib/\", 
\"mnet_int8calib.table\", INPUT_BLOB_NAME);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*)(mem.second.values));\n        mem.second.values = NULL;\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    
CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv) {\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./retina_mnet -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./retina_mnet -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(BATCH_SIZE, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"retina_mnet.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 1;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"retina_mnet.engine\", std::ios::binary);\n        if (file.good()) {\n            
file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        return -1;\n    }\n\n    // prepare input data ---------------------------\n    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n    //    data[i] = 1.0;\n\n    cv::Mat img = cv::imread(\"worlds-largest-selfie.jpg\");\n    cv::Mat pr_img = preprocess_img(img, INPUT_W, INPUT_H);\n    //cv::imwrite(\"preprocessed.jpg\", pr_img);\n\n    // For multi-batch, I feed the same image multiple times.\n    // If you want to process different images in a batch, you need adapt it.\n    for (int b = 0; b < BATCH_SIZE; b++) {\n        float *p_data = &data[b * 3 * INPUT_H * INPUT_W];\n        for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n            p_data[i] = pr_img.at<cv::Vec3b>(i)[0] - 104.0;\n            p_data[i + INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[1] - 117.0;\n            p_data[i + 2 * INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[2] - 123.0;\n        }\n    }\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    //ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n\n    // Run inference\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\n    auto start = std::chrono::system_clock::now();\n    doInference(*context, data, prob, BATCH_SIZE);\n    auto end = std::chrono::system_clock::now();\n    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << \"us\" << std::endl;\n\n    for (int b = 0; b < 
BATCH_SIZE; b++) {\n        std::vector<decodeplugin::Detection> res;\n        nms(res, &prob[b * OUTPUT_SIZE], IOU_THRESH);\n        std::cout << \"number of detections -> \" << prob[b * OUTPUT_SIZE] << std::endl;\n        std::cout << \" -> \" << prob[b * OUTPUT_SIZE + 10] << std::endl;\n        std::cout << \"after nms -> \" << res.size() << std::endl;\n        cv::Mat tmp = img.clone();\n        for (size_t j = 0; j < res.size(); j++) {\n            if (res[j].class_confidence < CONF_THRESH) continue;\n            cv::Rect r = get_rect_adapt_landmark(tmp, INPUT_W, INPUT_H, res[j].bbox, res[j].landmark);\n            cv::rectangle(tmp, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            //cv::putText(tmp, std::to_string((int)(res[j].class_confidence * 100)) + \"%\", cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 1);\n            for (int k = 0; k < 10; k += 2) {\n                cv::circle(tmp, cv::Point(res[j].landmark[k], res[j].landmark[k + 1]), 1, cv::Scalar(255 * (k > 2), 255 * (k > 0 && k < 8), 255 * (k < 6)), 4);\n            }\n        }\n        cv::imwrite(std::to_string(b) + \"_result.jpg\", tmp);\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "retinaface/retina_r50.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include \"common.hpp\"\n#include \"calibrator.h\"\n\n#define USE_INT8  // set USE_INT8 or USE_FP16 or USE_FP32\n#define DEVICE 0  // GPU id\n#define BATCH_SIZE 1\n#define CONF_THRESH 0.75\n#define IOU_THRESH 0.4\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = decodeplugin::INPUT_H;  // H, W must be able to  be divided by 32.\nstatic const int INPUT_W = decodeplugin::INPUT_W;;\nstatic const int OUTPUT_SIZE = (INPUT_H / 8 * INPUT_W / 8 + INPUT_H / 16 * INPUT_W / 16 + INPUT_H / 32 * INPUT_W / 32) * 2  * 15 + 1;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nstatic Logger gLogger;\n\nIActivationLayer* bottleneck(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{1, 1}, weightMap[lname + \"conv1.weight\"], emptywts);\n    assert(conv1);\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"bn1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{3, 3}, weightMap[lname + \"conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{stride, stride});\n    conv2->setPaddingNd(DimsHW{1, 1});\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n    IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    IConvolutionLayer* conv3 = 
network->addConvolutionNd(*relu2->getOutput(0), outch * 4, DimsHW{1, 1}, weightMap[lname + \"conv3.weight\"], emptywts);\n    assert(conv3);\n\n    IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"bn3\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    if (stride != 1 || inch != outch * 4) {\n        IConvolutionLayer* conv4 = network->addConvolutionNd(input, outch * 4, DimsHW{1, 1}, weightMap[lname + \"downsample.0.weight\"], emptywts);\n        assert(conv4);\n        conv4->setStrideNd(DimsHW{stride, stride});\n\n        IScaleLayer* bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \"downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn4->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    } else {\n        ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer* relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\nILayer* conv_bn_relu(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int kernelsize, int stride, int padding, bool userelu, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{kernelsize, kernelsize}, getWeights(weightMap, lname + \".0.weight\"), emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{stride, stride});\n    conv1->setPaddingNd(DimsHW{padding, padding});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".1\", 1e-5);\n\n    if (!userelu) return bn1;\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    return relu1;\n}\n\nIActivationLayer* ssh(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname) {\n    auto 
conv3x3 = conv_bn_relu(network, weightMap, input, 256 / 2, 3, 1, 1, false, lname + \".conv3X3\");\n    auto conv5x5_1 = conv_bn_relu(network, weightMap, input, 256 / 4, 3, 1, 1, true, lname + \".conv5X5_1\");\n    auto conv5x5 = conv_bn_relu(network, weightMap, *conv5x5_1->getOutput(0), 256 / 4, 3, 1, 1, false, lname + \".conv5X5_2\");\n    auto conv7x7 = conv_bn_relu(network, weightMap, *conv5x5_1->getOutput(0), 256 / 4, 3, 1, 1, true, lname + \".conv7X7_2\");\n    conv7x7 = conv_bn_relu(network, weightMap, *conv7x7->getOutput(0), 256 / 4, 3, 1, 1, false, lname + \".conv7x7_3\");\n    ITensor* inputTensors[] = {conv3x3->getOutput(0), conv5x5->getOutput(0), conv7x7->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 3);\n    IActivationLayer* relu1 = network->addActivation(*cat->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    return relu1;\n}\n\n// Create the engine using only the API, without any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../retinaface.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    // ------------- backbone resnet50 ---------------\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{7, 7}, weightMap[\"body.conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{2, 2});\n    conv1->setPaddingNd(DimsHW{3, 3});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"body.bn1\", 1e-5);\n\n    // Add activation layer using the ReLU algorithm.\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n 
   // Add max pooling layer with a 3x3 kernel, a stride of 2x2 and a padding of 1x1.\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    pool1->setPaddingNd(DimsHW{1, 1});\n\n    IActivationLayer* x = bottleneck(network, weightMap, *pool1->getOutput(0), 64, 64, 1, \"body.layer1.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 64, 1, \"body.layer1.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 64, 1, \"body.layer1.2.\");\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 128, 2, \"body.layer2.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"body.layer2.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"body.layer2.2.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"body.layer2.3.\");\n    IActivationLayer* layer2 = x;\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 256, 2, \"body.layer3.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"body.layer3.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"body.layer3.2.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"body.layer3.3.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"body.layer3.4.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"body.layer3.5.\");\n    IActivationLayer* layer3 = x;\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 512, 2, \"body.layer4.0.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"body.layer4.1.\");\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"body.layer4.2.\");\n    IActivationLayer* layer4 = x;\n\n    // ------------- FPN ---------------\n    auto output1 = conv_bn_relu(network, weightMap, 
*layer2->getOutput(0), 256, 1, 1, 0, true, \"fpn.output1\");\n    auto output2 = conv_bn_relu(network, weightMap, *layer3->getOutput(0), 256, 1, 1, 0, true, \"fpn.output2\");\n    auto output3 = conv_bn_relu(network, weightMap, *layer4->getOutput(0), 256, 1, 1, 0, true, \"fpn.output3\");\n\n    float *deval = reinterpret_cast<float*>(malloc(sizeof(float) * 256 * 2 * 2));\n    for (int i = 0; i < 256 * 2 * 2; i++) {\n        deval[i] = 1.0;\n    }\n    Weights deconvwts{DataType::kFLOAT, deval, 256 * 2 * 2};\n    IDeconvolutionLayer* up3 = network->addDeconvolutionNd(*output3->getOutput(0), 256, DimsHW{2, 2}, deconvwts, emptywts);\n    assert(up3);\n    up3->setStrideNd(DimsHW{2, 2});\n    up3->setNbGroups(256);\n    weightMap[\"up3\"] = deconvwts;\n\n    output2 = network->addElementWise(*output2->getOutput(0), *up3->getOutput(0), ElementWiseOperation::kSUM);\n    output2 = conv_bn_relu(network, weightMap, *output2->getOutput(0), 256, 3, 1, 1, true, \"fpn.merge2\");\n\n    IDeconvolutionLayer* up2 = network->addDeconvolutionNd(*output2->getOutput(0), 256, DimsHW{2, 2}, deconvwts, emptywts);\n    assert(up2);\n    up2->setStrideNd(DimsHW{2, 2});\n    up2->setNbGroups(256);\n    output1 = network->addElementWise(*output1->getOutput(0), *up2->getOutput(0), ElementWiseOperation::kSUM);\n    output1 = conv_bn_relu(network, weightMap, *output1->getOutput(0), 256, 3, 1, 1, true, \"fpn.merge1\");\n\n    // ------------- SSH ---------------\n    auto ssh1 = ssh(network, weightMap, *output1->getOutput(0), \"ssh1\");\n    auto ssh2 = ssh(network, weightMap, *output2->getOutput(0), \"ssh2\");\n    auto ssh3 = ssh(network, weightMap, *output3->getOutput(0), \"ssh3\");\n\n    // ------------- Head ---------------\n    auto bbox_head1 = network->addConvolutionNd(*ssh1->getOutput(0), 2 * 4, DimsHW{1, 1}, weightMap[\"BboxHead.0.conv1x1.weight\"], weightMap[\"BboxHead.0.conv1x1.bias\"]);\n    auto bbox_head2 = network->addConvolutionNd(*ssh2->getOutput(0), 2 * 4, DimsHW{1, 1}, 
weightMap[\"BboxHead.1.conv1x1.weight\"], weightMap[\"BboxHead.1.conv1x1.bias\"]);\n    auto bbox_head3 = network->addConvolutionNd(*ssh3->getOutput(0), 2 * 4, DimsHW{1, 1}, weightMap[\"BboxHead.2.conv1x1.weight\"], weightMap[\"BboxHead.2.conv1x1.bias\"]);\n\n    auto cls_head1 = network->addConvolutionNd(*ssh1->getOutput(0), 2 * 2, DimsHW{1, 1}, weightMap[\"ClassHead.0.conv1x1.weight\"], weightMap[\"ClassHead.0.conv1x1.bias\"]);\n    auto cls_head2 = network->addConvolutionNd(*ssh2->getOutput(0), 2 * 2, DimsHW{1, 1}, weightMap[\"ClassHead.1.conv1x1.weight\"], weightMap[\"ClassHead.1.conv1x1.bias\"]);\n    auto cls_head3 = network->addConvolutionNd(*ssh3->getOutput(0), 2 * 2, DimsHW{1, 1}, weightMap[\"ClassHead.2.conv1x1.weight\"], weightMap[\"ClassHead.2.conv1x1.bias\"]);\n\n    auto lmk_head1 = network->addConvolutionNd(*ssh1->getOutput(0), 2 * 10, DimsHW{1, 1}, weightMap[\"LandmarkHead.0.conv1x1.weight\"], weightMap[\"LandmarkHead.0.conv1x1.bias\"]);\n    auto lmk_head2 = network->addConvolutionNd(*ssh2->getOutput(0), 2 * 10, DimsHW{1, 1}, weightMap[\"LandmarkHead.1.conv1x1.weight\"], weightMap[\"LandmarkHead.1.conv1x1.bias\"]);\n    auto lmk_head3 = network->addConvolutionNd(*ssh3->getOutput(0), 2 * 10, DimsHW{1, 1}, weightMap[\"LandmarkHead.2.conv1x1.weight\"], weightMap[\"LandmarkHead.2.conv1x1.bias\"]);\n\n    // ------------- Decode bbox, conf, landmark ---------------\n    ITensor* inputTensors1[] = {bbox_head1->getOutput(0), cls_head1->getOutput(0), lmk_head1->getOutput(0)};\n    auto cat1 = network->addConcatenation(inputTensors1, 3);\n    ITensor* inputTensors2[] = {bbox_head2->getOutput(0), cls_head2->getOutput(0), lmk_head2->getOutput(0)};\n    auto cat2 = network->addConcatenation(inputTensors2, 3);\n    ITensor* inputTensors3[] = {bbox_head3->getOutput(0), cls_head3->getOutput(0), lmk_head3->getOutput(0)};\n    auto cat3 = network->addConcatenation(inputTensors3, 3);\n\n    auto creator = getPluginRegistry()->getPluginCreator(\"Decode_TRT\", 
\"1\");\n    PluginFieldCollection pfc;\n    IPluginV2 *pluginObj = creator->createPlugin(\"decode\", &pfc);\n    ITensor* inputTensors[] = {cat1->getOutput(0), cat2->getOutput(0), cat3->getOutput(0)};\n    auto decodelayer = network->addPluginV2(inputTensors, 3, *pluginObj);\n    assert(decodelayer);\n\n    decodelayer->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*decodelayer->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << builder->platformHasFastInt8() << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2 *calibrator = new Int8EntropyCalibrator2(1, INPUT_W, INPUT_H, \"./widerface_calib/\", \"r50_int8calib.table\", INPUT_BLOB_NAME);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*)(mem.second.values));\n        mem.second.values = NULL;\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close 
everything down\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv) {\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./retina_r50 -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./retina_r50 -d   // deserialize plan file and 
run inference\" << std::endl;\n        return -1;\n    }\n\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(BATCH_SIZE, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"retina_r50.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 1;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"retina_r50.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        return -1;\n    }\n\n    // prepare input data ---------------------------\n    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n    //    data[i] = 1.0;\n\n    cv::Mat img = cv::imread(\"worlds-largest-selfie.jpg\");\n    cv::Mat pr_img = preprocess_img(img, INPUT_W, INPUT_H);\n    //cv::imwrite(\"preprocessed.jpg\", pr_img);\n\n    // For multi-batch, I feed the same image multiple times.\n    // If you want to process different images in a batch, you need adapt it.\n    for (int b = 0; b < BATCH_SIZE; b++) {\n        float *p_data = &data[b * 3 * INPUT_H * INPUT_W];\n        for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n            p_data[i] = pr_img.at<cv::Vec3b>(i)[0] - 104.0;\n            p_data[i + INPUT_H * INPUT_W] = 
pr_img.at<cv::Vec3b>(i)[1] - 117.0;\n            p_data[i + 2 * INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[2] - 123.0;\n        }\n    }\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    //ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n\n    // Run inference\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\n    for (int cc = 0; cc < 1000; cc++) {\n    auto start = std::chrono::system_clock::now();\n    doInference(*context, data, prob, BATCH_SIZE);\n    auto end = std::chrono::system_clock::now();\n    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << \"us\" << std::endl;\n    }\n\n    for (int b = 0; b < BATCH_SIZE; b++) {\n        std::vector<decodeplugin::Detection> res;\n        nms(res, &prob[b * OUTPUT_SIZE], IOU_THRESH);\n        std::cout << \"number of detections -> \" << prob[b * OUTPUT_SIZE] << std::endl;\n        std::cout << \" -> \" << prob[b * OUTPUT_SIZE + 10] << std::endl;\n        std::cout << \"after nms -> \" << res.size() << std::endl;\n        cv::Mat tmp = img.clone();\n        for (size_t j = 0; j < res.size(); j++) {\n            if (res[j].class_confidence < CONF_THRESH) continue;\n            cv::Rect r = get_rect_adapt_landmark(tmp, INPUT_W, INPUT_H, res[j].bbox, res[j].landmark);\n            cv::rectangle(tmp, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            //cv::putText(tmp, std::to_string((int)(res[j].class_confidence * 100)) + \"%\", cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 1);\n            for (int k = 0; k < 10; k += 2) {\n                cv::circle(tmp, cv::Point(res[j].landmark[k], res[j].landmark[k + 1]), 1, cv::Scalar(255 * (k > 2), 255 * (k > 0 && 
k < 8), 255 * (k < 6)), 4);\n            }\n        }\n        cv::imwrite(std::to_string(b) + \"_result.jpg\", tmp);\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "retinaface/retinaface_trt.py",
    "content": "\"\"\"\nUse TensorRT's Python api to make inferences.\n\"\"\"\n# -*- coding: utf-8 -*\nimport ctypes\nimport os\nimport random\nimport sys\nimport threading\nimport time\n\nimport cv2\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\nimport torch\nimport torchvision\n\nINPUT_H = 480  #defined in decode.h\nINPUT_W = 640\nCONF_THRESH = 0.75\nIOU_THRESHOLD = 0.4\nnp.set_printoptions(threshold=np.inf)\n\ndef plot_one_box(x, landmark,img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n\n    param:\n        x:     a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n            line_thickness or round(0.001 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n\n    cv2.circle(img, (int(landmark[0]), int(landmark[1])), 1, (0, 0, 255), 4)\n    cv2.circle(img, (int(landmark[2]), int(landmark[3])), 1, (0, 255, 255), 4)\n    cv2.circle(img, (int(landmark[4]), int(landmark[5])), 1, (255, 0, 255), 4)\n    cv2.circle(img, (int(landmark[6]), int(landmark[7])), 1, (0, 255, 0), 4)\n    cv2.circle(img, (int(landmark[8]), int(landmark[9])), 1, (255, 0, 0), 4)\n\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n         
   [225, 255, 255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n        )\n\n\nclass Retinaface_trt(object):\n    \"\"\"\n    description: A RetinaFace class that wraps TensorRT ops, plus preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device.\n        self.cfx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n\n    def infer(self, input_image_path):\n        
threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n\n        self.cfx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        engine = self.engine\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        input_image, image_raw, origin_h, origin_w = self.preprocess_image(\n            input_image_path\n        )\n        a = time.time()\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], input_image.ravel())\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.cfx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n\n        # Do postprocess\n        result_boxes, result_scores, result_landmark = self.post_process(\n            output, origin_h, origin_w\n        )\n        b = time.time()-a\n        print(b)\n\n        # Draw rectangles and labels on the original image\n\n        # Save image\n        for i in range(len(result_boxes)):\n            box = result_boxes[i]\n            landmark = result_landmark[i]\n            plot_one_box(\n                box,\n                landmark,\n                image_raw,\n                label=\"{}:{:.2f}\".format( 'Face', result_scores[i]))\n        parent, filename = os.path.split(input_image_path)\n        
save_name = os.path.join(parent, \"output_\" + filename)\n\n        cv2.imwrite(save_name, image_raw)\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.cfx.pop()\n\n    def preprocess_image(self, input_image_path):\n        \"\"\"\n        description: Read an image from image path, resize and pad it to target size,\n                     subtract the per-channel mean, transform to NCHW format.\n        param:\n            input_image_path: str, image path\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = cv2.imread(input_image_path)\n        h, w, c = image_raw.shape\n\n        # Calculate the resized width and height and the paddings\n        r_w = INPUT_W / w\n        r_h = INPUT_H / h\n        if r_h > r_w:\n            tw = INPUT_W\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((INPUT_H - th) / 2)\n            ty2 = INPUT_H - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = INPUT_H\n            tx1 = int((INPUT_W - tw) / 2)\n            tx2 = INPUT_W - tw - tx1\n            ty1 = ty2 = 0\n\n        # Resize the image to fit the target size while maintaining the aspect ratio\n        image = cv2.resize(image_raw, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, value=(128, 128, 128)\n        )\n        image = image.astype(np.float32)\n\n        # Subtract the per-channel mean\n        image -= (104, 117, 123)\n        # HWC to CHW format\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, 
x,landmark):\n\n        y = torch.zeros_like(x) if isinstance(x, torch.Tensor) else np.zeros_like(x)\n\n        r_w = INPUT_W / origin_w\n        r_h = INPUT_H / origin_h\n\n        if r_h > r_w:\n            y[:, 0] = x[:, 0] / r_w\n            y[:, 2] = x[:, 2] / r_w\n            y[:, 1] = (x[:, 1] - (INPUT_H - r_w * origin_h) / 2) / r_w\n            y[:, 3] = (x[:, 3] - (INPUT_H - r_w * origin_h) / 2) / r_w\n            \n            landmark[:,0] = landmark[:,0]/r_w\n            landmark[:,1] = (landmark[:,1] - (INPUT_H - r_w * origin_h) / 2)/r_w\n            landmark[:,2] = landmark[:,2]/r_w\n            landmark[:,3] = (landmark[:,3] - (INPUT_H - r_w * origin_h) / 2)/r_w\n            landmark[:,4] = landmark[:,4]/r_w\n            landmark[:,5] = (landmark[:,5] - (INPUT_H - r_w * origin_h) / 2)/r_w\n            landmark[:,6] = landmark[:,6]/r_w\n            landmark[:,7] = (landmark[:,7] - (INPUT_H - r_w * origin_h) / 2)/r_w\n            landmark[:,8] = landmark[:,8]/r_w\n            landmark[:,9] = (landmark[:,9] - (INPUT_H - r_w * origin_h) / 2)/r_w\n        else:\n            y[:, 0] = (x[:, 0] - (INPUT_W - r_h * origin_w) / 2) / r_h\n            y[:, 2] = (x[:, 2] - (INPUT_W - r_h * origin_w) / 2) / r_h\n            y[:, 1] = x[:, 1] /r_h\n            y[:, 3] = x[:, 3] /r_h\n\n            landmark[:,0] = (landmark[:,0] - (INPUT_W - r_h * origin_w) / 2)/r_h\n            landmark[:,1] = landmark[:,1]/ r_h\n            landmark[:,2] = (landmark[:,2] - (INPUT_W - r_h * origin_w) / 2)/r_h\n            landmark[:,3] = landmark[:,3]/ r_h\n            landmark[:,4] = (landmark[:,4] - (INPUT_W - r_h * origin_w) / 2)/r_h\n            landmark[:,5] = landmark[:,5]/ r_h\n            landmark[:,6] = (landmark[:,6] - (INPUT_W - r_h * origin_w) / 2)/r_h\n            landmark[:,7] = landmark[:,7]/ r_h\n            landmark[:,8] = (landmark[:,8] - (INPUT_W - r_h * origin_w) / 2)/r_h\n            landmark[:,9] = landmark[:,9]/ r_h\n\n        return y, landmark\n\n    def 
post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A tensor like [num_boxes,x1,y1,x2,y2,conf,landmark_x1,landmark_y1,\n            landmark_x2,landmark_y2,...]\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a boxes tensor, each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a tensor, each element is the score corresponding to a box\n            result_landmark: final landmarks, a tensor, each row holds the 10 landmark coordinates corresponding to a box\n        \"\"\"\n        # Get the num of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, 15))[:num, :]\n        # Convert to a torch Tensor\n        pred = torch.Tensor(pred).cuda()\n        # Get the boxes\n        boxes = pred[:, :4]\n        # Get the scores\n        scores = pred[:, 4]\n        # Get the landmark\n        landmark = pred[:,5:15]\n        # Choose those boxes that score > CONF_THRESH\n        si = scores > CONF_THRESH\n        boxes = boxes[si, :]\n        scores = scores[si]\n\n        landmark = landmark[si,:]\n\n        # Map boxes and landmarks back to the original image coordinates\n        boxes,landmark = self.xywh2xyxy(origin_h, origin_w, boxes,landmark)\n        # Do nms\n        indices = torchvision.ops.nms(boxes, scores, iou_threshold=IOU_THRESHOLD).cpu()\n        result_boxes = boxes[indices, :].cpu()\n        result_scores = scores[indices].cpu()\n        result_landmark = landmark[indices].cpu()\n        return result_boxes, result_scores, result_landmark\n\nclass myThread(threading.Thread):\n    def __init__(self, func, args):\n        threading.Thread.__init__(self)\n        self.func = func\n        self.args = args\n\n    def run(self):\n        self.func(*self.args)\n\nif __name__ == \"__main__\":\n    # load custom 
plugins,make sure it has been generated\n    PLUGIN_LIBRARY = \"build/libdecodeplugin.so\"\n    ctypes.CDLL(PLUGIN_LIBRARY)\n    engine_file_path = \"build/retina_r50.engine\"\n\n    retinaface = Retinaface_trt(engine_file_path)\n    input_image_paths = [\"zidane.jpg\"]\n    for i in range(10):\n        for input_image_path in input_image_paths:\n            # create a new thread to do inference\n            thread = myThread(retinaface.infer, [input_image_path])\n            thread.start()\n            thread.join()\n\n    # destroy the instance\n    retinaface.destroy()\n"
  },
  {
    "path": "retinafaceAntiCov/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(retinafaceAntiCov)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\nif (CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n    message(\"embed_platform on\")\n    include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n    link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n    message(\"embed_platform off\")\n    # cuda\n    include_directories(/usr/local/cuda/include)\n    link_directories(/usr/local/cuda/lib64)\n\n    # tensorrt\n    include_directories(/home/lindsay/TensorRT-8.6.1.6/include)\n    link_directories(/home/lindsay/TensorRT-8.6.1.6/lib)\n    #  include_directories(/home/lindsay/TensorRT-7.2.3.4/include)\n    #  link_directories(/home/lindsay/TensorRT-7.2.3.4/lib)\n\n\nendif()\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\ncuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/decode.cu)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(retinafaceAntiCov ${PROJECT_SOURCE_DIR}/retinafaceAntiCov.cpp)\ntarget_link_libraries(retinafaceAntiCov nvinfer)\ntarget_link_libraries(retinafaceAntiCov cudart)\ntarget_link_libraries(retinafaceAntiCov myplugins)\ntarget_link_libraries(retinafaceAntiCov ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "retinafaceAntiCov/README.md",
    "content": "# RetinaFaceAntiCov\n\n The mxnet implementation is [deepinsight/insightface/RetinaFaceAntiCov](https://github.com/deepinsight/insightface/tree/master/RetinaFaceAntiCov).\n\n## Run\n\n```\n1. generate retinafaceAntiCov.wts from mxnet implementation.\n\ngit clone https://github.com/deepinsight/insightface.git\ncd insightface/RetinaFaceAntiCov\n// download its weights 'cov2.zip', put it into insightface/RetinaFaceAntiCov, and unzip it\n// put tensorrtx/retinafaceAntiCov/gen_wts.py into insightface/RetinaFaceAntiCov\npython gen_wts.py\n// a file 'retinafaceAntiCov.wts' will be generated.\n\n2. put retinafaceAntiCov.wts into tensorrtx/retinafaceAntiCov, build and run\n\ngit clone https://github.com/wang-xinyu/tensorrtx.git\ncd tensorrtx/retinafaceAntiCov\n// put retinafaceAntiCov.wts here\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./retinafaceAntiCov -s  // build and serialize model to file i.e. 'retinafaceAntiCov.engine'\nwget http://www.kaixian.tv/gd/d/file/201611/07/23efff3a26e2385620e719378c654fb1.jpg -O test.jpg\nsudo ./retinafaceAntiCov -d  // deserialize model file and run inference.\n\n3. check the image generated, as follows 'out.jpg'\n```\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/84776553-069c5f80-b013-11ea-893c-70a138b843d6.jpg\">\n</p>\n\n## Config\n\n- Input shape `INPUT_H`, `INPUT_W` defined in `decode.h`\n- FP16/FP32 can be selected by the macro `USE_FP16` in `retinafaceAntiCov.cpp`\n- GPU id can be selected by the macro `DEVICE` in `retinafaceAntiCov.cpp`\n\n## More Information\n\nSee the readme in [home page.](https://github.com/wang-xinyu/tensorrtx)\n"
  },
  {
    "path": "retinafaceAntiCov/decode.cu",
    "content": "#include \"decode.h\"\n#include \"stdio.h\"\n\nnamespace nvinfer1\n{\n    DecodePlugin::DecodePlugin()\n    {\n    }\n\n    DecodePlugin::~DecodePlugin()\n    {\n    }\n\n    // create the plugin at runtime from a byte stream\n    DecodePlugin::DecodePlugin(const void* data, size_t length)\n    {\n    }\n\n    void DecodePlugin::serialize(void* buffer) const TRT_NOEXCEPT\n    {\n    }\n\n    size_t DecodePlugin::getSerializationSize() const TRT_NOEXCEPT\n    {  \n        return 0;\n    }\n\n    int DecodePlugin::initialize() TRT_NOEXCEPT\n    { \n        return 0;\n    }\n\n    Dims DecodePlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT\n    {\n        //output the result to channel\n        int totalCount = 0;\n        totalCount += decodeplugin::INPUT_H / 8 * decodeplugin::INPUT_W / 8 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);\n        totalCount += decodeplugin::INPUT_H / 16 * decodeplugin::INPUT_W / 16 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);\n        totalCount += decodeplugin::INPUT_H / 32 * decodeplugin::INPUT_W / 32 * 2 * sizeof(decodeplugin::Detection) / sizeof(float);\n\n        return Dims3(totalCount + 1, 1, 1);\n    }\n\n    // Set plugin namespace\n    void DecodePlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT\n    {\n        mPluginNamespace = pluginNamespace;\n    }\n\n    const char* DecodePlugin::getPluginNamespace() const TRT_NOEXCEPT\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType DecodePlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT\n    {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool DecodePlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT\n    {\n        
return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool DecodePlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    void DecodePlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT\n    {\n    }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void DecodePlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void DecodePlugin::detachFromContext() TRT_NOEXCEPT {}\n\n    const char* DecodePlugin::getPluginType() const TRT_NOEXCEPT\n    {\n        return \"Decode_TRT\";\n    }\n\n    const char* DecodePlugin::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    void DecodePlugin::destroy() TRT_NOEXCEPT\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    IPluginV2IOExt* DecodePlugin::clone() const TRT_NOEXCEPT\n    {\n        DecodePlugin *p = new DecodePlugin();\n        p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float Logist(float data){ return 1./(1. 
+ expf(-data)); };\n\n    __global__ void CalDetection(const float *input, float *output, int num_elem, int step, int anchor) {\n\n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= num_elem) return;\n\n        int h = decodeplugin::INPUT_H / step;\n        int w = decodeplugin::INPUT_W / step;\n        int y = idx / w;\n        int x = idx % w;\n        const float *cls_reg = &input[2 * num_elem];\n        const float *bbox_reg = &input[4 * num_elem];\n        const float *lmk_reg = &input[12 * num_elem];\n        const float *mask_reg = &input[36 * num_elem];\n\n        for (int k = 0; k < 2; ++k) {\n            float conf = cls_reg[idx + k * num_elem];\n            if (conf < 0.5) continue;\n\n            float *res_count = output;\n            int count = (int)atomicAdd(res_count, 1);\n            char* data = (char *)res_count + sizeof(float) + count * sizeof(decodeplugin::Detection);\n            decodeplugin::Detection* det = (decodeplugin::Detection*)(data);\n\n            float prior[4];\n            prior[0] = 7.5 + (float)(x * step);\n            prior[1] = 7.5 + (float)(y * step);\n            prior[2] = anchor * 2 / (k + 1);\n            prior[3] = prior[2];\n\n            //Location\n            det->bbox[0] = prior[0] + bbox_reg[idx + k * num_elem * 4] * prior[2];\n            det->bbox[1] = prior[1] + bbox_reg[idx + k * num_elem * 4 + num_elem] * prior[3];\n            det->bbox[2] = prior[2] * expf(bbox_reg[idx + k * num_elem * 4 + num_elem * 2]);\n            det->bbox[3] = prior[3] * expf(bbox_reg[idx + k * num_elem * 4 + num_elem * 3]);\n            det->bbox[0] -= (det->bbox[2] - 1) / 2;\n            det->bbox[1] -= (det->bbox[3] - 1) / 2;\n            det->bbox[2] += det->bbox[0];\n            det->bbox[3] += det->bbox[1];\n            det->class_confidence = conf;\n            for (int i = 0; i < 10; i += 2) {\n                det->landmark[i] = prior[0] + lmk_reg[idx + k * num_elem * 10 + num_elem * i] * 0.2 * 
prior[2];\n                det->landmark[i+1] = prior[1] + lmk_reg[idx + k * num_elem * 10 + num_elem * (i + 1)] * 0.2 * prior[3];\n            }\n            det->mask_confidence = mask_reg[idx + k * num_elem];\n        }\n    }\n\n    void DecodePlugin::forwardGpu(const float *const * inputs, float * output, cudaStream_t stream, int batchSize)\n    {\n        int num_elem = 0;\n        int base_step = 8;\n        int base_anchor = 16;\n        int thread_count;\n        // reset the detection counter on the same stream the kernels run on\n        cudaMemsetAsync(output, 0, sizeof(float), stream);\n        for (unsigned int i = 0; i < 3; ++i)\n        {\n            num_elem = decodeplugin::INPUT_H / base_step * decodeplugin::INPUT_W / base_step;\n            thread_count = (num_elem < thread_count_) ? num_elem : thread_count_;\n            // launch on the stream passed to enqueue() instead of the default stream\n            CalDetection<<< (num_elem + thread_count - 1) / thread_count, thread_count, 0, stream>>>\n                (inputs[i], output, num_elem, base_step, base_anchor);\n            base_step *= 2;\n            base_anchor *= 4;\n        }\n    }\n\n    int DecodePlugin::enqueue(int batchSize, const void*const * inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT\n    {\n        //assert(batchSize == 1);\n        //GPU\n        //CUDA_CHECK(cudaStreamSynchronize(stream));\n        forwardGpu((const float *const *)inputs,(float *)outputs[0],stream,batchSize);\n\n        return 0;\n    }\n\n    PluginFieldCollection DecodePluginCreator::mFC{};\n    std::vector<PluginField> DecodePluginCreator::mPluginAttributes;\n\n    DecodePluginCreator::DecodePluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* DecodePluginCreator::getPluginName() const TRT_NOEXCEPT\n    {\n        return \"Decode_TRT\";\n    }\n\n    const char* DecodePluginCreator::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    const PluginFieldCollection* 
DecodePluginCreator::getFieldNames() TRT_NOEXCEPT\n    {\n        return &mFC;\n    }\n\n    IPluginV2IOExt* DecodePluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT\n    {\n        DecodePlugin* obj = new DecodePlugin();\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* DecodePluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call DecodePlugin::destroy()\n        DecodePlugin* obj = new DecodePlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n}\n"
  },
  {
    "path": "retinafaceAntiCov/decode.h",
    "content": "#ifndef _DECODE_CU_H\n#define _DECODE_CU_H\n\n#include <string>\n#include <vector>\n#include <iostream>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\n\n\nnamespace decodeplugin\n{\n    struct alignas(float) Detection{\n        float bbox[4];  //x1 y1 x2 y2\n        float class_confidence;\n        float landmark[10];\n        float mask_confidence;\n    };\n    static const int INPUT_H = 640;\n    static const int INPUT_W = 640;\n\n//    std::ostream& operator << (std::ostream& os, const decodeplugin::Detection& det) {\n//        for(int i = 0; i < 10; i += 2){\n//            os << det.mask_confidence << \" \";\n//        }\n//        return os;\n//    }\n}\n\n\nnamespace nvinfer1\n{\n    class DecodePlugin: public IPluginV2IOExt\n    {\n        public:\n            DecodePlugin();\n            DecodePlugin(const void* data, size_t length);\n\n            ~DecodePlugin();\n\n            int getNbOutputs() const TRT_NOEXCEPT override\n            {\n                return 1;\n            }\n\n            Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n            int initialize() TRT_NOEXCEPT override;\n\n            virtual void terminate() TRT_NOEXCEPT override {};\n\n            virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0;}\n\n            virtual int enqueue(int batchSize, const void*const * inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;\n\n            virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n            virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {\n                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n            }\n\n            const char* 
getPluginType() const TRT_NOEXCEPT override;\n\n            const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n            void destroy() TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n            void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n            const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override;\n\n            bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;\n\n            bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n            void attachToContext(\n                    cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n            void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT override;\n\n            void detachFromContext() TRT_NOEXCEPT override;\n\n            int input_size_;\n        private:\n            void forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize = 1);\n            int thread_count_ = 256;\n            const char* mPluginNamespace;\n    };\n\n    class DecodePluginCreator : public IPluginCreator\n    {\n        public:\n            DecodePluginCreator();\n\n            ~DecodePluginCreator() TRT_NOEXCEPT override = default;\n\n            const char* getPluginName() const TRT_NOEXCEPT override;\n\n            const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n            const PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* deserializePlugin(const 
char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT override;\n\n            void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override\n            {\n                mNamespace = libNamespace;\n            }\n\n            const char* getPluginNamespace() const TRT_NOEXCEPT override\n            {\n                return mNamespace.c_str();\n            }\n\n        private:\n            std::string mNamespace;\n            static PluginFieldCollection mFC;\n            static std::vector<PluginField> mPluginAttributes;\n    };\n};\n\n#endif \n"
  },
  {
    "path": "retinafaceAntiCov/gen_wts.py",
    "content": "import struct\nfrom retinaface_cov import RetinaFaceCoV\n\ngpuid = 0\nmodel = RetinaFaceCoV('./cov2/mnet_cov2', 0, gpuid, 'net3l')\n\nf = open('retinafaceAntiCov.wts', 'w')\nf.write('{}\\n'.format(len(model.model.get_params()[0].keys()) + len(model.model.get_params()[1].keys())))\nfor k, v in model.model.get_params()[0].items():\n    vr = v.reshape(-1).asnumpy()\n    f.write('{} {} '.format(k, len(vr)))\n    for vv in vr:\n        f.write(' ')\n        f.write(struct.pack('>f',float(vv)).hex())\n    f.write('\\n')\nfor k, v in model.model.get_params()[1].items():\n    vr = v.reshape(-1).asnumpy()\n    f.write('{} {} '.format(k, len(vr)))\n    for vv in vr:\n        f.write(' ')\n        f.write(struct.pack('>f',float(vv)).hex())\n    f.write('\\n')\n\n"
  },
  {
    "path": "retinafaceAntiCov/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the 
stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "retinafaceAntiCov/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H"
  },
  {
    "path": "retinafaceAntiCov/retinafaceAntiCov.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <opencv2/opencv.hpp>\n#include <dirent.h>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include \"decode.h\"\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n//#define USE_FP16  // comment out this if want to use FP32\n#define DEVICE 0  // GPU id\n#define BATCH_SIZE 1  // currently, only support BATCH=1\n\nusing namespace nvinfer1;\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = decodeplugin::INPUT_H;\nstatic const int INPUT_W = decodeplugin::INPUT_W;\nstatic const int DETECTION_SIZE = sizeof(decodeplugin::Detection) / sizeof(float);\nstatic const int OUTPUT_SIZE = (INPUT_H / 8 * INPUT_W / 8 + INPUT_H / 16 * INPUT_W / 16 + INPUT_H / 32 * INPUT_W / 32) * 2  * DETECTION_SIZE + 1;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nstatic Logger gLogger;\nREGISTER_TENSORRT_PLUGIN(DecodePluginCreator);\n\ncv::Mat preprocess_img(cv::Mat& img) {\n    int w, h, x, y;\n    float r_w = INPUT_W / (img.cols*1.0);\n    float r_h = INPUT_H / (img.rows*1.0);\n    if (r_h > r_w) {\n        w = INPUT_W;\n        h = r_w * img.rows;\n        x = 0;\n        y = (INPUT_H - h) / 2;\n    } else {\n        w = r_h* img.cols;\n        h = INPUT_H;\n        x = (INPUT_W - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_CUBIC);\n    cv::Mat out(INPUT_H, INPUT_W, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\ncv::Rect get_rect_adapt_landmark(cv::Mat& img, float bbox[4], float lmk[10]) {\n    int l, r, t, b;\n    float 
r_w = INPUT_W / (img.cols * 1.0);\n    float r_h = INPUT_H / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] / r_w;\n        r = bbox[2] / r_w;\n        t = (bbox[1] - (INPUT_H - r_w * img.rows) / 2) / r_w;\n        b = (bbox[3] - (INPUT_H - r_w * img.rows) / 2) / r_w;\n        for (int i = 0; i < 10; i += 2) {\n            lmk[i] /= r_w;\n            lmk[i + 1] = (lmk[i + 1] - (INPUT_H - r_w * img.rows) / 2) / r_w;\n        }\n    } else {\n        l = (bbox[0] - (INPUT_W - r_h * img.cols) / 2) / r_h;\n        r = (bbox[2] - (INPUT_W - r_h * img.cols) / 2) / r_h;\n        t = bbox[1] / r_h;\n        b = bbox[3] / r_h;\n        for (int i = 0; i < 10; i += 2) {\n            lmk[i] = (lmk[i] - (INPUT_W - r_h * img.cols) / 2) / r_h;\n            lmk[i + 1] /= r_h;\n        }\n    }\n    return cv::Rect(l, t, r-l, b-t);\n}\n\nfloat iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n        std::max(lbox[0], rbox[0]), //left\n        std::min(lbox[2], rbox[2]), //right\n        std::max(lbox[1], rbox[1]), //top\n        std::min(lbox[3], rbox[3]), //bottom\n    };\n\n    if(interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS = (interBox[1] - interBox[0]) * (interBox[3] - interBox[2]);\n    return interBoxS / ((lbox[2] - lbox[0]) * (lbox[3] - lbox[1]) + (rbox[2] - rbox[0]) * (rbox[3] - rbox[1]) -interBoxS + 0.000001f);\n}\n\nbool cmp(decodeplugin::Detection& a, decodeplugin::Detection& b) {\n    return a.class_confidence > b.class_confidence;\n}\n\nvoid nms(std::vector<decodeplugin::Detection>& res, float *output, float nms_thresh = 0.4) {\n    std::vector<decodeplugin::Detection> dets;\n    for (int i = 0; i < output[0]; i++) {\n        if (output[DETECTION_SIZE * i + 1 + 4] <= 0.1) continue;\n        decodeplugin::Detection det;\n        memcpy(&det, &output[DETECTION_SIZE * i + 1], sizeof(decodeplugin::Detection));\n        dets.push_back(det);\n    }\n    std::sort(dets.begin(), dets.end(), 
cmp);\n    if (dets.size() > 5000) dets.erase(dets.begin() + 5000, dets.end());\n    for (size_t m = 0; m < dets.size(); ++m) {\n        auto& item = dets[m];\n        res.push_back(item);\n        //std::cout << item.class_confidence << \" bbox \" << item.bbox[0] << \", \" << item.bbox[1] << \", \" << item.bbox[2] << \", \" << item.bbox[3] << std::endl;\n        for (size_t n = m + 1; n < dets.size(); ++n) {\n            if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                dets.erase(dets.begin()+n);\n                --n;\n            }\n        }\n    }\n}\n\n// TensorRT weight files have a simple space delimited format:\n// [name] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob (each value is a 32-bit word stored as hex)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \"_gamma\"].values;\n    float *beta = (float*)weightMap[lname + 
\"_beta\"].values;\n    float *mean = (float*)weightMap[lname + \"_moving_mean\"].values;\n    float *var = (float*)weightMap[lname + \"_moving_var\"].values;\n    int len = weightMap[lname + \"_moving_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* convBnRelu(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int num_filters, int k, int s, int p, int g, std::string lname) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv = network->addConvolutionNd(input, num_filters, DimsHW{k, k}, weightMap[lname + \"_conv2d_weight\"], emptywts);\n    assert(conv);\n    conv->setStrideNd(DimsHW{s, s});\n    conv->setPaddingNd(DimsHW{p, p});\n    conv->setNbGroups(g);\n    auto bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \"_batchnorm\", 1e-3);\n    IActivationLayer* relu = network->addActivation(*bn->getOutput(0), ActivationType::kRELU);\n    assert(relu);\n    return relu;\n}\n\nILayer* convBiasBnRelu(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, 
int num_filters, int k, int s, int p, std::string lname) {\n    IConvolutionLayer* conv = network->addConvolutionNd(input, num_filters, DimsHW{k, k}, weightMap[lname + \"_weight\"], weightMap[lname + \"_bias\"]);\n    assert(conv);\n    conv->setStrideNd(DimsHW{s, s});\n    conv->setPaddingNd(DimsHW{p, p});\n    auto bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \"_bn\", 2e-5);\n    IActivationLayer* relu = network->addActivation(*bn->getOutput(0), ActivationType::kRELU);\n    assert(relu);\n    return relu;\n}\n\nILayer* head(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname) {\n    auto conv1 = network->addConvolutionNd(input, 32, DimsHW{3, 3}, weightMap[lname + \"_conv1_weight\"], weightMap[lname + \"_conv1_bias\"]);\n    assert(conv1);\n    conv1->setPaddingNd(DimsHW{1, 1});\n    auto conv1bn = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"_conv1_bn\", 2e-5);\n\n    auto ctxconv1 = convBiasBnRelu(network, weightMap, input, 16, 3, 1, 1, lname + \"_context_conv1\");\n\n    auto ctxconv2 = network->addConvolutionNd(*ctxconv1->getOutput(0), 16, DimsHW{3, 3}, weightMap[lname + \"_context_conv2_weight\"], weightMap[lname + \"_context_conv2_bias\"]);\n    assert(ctxconv2);\n    ctxconv2->setPaddingNd(DimsHW{1, 1});\n    auto ctxconv2bn = addBatchNorm2d(network, weightMap, *ctxconv2->getOutput(0), lname + \"_context_conv2_bn\", 2e-5);\n\n    auto ctxconv3_1 = convBiasBnRelu(network, weightMap, *ctxconv1->getOutput(0), 16, 3, 1, 1, lname + \"_context_conv3_1\");\n    auto ctxconv3_2 = network->addConvolutionNd(*ctxconv3_1->getOutput(0), 16, DimsHW{3, 3}, weightMap[lname + \"_context_conv3_2_weight\"], weightMap[lname + \"_context_conv3_2_bias\"]);\n    assert(ctxconv3_2);\n    ctxconv3_2->setPaddingNd(DimsHW{1, 1});\n    auto ctxconv3_2bn = addBatchNorm2d(network, weightMap, *ctxconv3_2->getOutput(0), lname + \"_context_conv3_2_bn\", 2e-5);\n\n    ITensor* inputTensors[] 
= {conv1bn->getOutput(0), ctxconv2bn->getOutput(0), ctxconv3_2bn->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 3);\n    assert(cat);\n\n    IActivationLayer* relu = network->addActivation(*cat->getOutput(0), ActivationType::kRELU);\n    assert(relu);\n    return relu;\n}\n\nILayer* reshapeSoftmax(INetworkDefinition *network, ITensor& input, int c) {\n    auto re1 = network->addShuffle(input);\n    assert(re1);\n    re1->setReshapeDimensions(Dims3(c / 2, -1, 0));\n\n    auto sm = network->addSoftMax(*re1->getOutput(0));\n    assert(sm);\n\n    auto re2 = network->addShuffle(*sm->getOutput(0));\n    assert(re2);\n    re2->setReshapeDimensions(Dims3(c, -1, 0));\n\n    return re2;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../retinafaceAntiCov.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    auto conv1 = convBnRelu(network, weightMap, *data, 16, 3, 2, 1, 1, \"conv_1\");\n    auto conv2 = convBnRelu(network, weightMap, *conv1->getOutput(0), 32, 1, 1, 0, 1, \"conv_2\");\n    auto conv3dw = convBnRelu(network, weightMap, *conv2->getOutput(0), 32, 3, 2, 1, 32, \"conv_3_dw\");\n    auto conv3 = convBnRelu(network, weightMap, *conv3dw->getOutput(0), 32, 1, 1, 0, 1, \"conv_3\");\n    auto conv4dw = convBnRelu(network, weightMap, *conv3->getOutput(0), 32, 3, 1, 1, 32, \"conv_4_dw\");\n    auto conv4 = convBnRelu(network, weightMap, *conv4dw->getOutput(0), 32, 1, 1, 0, 1, \"conv_4\");\n    auto conv5dw = convBnRelu(network, weightMap, *conv4->getOutput(0), 32, 3, 2, 1, 32, 
\"conv_5_dw\");\n    auto conv5 = convBnRelu(network, weightMap, *conv5dw->getOutput(0), 64, 1, 1, 0, 1, \"conv_5\");\n    auto conv6dw = convBnRelu(network, weightMap, *conv5->getOutput(0), 64, 3, 1, 1, 64, \"conv_6_dw\");\n    auto conv6 = convBnRelu(network, weightMap, *conv6dw->getOutput(0), 64, 1, 1, 0, 1, \"conv_6\");\n    // conv6 to c1\n    auto conv7dw = convBnRelu(network, weightMap, *conv6->getOutput(0), 64, 3, 2, 1, 64, \"conv_7_dw\");\n    auto conv7 = convBnRelu(network, weightMap, *conv7dw->getOutput(0), 128, 1, 1, 0, 1, \"conv_7\");\n    auto conv8dw = convBnRelu(network, weightMap, *conv7->getOutput(0), 128, 3, 1, 1, 128, \"conv_8_dw\");\n    auto conv8 = convBnRelu(network, weightMap, *conv8dw->getOutput(0), 128, 1, 1, 0, 1, \"conv_8\");\n    auto conv9dw = convBnRelu(network, weightMap, *conv8->getOutput(0), 128, 3, 1, 1, 128, \"conv_9_dw\");\n    auto conv9 = convBnRelu(network, weightMap, *conv9dw->getOutput(0), 128, 1, 1, 0, 1, \"conv_9\");\n    auto conv10dw = convBnRelu(network, weightMap, *conv9->getOutput(0), 128, 3, 1, 1, 128, \"conv_10_dw\");\n    auto conv10 = convBnRelu(network, weightMap, *conv10dw->getOutput(0), 128, 1, 1, 0, 1, \"conv_10\");\n    auto conv11dw = convBnRelu(network, weightMap, *conv10->getOutput(0), 128, 3, 1, 1, 128, \"conv_11_dw\");\n    auto conv11 = convBnRelu(network, weightMap, *conv11dw->getOutput(0), 128, 1, 1, 0, 1, \"conv_11\");\n    auto conv12dw = convBnRelu(network, weightMap, *conv11->getOutput(0), 128, 3, 1, 1, 128, \"conv_12_dw\");\n    auto conv12 = convBnRelu(network, weightMap, *conv12dw->getOutput(0), 128, 1, 1, 0, 1, \"conv_12\");\n    // conv12 to c2\n    auto conv13dw = convBnRelu(network, weightMap, *conv12->getOutput(0), 128, 3, 2, 1, 128, \"conv_13_dw\");\n    auto conv13 = convBnRelu(network, weightMap, *conv13dw->getOutput(0), 256, 1, 1, 0, 1, \"conv_13\");\n    auto conv14dw = convBnRelu(network, weightMap, *conv13->getOutput(0), 256, 3, 1, 1, 256, \"conv_14_dw\");\n    auto conv14 = 
convBnRelu(network, weightMap, *conv14dw->getOutput(0), 256, 1, 1, 0, 1, \"conv_14\");\n    auto conv_final = convBnRelu(network, weightMap, *conv14->getOutput(0), 256, 1, 1, 0, 1, \"conv_final\");\n    // convfinal to c3\n\n    auto rf_c3_lateral = convBiasBnRelu(network, weightMap, *conv_final->getOutput(0), 64, 1, 1, 0, \"rf_c3_lateral\");\n    auto rf_head_s32 = head(network, weightMap, *rf_c3_lateral->getOutput(0), \"rf_head_stride32\");\n    ILayer *cls_score_s32 = network->addConvolutionNd(*rf_head_s32->getOutput(0), 4, DimsHW{1, 1}, weightMap[\"face_rpn_cls_score_stride32_weight\"], weightMap[\"face_rpn_cls_score_stride32_bias\"]);\n    cls_score_s32 = reshapeSoftmax(network, *cls_score_s32->getOutput(0), 4);\n    auto bbox_s32 = network->addConvolutionNd(*rf_head_s32->getOutput(0), 8, DimsHW{1, 1}, weightMap[\"face_rpn_bbox_pred_stride32_weight\"], weightMap[\"face_rpn_bbox_pred_stride32_bias\"]);\n    auto landmark_s32 = network->addConvolutionNd(*rf_head_s32->getOutput(0), 20, DimsHW{1, 1}, weightMap[\"face_rpn_landmark_pred_stride32_weight\"], weightMap[\"face_rpn_landmark_pred_stride32_bias\"]);\n    auto rf_head2_s32 = head(network, weightMap, *rf_c3_lateral->getOutput(0), \"rf_head2_stride32\");\n    ILayer *type_score_s32 = network->addConvolutionNd(*rf_head2_s32->getOutput(0), 6, DimsHW{1, 1}, weightMap[\"face_rpn_type_score_stride32_weight\"], weightMap[\"face_rpn_type_score_stride32_bias\"]);\n    type_score_s32 = reshapeSoftmax(network, *type_score_s32->getOutput(0), 6);\n\n    float *deval = reinterpret_cast<float*>(malloc(sizeof(float) * 64 * 2 * 2));\n    for (int i = 0; i < 64 * 2 * 2; i++) {\n        deval[i] = 1.0;\n    }\n    Weights deconvwts{DataType::kFLOAT, deval, 64 * 2 * 2};\n    IDeconvolutionLayer* c3_deconv = network->addDeconvolutionNd(*rf_c3_lateral->getOutput(0), 64, DimsHW{2, 2}, deconvwts, emptywts);\n    assert(c3_deconv);\n    c3_deconv->setStrideNd(DimsHW{2, 2});\n    c3_deconv->setNbGroups(64);\n    
weightMap[\"c3_deconv\"] = deconvwts;\n    auto rf_c2_lateral = convBiasBnRelu(network, weightMap, *conv12->getOutput(0), 64, 1, 1, 0, \"rf_c2_lateral\");\n    auto plus0 = network->addElementWise(*c3_deconv->getOutput(0), *rf_c2_lateral->getOutput(0), ElementWiseOperation::kSUM);\n    auto rf_c2_aggr = convBiasBnRelu(network, weightMap, *plus0->getOutput(0), 64, 3, 1, 1, \"rf_c2_aggr\");\n    auto rf_head_s16 = head(network, weightMap, *rf_c2_aggr->getOutput(0), \"rf_head_stride16\");\n    ILayer *cls_score_s16 = network->addConvolutionNd(*rf_head_s16->getOutput(0), 4, DimsHW{1, 1}, weightMap[\"face_rpn_cls_score_stride16_weight\"], weightMap[\"face_rpn_cls_score_stride16_bias\"]);\n    cls_score_s16 = reshapeSoftmax(network, *cls_score_s16->getOutput(0), 4);\n    auto bbox_s16 = network->addConvolutionNd(*rf_head_s16->getOutput(0), 8, DimsHW{1, 1}, weightMap[\"face_rpn_bbox_pred_stride16_weight\"], weightMap[\"face_rpn_bbox_pred_stride16_bias\"]);\n    auto landmark_s16 = network->addConvolutionNd(*rf_head_s16->getOutput(0), 20, DimsHW{1, 1}, weightMap[\"face_rpn_landmark_pred_stride16_weight\"], weightMap[\"face_rpn_landmark_pred_stride16_bias\"]);\n    auto rf_head2_s16 = head(network, weightMap, *rf_c2_aggr->getOutput(0), \"rf_head2_stride16\");\n    ILayer *type_score_s16 = network->addConvolutionNd(*rf_head2_s16->getOutput(0), 6, DimsHW{1, 1}, weightMap[\"face_rpn_type_score_stride16_weight\"], weightMap[\"face_rpn_type_score_stride16_bias\"]);\n    type_score_s16 = reshapeSoftmax(network, *type_score_s16->getOutput(0), 6);\n\n    IDeconvolutionLayer* c2_deconv = network->addDeconvolutionNd(*rf_c2_aggr->getOutput(0), 64, DimsHW{2, 2}, deconvwts, emptywts);\n    assert(c2_deconv);\n    c2_deconv->setStrideNd(DimsHW{2, 2});\n    c2_deconv->setNbGroups(64);\n    auto rf_c1_red = convBiasBnRelu(network, weightMap, *conv6->getOutput(0), 64, 1, 1, 0, \"rf_c1_red_conv\");\n    auto plus1 = network->addElementWise(*c2_deconv->getOutput(0), *rf_c1_red->getOutput(0), 
ElementWiseOperation::kSUM);\n    auto rf_c1_aggr = convBiasBnRelu(network, weightMap, *plus1->getOutput(0), 64, 3, 1, 1, \"rf_c1_aggr\");\n    auto rf_head_s8 = head(network, weightMap, *rf_c1_aggr->getOutput(0), \"rf_head_stride8\");\n    ILayer *cls_score_s8 = network->addConvolutionNd(*rf_head_s8->getOutput(0), 4, DimsHW{1, 1}, weightMap[\"face_rpn_cls_score_stride8_weight\"], weightMap[\"face_rpn_cls_score_stride8_bias\"]);\n    cls_score_s8 = reshapeSoftmax(network, *cls_score_s8->getOutput(0), 4);\n    auto bbox_s8 = network->addConvolutionNd(*rf_head_s8->getOutput(0), 8, DimsHW{1, 1}, weightMap[\"face_rpn_bbox_pred_stride8_weight\"], weightMap[\"face_rpn_bbox_pred_stride8_bias\"]);\n    auto landmark_s8 = network->addConvolutionNd(*rf_head_s8->getOutput(0), 20, DimsHW{1, 1}, weightMap[\"face_rpn_landmark_pred_stride8_weight\"], weightMap[\"face_rpn_landmark_pred_stride8_bias\"]);\n    auto rf_head2_s8 = head(network, weightMap, *rf_c1_aggr->getOutput(0), \"rf_head2_stride8\");\n    ILayer *type_score_s8 = network->addConvolutionNd(*rf_head2_s8->getOutput(0), 6, DimsHW{1, 1}, weightMap[\"face_rpn_type_score_stride8_weight\"], weightMap[\"face_rpn_type_score_stride8_bias\"]);\n    type_score_s8 = reshapeSoftmax(network, *type_score_s8->getOutput(0), 6);\n\n    ITensor* inputTensors_s32[] = {cls_score_s32->getOutput(0), bbox_s32->getOutput(0), landmark_s32->getOutput(0), type_score_s32->getOutput(0)};\n    auto cat_s32 = network->addConcatenation(inputTensors_s32, 4);\n    assert(cat_s32);\n\n    ITensor* inputTensors_s16[] = {cls_score_s16->getOutput(0), bbox_s16->getOutput(0), landmark_s16->getOutput(0), type_score_s16->getOutput(0)};\n    auto cat_s16 = network->addConcatenation(inputTensors_s16, 4);\n    assert(cat_s16);\n\n    ITensor* inputTensors_s8[] = {cls_score_s8->getOutput(0), bbox_s8->getOutput(0), landmark_s8->getOutput(0), type_score_s8->getOutput(0)};\n    auto cat_s8 = network->addConcatenation(inputTensors_s8, 4);\n    assert(cat_s8);\n\n    
auto creator = getPluginRegistry()->getPluginCreator(\"Decode_TRT\", \"1\");\n    PluginFieldCollection pfc;\n    IPluginV2 *pluginObj = creator->createPlugin(\"decode\", &pfc);\n    ITensor* inputTensors[] = {cat_s8->getOutput(0), cat_s16->getOutput(0), cat_s32->getOutput(0)};\n    auto decodelayer = network->addPluginV2(inputTensors, 3, *pluginObj);\n    assert(decodelayer);\n\n    decodelayer->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*decodelayer->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#ifdef USE_FP16\n    config->setFlag(BuilderFlag::kFP16);\n#endif\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of 
buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n                strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    
closedir(p_dir);\n    return 0;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(BATCH_SIZE, &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(\"retinafaceAntiCov.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    } else if (argc == 2 && std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"retinafaceAntiCov.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        } else {\n            // Without a readable plan file there is nothing to deserialize\n            std::cerr << \"could not read plan file retinafaceAntiCov.engine\" << std::endl;\n            return -1;\n        }\n    } else {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./retinafaceAntiCov -s  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./retinafaceAntiCov -d  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // prepare input data ---------------------------\n    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n    //    data[i] = 1.0;\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context 
= engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    cv::Mat img = cv::imread(\"test.jpg\");\n    cv::Mat pr_img = preprocess_img(img);\n    for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n        data[i] = ((float)pr_img.at<cv::Vec3b>(i)[2] - 127.5) * 0.0078125;\n        data[i + INPUT_H * INPUT_W] = ((float)pr_img.at<cv::Vec3b>(i)[1] - 127.5) * 0.0078125;\n        data[i + 2 * INPUT_H * INPUT_W] = ((float)pr_img.at<cv::Vec3b>(i)[0] - 127.5) * 0.0078125;\n    }\n\n    // Run inference\n    auto start = std::chrono::system_clock::now();\n    doInference(*context, data, prob, BATCH_SIZE);\n    auto end = std::chrono::system_clock::now();\n    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n    std::vector<decodeplugin::Detection> res;\n    nms(res, prob);\n\n    for (size_t j = 0; j < res.size(); j++) {\n        //if (res[j].class_confidence < 0.1) continue;\n        cv::Rect r = get_rect_adapt_landmark(img, res[j].bbox, res[j].landmark);\n        cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n        cv::putText(img, \"face: \" + std::to_string((int)(res[j].class_confidence * 100)) + \"%\", cv::Point(r.x, r.y + 20), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 1);\n        for (int k = 0; k < 10; k += 2) {\n            cv::circle(img, cv::Point(res[j].landmark[k], res[j].landmark[k + 1]), 1, cv::Scalar(255 * (k > 2), 255 * (k > 0 && k < 8), 255 * (k < 6)), 4);\n        }\n        cv::putText(img, \"mask: \" + std::to_string((int)(res[j].mask_confidence * 100)) + \"%\", cv::Point(r.x, r.y + 40), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0x00, 0x00, 0xFF), 1);\n    }\n    cv::imwrite(\"out.jpg\", img);\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    //Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i 
< OUTPUT_SIZE; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "scaled-yolov4/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(yolov4)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\ncuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/yololayer.cu ${PROJECT_SOURCE_DIR}/mish.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(yolov4csp ${PROJECT_SOURCE_DIR}/yolov4_csp.cpp)\ntarget_link_libraries(yolov4csp nvinfer)\ntarget_link_libraries(yolov4csp cudart)\ntarget_link_libraries(yolov4csp myplugins)\ntarget_link_libraries(yolov4csp ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "scaled-yolov4/README.md",
    "content": "# scaled-yolov4\n\nThe Pytorch implementation is from [WongKinYiu/ScaledYOLOv4 yolov4-csp branch](https://github.com/WongKinYiu/ScaledYOLOv4/tree/yolov4-csp). It can load yolov4-csp.cfg and yolov4-csp.weights(from AlexeyAB/darknet).\n\nNote: There is a slight difference in yolov4-csp.cfg for darknet and pytorch. Use the one given in the above repo.\n\n## Config\n\n- Input shape `INPUT_H`, `INPUT_W` defined in yololayer.h\n- Number of classes `CLASS_NUM` defined in yololayer.h\n- FP16/FP32 can be selected by the macro `USE_FP16` in yolov4_csp.cpp\n- GPU id can be selected by the macro `DEVICE` in yolov4_csp.cpp\n- NMS thresh `NMS_THRESH` in yolov4_csp.cpp\n- bbox confidence threshold `BBOX_CONF_THRESH` in yolov4_csp.cpp\n- `BATCH_SIZE` in yolov4_csp.cpp\n\n## How to run\n\n1. generate yolov4_csp.wts from pytorch implementation with yolov4-csp.cfg and yolov4-csp.weights.\n\n```\ngit clone https://github.com/wang-xinyu/tensorrtx.git\ngit clone -b yolov4-csp https://github.com/WongKinYiu/ScaledYOLOv4.git\n// download yolov4-csp.weights from https://github.com/WongKinYiu/ScaledYOLOv4/tree/yolov4-csp#yolov4-csp\ncp {tensorrtx}/scaled-yolov4/gen_wts.py {ScaledYOLOv4/}\ncd {ScaledYOLOv4/}\npython gen_wts.py yolov4-csp.weights\n// a file 'yolov4_csp.wts' will be generated.\n```\n\n2. put yolov4_csp.wts into {tensorrtx}/scaled-yolov4, build and run\n\n```\nmv yolov4_csp.wts {tensorrtx}/scaled-yolov4/\ncd {tensorrtx}/scaled-yolov4\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./yolov4csp -s                          // serialize model to plan file i.e. 'yolov4csp.engine'\nsudo ./yolov4csp -d ../../yolov3-spp/samples // deserialize plan file and run inference, the images in samples will be processed.\n```\n\n3. check the images generated, as follows. 
_zidane.jpg and _bus.jpg\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/39617050/117172509-824cf980-ade9-11eb-8e4c-27dbe658e355.jpg\">\n</p>\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/39617050/117172880-dbb52880-ade9-11eb-839a-0814fd46198e.jpg\">\n</p>\n\n\n## More Information\n\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx).\n"
  },
  {
    "path": "scaled-yolov4/common.hpp",
    "content": "#include <fstream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <opencv2/opencv.hpp>\n\n#include \"NvInfer.h\"\n#include \"yololayer.h\"\n#include \"mish.h\"\n\n\nusing namespace nvinfer1;\n\ncv::Mat preprocess_img(cv::Mat& img) {\n    int w, h, x, y;\n    float r_w = Yolo::INPUT_W / (img.cols*1.0);\n    float r_h = Yolo::INPUT_H / (img.rows*1.0);\n    if (r_h > r_w) {\n        w = Yolo::INPUT_W;\n        h = r_w * img.rows;\n        x = 0;\n        y = (Yolo::INPUT_H - h) / 2;\n    } else {\n        w = r_h* img.cols;\n        h = Yolo::INPUT_H;\n        x = (Yolo::INPUT_W - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size());\n    cv::Mat out(Yolo::INPUT_H, Yolo::INPUT_W, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    int l, r, t, b;\n    float r_w = Yolo::INPUT_W / (img.cols * 1.0);\n    float r_h = Yolo::INPUT_H / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] - bbox[2]/2.f;\n        r = bbox[0] + bbox[2]/2.f;\n        t = bbox[1] - bbox[3]/2.f - (Yolo::INPUT_H - r_w * img.rows) / 2;\n        b = bbox[1] + bbox[3]/2.f - (Yolo::INPUT_H - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - bbox[2]/2.f - (Yolo::INPUT_W - r_h * img.cols) / 2;\n        r = bbox[0] + bbox[2]/2.f - (Yolo::INPUT_W - r_h * img.cols) / 2;\n        t = bbox[1] - bbox[3]/2.f;\n        b = bbox[1] + bbox[3]/2.f;\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    return cv::Rect(l, t, r-l, b-t);\n}\n\nfloat iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n        std::max(lbox[0] - lbox[2]/2.f , rbox[0] - rbox[2]/2.f), //left\n        std::min(lbox[0] + lbox[2]/2.f , rbox[0] + rbox[2]/2.f), //right\n        
std::max(lbox[1] - lbox[3]/2.f , rbox[1] - rbox[3]/2.f), //top\n        std::min(lbox[1] + lbox[3]/2.f , rbox[1] + rbox[3]/2.f), //bottom\n    };\n\n    if(interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS =(interBox[1]-interBox[0])*(interBox[3]-interBox[2]);\n    return interBoxS/(lbox[2]*lbox[3] + rbox[2]*rbox[3] -interBoxS);\n}\n\nbool cmp(const Yolo::Detection& a, const Yolo::Detection& b) {\n    return a.det_confidence > b.det_confidence;\n}\n\nvoid nms(std::vector<Yolo::Detection>& res, float *output, float conf_thresh, float nms_thresh = 0.5) {\n    int det_size = sizeof(Yolo::Detection) / sizeof(float);\n    std::map<float, std::vector<Yolo::Detection>> m;\n    for (int i = 0; i < output[0] && i < Yolo::MAX_OUTPUT_BBOX_COUNT; i++) {\n        if (output[1 + det_size * i + 4] <= conf_thresh) continue;\n        Yolo::Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        if (m.count(det.class_id) == 0) m.emplace(det.class_id, std::vector<Yolo::Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        //std::cout << it->second[0].class_id << \" --- \" << std::endl;\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin()+n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n 
   // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob (each value is one 32-bit word stored as hex)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, 
len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* convBnMish(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int p, int linx) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[\"module_list.\" + std::to_string(linx) + \".Conv2d.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"module_list.\" + std::to_string(linx) + \".BatchNorm2d\", 1e-4);\n\n    auto creator = getPluginRegistry()->getPluginCreator(\"Mish_TRT\", \"1\");\n    const PluginFieldCollection* pluginData = creator->getFieldNames();\n    IPluginV2 *pluginObj = creator->createPlugin((\"mish\" + std::to_string(linx)).c_str(), pluginData);\n    ITensor* inputTensors[] = {bn1->getOutput(0)};\n    auto mish = network->addPluginV2(&inputTensors[0], 1, *pluginObj);\n    return mish;\n}"
  },
  {
    "path": "scaled-yolov4/gen_wts.py",
    "content": "import struct\nimport sys\nfrom models.models import *\nfrom utils import *\n\nmodel = Darknet('models/yolov4-csp.cfg', (512, 512))\nweights = sys.argv[1]\ndevice = torch_utils.select_device('0')\nif weights.endswith('.pt'):  # pytorch format\n    model.load_state_dict(torch.load(weights, map_location=device)['model'])\nelse:  # darknet format\n    load_darknet_weights(model, weights)\n\nwith open('yolov4_csp.wts', 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f',float(vv)).hex())\n        f.write('\\n')\n\n"
  },
  {
    "path": "scaled-yolov4/logging.h",
    "content": "/*\n * Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n        , mPrefix(other.mPrefix)\n        , mShouldLog(other.mShouldLog)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer 
consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n            {\n                ss << \" \";\n            }\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//!         
(\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H"
  },
  {
    "path": "scaled-yolov4/mish.cu",
    "content": "#include <cmath>\n#include <stdio.h>\n#include <cassert>\n#include <iostream>\n#include \"mish.h\"\n\nnamespace nvinfer1\n{\n    MishPlugin::MishPlugin()\n    {\n    }\n\n    MishPlugin::~MishPlugin()\n    {\n    }\n\n    // create the plugin at runtime from a byte stream\n    MishPlugin::MishPlugin(const void* data, size_t length)\n    {\n        assert(length == sizeof(input_size_));\n        input_size_ = *reinterpret_cast<const int*>(data);\n    }\n\n    void MishPlugin::serialize(void* buffer) const\n    {\n        *reinterpret_cast<int*>(buffer) = input_size_;\n    }\n\n    size_t MishPlugin::getSerializationSize() const\n    {  \n        return sizeof(input_size_);\n    }\n\n    int MishPlugin::initialize()\n    { \n        return 0;\n    }\n\n    Dims MishPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims)\n    {\n        assert(nbInputDims == 1);\n        assert(index == 0);\n        input_size_ = inputs[0].d[0] * inputs[0].d[1] * inputs[0].d[2];\n        // Output dimensions\n        return Dims3(inputs[0].d[0], inputs[0].d[1], inputs[0].d[2]);\n    }\n\n    // Set plugin namespace\n    void MishPlugin::setPluginNamespace(const char* pluginNamespace)\n    {\n        mPluginNamespace = pluginNamespace;\n    }\n\n    const char* MishPlugin::getPluginNamespace() const\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType MishPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const\n    {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool MishPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const\n    {\n        return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool MishPlugin::canBroadcastInputAcrossBatch(int 
inputIndex) const\n    {\n        return false;\n    }\n\n    void MishPlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput)\n    {\n    }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void MishPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator)\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void MishPlugin::detachFromContext() {}\n\n    const char* MishPlugin::getPluginType() const\n    {\n        return \"Mish_TRT\";\n    }\n\n    const char* MishPlugin::getPluginVersion() const\n    {\n        return \"1\";\n    }\n\n    void MishPlugin::destroy()\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    IPluginV2IOExt* MishPlugin::clone() const\n    {\n        MishPlugin *p = new MishPlugin();\n        p->input_size_ = input_size_;\n        p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float tanh_activate_kernel(float x){return (2/(1 + expf(-2*x)) - 1);}\n\n    __device__ float softplus_kernel(float x, float threshold = 20) {\n        if (x > threshold) return x;                // too large\n        else if (x < -threshold) return expf(x);    // too small\n        return logf(expf(x) + 1);\n    }\n\n    __global__ void mish_kernel(const float *input, float *output, int num_elem) {\n\n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= num_elem) return;\n\n        //float t = exp(input[idx]);\n        //if (input[idx] > 20.0) {\n        //    t *= t;\n        //    output[idx] = (t - 1.0) / (t + 1.0);\n        //} else {\n        //    float tt = t * t;\n        //    output[idx] = (tt + 2.0 * t) / (tt + 2.0 * t + 2.0);\n        //}\n        //output[idx] *= input[idx];\n        output[idx] = input[idx] * tanh_activate_kernel(softplus_kernel(input[idx]));\n    }\n\n    
void MishPlugin::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize) {\n        int block_size = thread_count_;\n        int grid_size = (input_size_ * batchSize + block_size - 1) / block_size;\n        // launch on the stream TensorRT passes to enqueue(), not the default stream\n        mish_kernel<<<grid_size, block_size, 0, stream>>>(inputs[0], output, input_size_ * batchSize);\n    }\n\n    int MishPlugin::enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream)\n    {\n        //assert(batchSize == 1);\n        //GPU\n        //CUDA_CHECK(cudaStreamSynchronize(stream));\n        forwardGpu((const float *const *)inputs, (float*)outputs[0], stream, batchSize);\n        return 0;\n    }\n\n    PluginFieldCollection MishPluginCreator::mFC{};\n    std::vector<PluginField> MishPluginCreator::mPluginAttributes;\n\n    MishPluginCreator::MishPluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* MishPluginCreator::getPluginName() const\n    {\n            return \"Mish_TRT\";\n    }\n\n    const char* MishPluginCreator::getPluginVersion() const\n    {\n            return \"1\";\n    }\n\n    const PluginFieldCollection* MishPluginCreator::getFieldNames()\n    {\n            return &mFC;\n    }\n\n    IPluginV2IOExt* MishPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc)\n    {\n        MishPlugin* obj = new MishPlugin();\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* MishPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength)\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call MishPlugin::destroy()\n        MishPlugin* obj = new MishPlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n}\n\n"
  },
  {
    "path": "scaled-yolov4/mish.h",
    "content": "#ifndef TRTX_MISH_PLUGIN_H\n#define TRTX_MISH_PLUGIN_H\n\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n\nnamespace nvinfer1\n{\n    class MishPlugin: public IPluginV2IOExt\n    {\n        public:\n            explicit MishPlugin();\n            MishPlugin(const void* data, size_t length);\n\n            ~MishPlugin();\n\n            int getNbOutputs() const override\n            {\n                return 1;\n            }\n\n            Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override;\n\n            int initialize() override;\n\n            virtual void terminate() override {};\n\n            virtual size_t getWorkspaceSize(int maxBatchSize) const override { return 0;}\n\n            virtual int enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream) override;\n\n            virtual size_t getSerializationSize() const override;\n\n            virtual void serialize(void* buffer) const override;\n\n            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const override {\n                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n            }\n\n            const char* getPluginType() const override;\n\n            const char* getPluginVersion() const override;\n\n            void destroy() override;\n\n            IPluginV2IOExt* clone() const override;\n\n            void setPluginNamespace(const char* pluginNamespace) override;\n\n            const char* getPluginNamespace() const override;\n\n            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const override;\n\n            bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const override;\n\n            bool canBroadcastInputAcrossBatch(int inputIndex) const override;\n\n            void 
attachToContext(\n                    cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) override;\n\n            void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) override;\n\n            void detachFromContext() override;\n\n            int input_size_;\n        private:\n            void forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize = 1);\n            int thread_count_ = 256;\n            const char* mPluginNamespace;\n    };\n\n    class MishPluginCreator : public IPluginCreator\n    {\n        public:\n            MishPluginCreator();\n\n            ~MishPluginCreator() override = default;\n\n            const char* getPluginName() const override;\n\n            const char* getPluginVersion() const override;\n\n            const PluginFieldCollection* getFieldNames() override;\n\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) override;\n\n            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;\n\n            void setPluginNamespace(const char* libNamespace) override\n            {\n                mNamespace = libNamespace;\n            }\n\n            const char* getPluginNamespace() const override\n            {\n                return mNamespace.c_str();\n            }\n\n        private:\n            std::string mNamespace;\n            static PluginFieldCollection mFC;\n            static std::vector<PluginField> mPluginAttributes;\n    };\n    REGISTER_TENSORRT_PLUGIN(MishPluginCreator);\n};\n\n#endif  // TRTX_MISH_PLUGIN_H"
  },
  {
    "path": "scaled-yolov4/utils.h",
    "content": "#ifndef __TRT_UTILS_H_\n#define __TRT_UTILS_H_\n\n#include <iostream>\n#include <vector>\n#include <algorithm>\n#include <cassert>  // assert() is used by CUDA_CHECK below\n#include <cudnn.h>\n\n#ifndef CUDA_CHECK\n\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n\n#endif\n\nnamespace Tn\n{\n    template<typename T> \n    void write(char*& buffer, const T& val)\n    {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> \n    void read(const char*& buffer, T& val)\n    {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n}\n\n#endif"
  },
  {
    "path": "scaled-yolov4/yololayer.cu",
    "content": "#include <assert.h>\n#include \"yololayer.h\"\n#include \"utils.h\"\n\nusing namespace Yolo;\n\nnamespace nvinfer1\n{\n    YoloLayerPlugin::YoloLayerPlugin()\n    {\n        mClassCount = CLASS_NUM;\n        mYoloKernel.clear();\n        mYoloKernel.push_back(yolo1);\n        mYoloKernel.push_back(yolo2);\n        mYoloKernel.push_back(yolo3);\n\n        mKernelCount = mYoloKernel.size();\n\n        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n        size_t AnchorLen = sizeof(float)* CHECK_COUNT*2;\n        for(int ii = 0; ii < mKernelCount; ii ++)\n        {\n            CUDA_CHECK(cudaMalloc(&mAnchor[ii],AnchorLen));\n            const auto& yolo = mYoloKernel[ii];\n            CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n        }\n    }\n\n    YoloLayerPlugin::~YoloLayerPlugin()\n    {\n    }\n\n    // create the plugin at runtime from a byte stream\n    YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length)\n    {\n        using namespace Tn;\n        const char *d = reinterpret_cast<const char *>(data), *a = d;\n        read(d, mClassCount);\n        read(d, mThreadCount);\n        read(d, mKernelCount);\n        mYoloKernel.resize(mKernelCount);\n        auto kernelSize = mKernelCount*sizeof(YoloKernel);\n        memcpy(mYoloKernel.data(),d,kernelSize);\n        d += kernelSize;\n\n        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n        size_t AnchorLen = sizeof(float)* CHECK_COUNT*2;\n        for(int ii = 0; ii < mKernelCount; ii ++)\n        {\n            CUDA_CHECK(cudaMalloc(&mAnchor[ii],AnchorLen));\n            const auto& yolo = mYoloKernel[ii];\n            CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n        }\n\n        assert(d == a + length);\n    }\n\n    void YoloLayerPlugin::serialize(void* buffer) const\n    {\n        using namespace Tn;\n        char* d = 
static_cast<char*>(buffer), *a = d;\n        write(d, mClassCount);\n        write(d, mThreadCount);\n        write(d, mKernelCount);\n        auto kernelSize = mKernelCount*sizeof(YoloKernel);\n        memcpy(d,mYoloKernel.data(),kernelSize);\n        d += kernelSize;\n\n        assert(d == a + getSerializationSize());\n    }\n    \n    size_t YoloLayerPlugin::getSerializationSize() const\n    {  \n        return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount)  + sizeof(Yolo::YoloKernel) * mYoloKernel.size();\n    }\n\n    int YoloLayerPlugin::initialize()\n    { \n        return 0;\n    }\n    \n    Dims YoloLayerPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims)\n    {\n        //output the result to channel\n        int totalsize = MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / sizeof(float);\n\n        return Dims3(totalsize + 1, 1, 1);\n    }\n\n    // Set plugin namespace\n    void YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace)\n    {\n        mPluginNamespace = pluginNamespace;\n    }\n\n    const char* YoloLayerPlugin::getPluginNamespace() const\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const\n    {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const\n    {\n        return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const\n    {\n        return false;\n    }\n\n    void YoloLayerPlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput)\n    {\n   
 }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator)\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void YoloLayerPlugin::detachFromContext() {}\n\n    const char* YoloLayerPlugin::getPluginType() const\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloLayerPlugin::getPluginVersion() const\n    {\n        return \"1\";\n    }\n\n    void YoloLayerPlugin::destroy()\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    IPluginV2IOExt* YoloLayerPlugin::clone() const\n    {\n        YoloLayerPlugin *p = new YoloLayerPlugin();\n        p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float Logist(float data){ return 1./(1. + exp(-data)); };\n\n    __global__ void CalDetection(const float *input, float *output, int noElements, \n            int yoloWidth, int yoloHeight, const float anchors[CHECK_COUNT*2],int classes,int outputElem) {\n \n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= noElements) return;\n\n        int total_grid = yoloWidth * yoloHeight;\n        int bnIdx = idx / total_grid;\n        idx = idx - total_grid*bnIdx;\n        int info_len_i = 5 + classes;\n        const float* curInput = input + bnIdx * (info_len_i * total_grid * CHECK_COUNT);\n\n        for (int k = 0; k < 3; ++k) {\n            int class_id = 0;\n            float max_cls_prob = 0.0;\n            for (int i = 5; i < info_len_i; ++i) {\n                float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);\n                if (p > max_cls_prob) {\n                    max_cls_prob = p;\n                    class_id = i - 5;\n                }\n            }\n            float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 
4 * total_grid]);\n            if (max_cls_prob < IGNORE_THRESH || box_prob < IGNORE_THRESH) continue;\n\n            float *res_count = output + bnIdx*outputElem;\n            int count = (int)atomicAdd(res_count, 1);\n            if (count >= MAX_OUTPUT_BBOX_COUNT) return;\n            char* data = (char * )res_count + sizeof(float) + count*sizeof(Detection);\n            Detection* det =  (Detection*)(data);\n\n            int row = idx / yoloWidth;\n            int col = idx % yoloWidth;\n\n            //Location\n            det->bbox[0] = (col + (2 * (Logist(curInput[idx + k * info_len_i * total_grid + 0 * total_grid]))) - 0.5) * INPUT_W / yoloWidth;\n            det->bbox[1] = (row + (2 * (Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid]))) - 0.5) * INPUT_H / yoloHeight;\n            det->bbox[2] = (powf(2 * (Logist(curInput[idx + k * info_len_i * total_grid + 2 * total_grid])), 2)) * anchors[2*k];\n            det->bbox[3] = (powf(2 * (Logist(curInput[idx + k * info_len_i * total_grid + 3 * total_grid])), 2)) * anchors[2*k + 1];\n            det->det_confidence = box_prob;\n            det->class_id = class_id;\n            det->class_confidence = max_cls_prob;\n        }\n    }\n\n    void YoloLayerPlugin::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize) {\n\n        int outputElem = 1 + MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / sizeof(float);\n\n        for(int idx = 0 ; idx < batchSize; ++idx) {\n            CUDA_CHECK(cudaMemset(output + idx*outputElem, 0, sizeof(float)));\n        }\n        int numElem = 0;\n        for (unsigned int i = 0;i< mYoloKernel.size();++i)\n        {\n            const auto& yolo = mYoloKernel[i];\n            numElem = yolo.width*yolo.height*batchSize;\n            if (numElem < mThreadCount)\n                mThreadCount = numElem;\n            CalDetection<<< (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount>>>\n                
(inputs[i],output, numElem, yolo.width, yolo.height, (float *)mAnchor[i], mClassCount ,outputElem);\n        }\n\n    }\n\n\n    int YoloLayerPlugin::enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream)\n    {\n        //assert(batchSize == 1);\n        //GPU\n        //CUDA_CHECK(cudaStreamSynchronize(stream));\n        forwardGpu((const float *const *)inputs, (float*)outputs[0], stream, batchSize);\n\n        return 0;\n    }\n\n    PluginFieldCollection YoloPluginCreator::mFC{};\n    std::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\n    YoloPluginCreator::YoloPluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* YoloPluginCreator::getPluginName() const\n    {\n            return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloPluginCreator::getPluginVersion() const\n    {\n            return \"1\";\n    }\n\n    const PluginFieldCollection* YoloPluginCreator::getFieldNames()\n    {\n            return &mFC;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc)\n    {\n        YoloLayerPlugin* obj = new YoloLayerPlugin();\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength)\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call YoloLayerPlugin::destroy()\n        YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n}"
  },
  {
    "path": "scaled-yolov4/yololayer.h",
    "content": "#ifndef _YOLO_LAYER_H\n#define _YOLO_LAYER_H\n\n#include <iostream>\n#include <vector>\n#include \"NvInfer.h\"\n\nnamespace Yolo\n{\n    static constexpr int CHECK_COUNT = 3;\n    static constexpr float IGNORE_THRESH = 0.1f;\n    static constexpr int MAX_OUTPUT_BBOX_COUNT = 1000;\n    static constexpr int CLASS_NUM = 80;\n    static constexpr int INPUT_H = 512;\n    static constexpr int INPUT_W = 512;\n\n    struct YoloKernel\n    {\n        int width;\n        int height;\n        float anchors[CHECK_COUNT*2];\n    };\n\n    static constexpr YoloKernel yolo1 = {\n        INPUT_W / 8,\n        INPUT_H / 8,\n        {12,16, 19,36, 40,28}\n    };\n    static constexpr YoloKernel yolo2 = {\n        INPUT_W / 16,\n        INPUT_H / 16,\n        {36,75, 76,55, 72,146}\n    };\n    static constexpr YoloKernel yolo3 = {\n        INPUT_W / 32,\n        INPUT_H / 32,\n        {142,110, 192,243, 459,401}\n    };\n\n    static constexpr int LOCATIONS = 4;\n    struct alignas(float) Detection{\n        //x y w h\n        float bbox[LOCATIONS];\n        float det_confidence;\n        float class_id;\n        float class_confidence;\n    };\n}\n\n\nnamespace nvinfer1\n{\n    class YoloLayerPlugin: public IPluginV2IOExt\n    {\n        public:\n            explicit YoloLayerPlugin();\n            YoloLayerPlugin(const void* data, size_t length);\n\n            ~YoloLayerPlugin();\n\n            int getNbOutputs() const override\n            {\n                return 1;\n            }\n\n            Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override;\n\n            int initialize() override;\n\n            virtual void terminate() override {};\n\n            virtual size_t getWorkspaceSize(int maxBatchSize) const override { return 0;}\n\n            virtual int enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream) override;\n\n            virtual size_t getSerializationSize() const 
override;\n\n            virtual void serialize(void* buffer) const override;\n\n            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const override {\n                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n            }\n\n            const char* getPluginType() const override;\n\n            const char* getPluginVersion() const override;\n\n            void destroy() override;\n\n            IPluginV2IOExt* clone() const override;\n\n            void setPluginNamespace(const char* pluginNamespace) override;\n\n            const char* getPluginNamespace() const override;\n\n            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const override;\n\n            bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const override;\n\n            bool canBroadcastInputAcrossBatch(int inputIndex) const override;\n\n            void attachToContext(\n                    cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) override;\n\n            void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) override;\n\n            void detachFromContext() override;\n\n        private:\n            void forwardGpu(const float *const * inputs,float * output, cudaStream_t stream, int batchSize = 1);\n            int mClassCount;\n            int mKernelCount;\n            std::vector<Yolo::YoloKernel> mYoloKernel;\n            int mThreadCount = 256;\n            void** mAnchor;\n            const char* mPluginNamespace;\n    };\n\n    class YoloPluginCreator : public IPluginCreator\n    {\n        public:\n            YoloPluginCreator();\n\n            ~YoloPluginCreator() override = default;\n\n            const char* getPluginName() const override;\n\n            const char* getPluginVersion() 
const override;\n\n            const PluginFieldCollection* getFieldNames() override;\n\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) override;\n\n            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;\n\n            void setPluginNamespace(const char* libNamespace) override\n            {\n                mNamespace = libNamespace;\n            }\n\n            const char* getPluginNamespace() const override\n            {\n                return mNamespace.c_str();\n            }\n\n        private:\n            std::string mNamespace;\n            static PluginFieldCollection mFC;\n            static std::vector<PluginField> mPluginAttributes;\n    };\n    REGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n};\n\n#endif \n"
  },
  {
    "path": "scaled-yolov4/yolov4_csp.cpp",
    "content": "#include <iostream>\n#include <chrono>\n#include <dirent.h>\n\n#include \"logging.h\"\n#include \"utils.h\"\n#include \"cuda_runtime_api.h\"\n#include \"common.hpp\"\n\n#define USE_FP16  // comment out this if want to use FP32\n#define DEVICE 0  // GPU id\n#define NMS_THRESH 0.4\n#define BBOX_CONF_THRESH 0.5\n#define BATCH_SIZE 1\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = Yolo::INPUT_H;\nstatic const int INPUT_W = Yolo::INPUT_W;\nstatic const int DETECTION_SIZE = sizeof(Yolo::Detection) / sizeof(float);\nstatic const int OUTPUT_SIZE = Yolo::MAX_OUTPUT_BBOX_COUNT * DETECTION_SIZE + 1;  // we assume the yololayer outputs no more than MAX_OUTPUT_BBOX_COUNT boxes that conf >= 0.1\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nstatic Logger gLogger;\n\n\n// Creat the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder -> createNetworkV2(0U);\n\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\n    ITensor* data = network -> addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../yolov4_csp.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    // define yolov4 csp layers\n    auto l0 = convBnMish(network, weightMap, *data, 32, 3, 1, 1, 0);\n    auto l1 = convBnMish(network, weightMap, *l0 -> getOutput(0), 64, 3, 2, 1, 1);\n    auto l2 = convBnMish(network, weightMap, *l1 -> getOutput(0), 32, 1, 1, 0, 2);\n    auto l3 = convBnMish(network, weightMap, *l2 -> getOutput(0), 64, 3, 1, 1, 3);\n    auto ew4 = network -> addElementWise(*l3 -> getOutput(0), *l1 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l5 = convBnMish(network, weightMap, *ew4 -> getOutput(0), 128, 3, 2, 1, 5);\n    
auto l6 = convBnMish(network, weightMap, *l5 -> getOutput(0), 64, 1, 1, 0, 6);\n    auto l7 = l5;\n    auto l8 = convBnMish(network, weightMap, *l7 -> getOutput(0), 64, 1, 1, 0, 8);\n    auto l9 = convBnMish(network, weightMap, *l8 -> getOutput(0), 64, 1, 1, 0, 9);\n    auto l10 = convBnMish(network, weightMap, *l9 -> getOutput(0), 64, 3, 1, 1, 10);\n    auto ew11 = network -> addElementWise(*l10 -> getOutput(0), *l8 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l12 = convBnMish(network, weightMap, *ew11 -> getOutput(0), 64, 1, 1, 0, 12);\n    auto l13 = convBnMish(network, weightMap, *l12 -> getOutput(0), 64, 3, 1, 1, 13);\n    auto ew14 = network -> addElementWise(*l13 -> getOutput(0), *ew11 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l15 = convBnMish(network, weightMap, *ew14 -> getOutput(0), 64, 1, 1, 0, 15);\n\n    ITensor* inputTensors16[] = {l15 -> getOutput(0), l6 -> getOutput(0)};\n    auto cat16 = network -> addConcatenation(inputTensors16, 2);\n\n    auto l17 = convBnMish(network, weightMap, *cat16 -> getOutput(0), 128, 1, 1, 0, 17);\n    auto l18 = convBnMish(network, weightMap, *l17 -> getOutput(0), 256, 3, 2, 1, 18);\n    auto l19 = convBnMish(network, weightMap, *l18 -> getOutput(0), 128, 1, 1, 0, 19);\n    auto l20 = l18;\n    auto l21 = convBnMish(network, weightMap, *l20 -> getOutput(0), 128, 1, 1, 0, 21);\n    auto l22 = convBnMish(network, weightMap, *l21 -> getOutput(0), 128, 1, 1, 0, 22);\n    auto l23 = convBnMish(network, weightMap, *l22 -> getOutput(0), 128, 3, 1, 1, 23);\n    auto ew24 = network -> addElementWise(*l23 -> getOutput(0), *l21 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l25 = convBnMish(network, weightMap, *ew24 -> getOutput(0), 128, 1, 1, 0, 25);\n    auto l26 = convBnMish(network, weightMap, *l25 -> getOutput(0), 128, 3, 1, 1, 26);\n    auto ew27 = network -> addElementWise(*l26 -> getOutput(0), *ew24 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l28 = convBnMish(network, 
weightMap, *ew27 -> getOutput(0), 128, 1, 1, 0, 28);\n    auto l29 = convBnMish(network, weightMap, *l28 -> getOutput(0), 128, 3, 1, 1, 29);\n    auto ew30 = network -> addElementWise(*l29 -> getOutput(0), *ew27 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l31 = convBnMish(network, weightMap, *ew30 -> getOutput(0), 128, 1, 1, 0, 31);\n    auto l32 = convBnMish(network, weightMap, *l31 -> getOutput(0), 128, 3, 1, 1, 32);\n    auto ew33 = network -> addElementWise(*l32 -> getOutput(0), *ew30 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l34 = convBnMish(network, weightMap, *ew33 -> getOutput(0), 128, 1, 1, 0, 34);\n    auto l35 = convBnMish(network, weightMap, *l34 -> getOutput(0), 128, 3, 1, 1, 35);\n    auto ew36 = network -> addElementWise(*l35 -> getOutput(0), *ew33 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l37 = convBnMish(network, weightMap, *ew36 -> getOutput(0), 128, 1, 1, 0, 37);\n    auto l38 = convBnMish(network, weightMap, *l37 -> getOutput(0), 128, 3, 1, 1, 38);\n    auto ew39 = network -> addElementWise(*l38 -> getOutput(0), *ew36 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l40 = convBnMish(network, weightMap, *ew39 -> getOutput(0), 128, 1, 1, 0, 40);\n    auto l41 = convBnMish(network, weightMap, *l40 -> getOutput(0), 128, 3, 1, 1, 41);\n    auto ew42 = network -> addElementWise(*l41 -> getOutput(0), *ew39 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l43 = convBnMish(network, weightMap, *ew42 -> getOutput(0), 128, 1, 1, 0, 43);\n    auto l44 = convBnMish(network, weightMap, *l43 -> getOutput(0), 128, 3, 1, 1, 44);\n    auto ew45 = network -> addElementWise(*l44 -> getOutput(0), *ew42 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l46 = convBnMish(network, weightMap, *ew45 -> getOutput(0), 128, 1, 1, 0, 46);\n\n    ITensor* inputTensors47[] = {l46 -> getOutput(0), l19 -> getOutput(0)};\n    auto cat47 = network -> addConcatenation(inputTensors47, 2);\n\n    auto l48 = convBnMish(network, 
weightMap, *cat47 -> getOutput(0), 256, 1, 1, 0, 48);\n    auto l49 = convBnMish(network, weightMap, *l48 -> getOutput(0), 512, 3, 2, 1, 49);\n    auto l50 = convBnMish(network, weightMap, *l49 -> getOutput(0), 256, 1, 1, 0, 50);\n    auto l51 = l49;\n    auto l52 = convBnMish(network, weightMap, *l51 -> getOutput(0), 256, 1, 1, 0, 52);\n    auto l53 = convBnMish(network, weightMap, *l52 -> getOutput(0), 256, 1, 1, 0, 53);\n    auto l54 = convBnMish(network, weightMap, *l53 -> getOutput(0), 256, 3, 1, 1, 54);\n    auto ew55 = network -> addElementWise(*l54 -> getOutput(0), *l52 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l56 = convBnMish(network, weightMap, *ew55 -> getOutput(0), 256, 1, 1, 0, 56);\n    auto l57 = convBnMish(network, weightMap, *l56 -> getOutput(0), 256, 3, 1, 1, 57);\n    auto ew58 = network -> addElementWise(*l57 -> getOutput(0), *ew55 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l59 = convBnMish(network, weightMap, *ew58 -> getOutput(0), 256, 1, 1, 0, 59);\n    auto l60 = convBnMish(network, weightMap, *l59 -> getOutput(0), 256, 3, 1, 1, 60);\n    auto ew61 = network -> addElementWise(*l60 -> getOutput(0), *ew58 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l62 = convBnMish(network, weightMap, *ew61 -> getOutput(0), 256, 1, 1, 0, 62);\n    auto l63 = convBnMish(network, weightMap, *l62 -> getOutput(0), 256, 3, 1, 1, 63);\n    auto ew64 = network -> addElementWise(*l63 -> getOutput(0), *ew61 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l65 = convBnMish(network, weightMap, *ew64 -> getOutput(0), 256, 1, 1, 0, 65);\n    auto l66 = convBnMish(network, weightMap, *l65 -> getOutput(0), 256, 3, 1, 1, 66);\n    auto ew67 = network -> addElementWise(*l66 -> getOutput(0), *ew64 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l68 = convBnMish(network, weightMap, *ew67 -> getOutput(0), 256, 1, 1, 0, 68);\n    auto l69 = convBnMish(network, weightMap, *l68 -> getOutput(0), 256, 3, 1, 1, 69);\n    auto ew70 
= network -> addElementWise(*l69 -> getOutput(0), *ew67 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l71 = convBnMish(network, weightMap, *ew70 -> getOutput(0), 256, 1, 1, 0, 71);\n    auto l72 = convBnMish(network, weightMap, *l71 -> getOutput(0), 256, 3, 1, 1, 72);\n    auto ew73 = network -> addElementWise(*l72 -> getOutput(0), *ew70 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l74 = convBnMish(network, weightMap, *ew73 -> getOutput(0), 256, 1, 1, 0, 74);\n    auto l75 = convBnMish(network, weightMap, *l74 -> getOutput(0), 256, 3, 1, 1, 75);\n    auto ew76 = network -> addElementWise(*l75 -> getOutput(0), *ew73 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l77 = convBnMish(network, weightMap, *ew76 -> getOutput(0), 256, 1, 1, 0, 77);\n\n    ITensor* inputTensors78[] = {l77 -> getOutput(0), l50 -> getOutput(0)};\n    auto cat78 = network -> addConcatenation(inputTensors78, 2);\n\n    auto l79 = convBnMish(network, weightMap, *cat78 -> getOutput(0), 512, 1, 1, 0, 79);\n    auto l80 = convBnMish(network, weightMap, *l79 -> getOutput(0), 1024, 3, 2, 1, 80);\n    auto l81 = convBnMish(network, weightMap, *l80 -> getOutput(0), 512, 1, 1, 0, 81);\n    auto l82 = l80;\n    auto l83 = convBnMish(network, weightMap, *l82 -> getOutput(0), 512, 1, 1, 0, 83);\n    auto l84 = convBnMish(network, weightMap, *l83 -> getOutput(0), 512, 1, 1, 0, 84);\n    auto l85 = convBnMish(network, weightMap, *l84 -> getOutput(0), 512, 3, 1, 1, 85);\n    auto ew86 = network -> addElementWise(*l85 -> getOutput(0), *l83 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l87 = convBnMish(network, weightMap, *ew86 -> getOutput(0), 512, 1, 1, 0, 87);\n    auto l88 = convBnMish(network, weightMap, *l87 -> getOutput(0), 512, 3, 1, 1, 88);\n    auto ew89 = network -> addElementWise(*l88 -> getOutput(0), *ew86 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l90 = convBnMish(network, weightMap, *ew89 -> getOutput(0), 512, 1, 1, 0, 90);\n    auto l91 = 
convBnMish(network, weightMap, *l90 -> getOutput(0), 512, 3, 1, 1, 91);\n    auto ew92 = network -> addElementWise(*l91 -> getOutput(0), *ew89 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l93 = convBnMish(network, weightMap, *ew92 -> getOutput(0), 512, 1, 1, 0, 93);\n    auto l94 = convBnMish(network, weightMap, *l93 -> getOutput(0), 512, 3, 1, 1, 94);\n    auto ew95 = network -> addElementWise(*l94 -> getOutput(0), *ew92 -> getOutput(0), ElementWiseOperation::kSUM);\n    auto l96 = convBnMish(network, weightMap, *ew95 -> getOutput(0), 512, 1, 1, 0, 96);\n\n    ITensor* inputTensors97[] = {l96 -> getOutput(0), l81 -> getOutput(0)};\n    \n    auto cat97 = network -> addConcatenation(inputTensors97, 2);\n\n    auto l98 = convBnMish(network, weightMap, *cat97 -> getOutput(0), 1024, 1, 1, 0, 98);\n\n    // ----\n    auto l99 = convBnMish(network, weightMap, *l98 -> getOutput(0), 512, 1, 1, 0, 99);\n    auto l100 = l98;\n    auto l101 = convBnMish(network, weightMap, *l100 -> getOutput(0), 512, 1, 1, 0, 101);\n    auto l102 = convBnMish(network, weightMap, *l101 -> getOutput(0), 512, 3, 1, 1, 102);\n    auto l103 = convBnMish(network, weightMap, *l102 -> getOutput(0), 512, 1, 1, 0, 103);\n\n    auto pool104 = network -> addPoolingNd(*l103 -> getOutput(0), PoolingType::kMAX, DimsHW{5, 5});\n    pool104 -> setPaddingNd(DimsHW{2, 2});\n    pool104 -> setStrideNd(DimsHW{1, 1});\n\n    auto l105 = l103;\n\n    auto pool106 = network -> addPoolingNd(*l105 -> getOutput(0), PoolingType::kMAX, DimsHW{9, 9});\n    pool106 -> setPaddingNd(DimsHW{4, 4});\n    pool106 -> setStrideNd(DimsHW{1, 1});\n\n    auto l107 = l103;\n\n    auto pool108 = network -> addPoolingNd(*l107 -> getOutput(0), PoolingType::kMAX, DimsHW{13, 13});\n    pool108 -> setPaddingNd(DimsHW{6, 6});\n    pool108 -> setStrideNd(DimsHW{1, 1});\n\n    ITensor* inputTensors109[] = {pool108 -> getOutput(0), pool106 -> getOutput(0), pool104 -> getOutput(0), l103 -> getOutput(0)};\n    auto cat109 = network 
-> addConcatenation(inputTensors109, 4);\n\n    // ---- end spp\n\n    auto l110 = convBnMish(network, weightMap, *cat109 -> getOutput(0), 512, 1, 1, 0, 110);\n    auto l111 = convBnMish(network, weightMap, *l110 -> getOutput(0), 512, 3, 1, 1, 111);\n\n    ITensor* inputTensors112[] =  { l111 -> getOutput(0), l99 -> getOutput(0) };\n    auto cat112 = network -> addConcatenation(inputTensors112, 2);\n\n    auto l113 = convBnMish(network, weightMap, *cat112 -> getOutput(0), 512, 1, 1, 0, 113);\n    auto l114 = convBnMish(network, weightMap, *l113 -> getOutput(0), 256, 1, 1, 0, 114);\n\n    float *deval = reinterpret_cast<float*>(malloc(sizeof(float) * 256 * 2 * 2));\n    for (int i = 0; i < 256 * 2 * 2; i++) {\n        deval[i] = 1.0;\n    }\n    Weights upsamplewts115{DataType::kFLOAT, deval, 256 * 2 * 2};\n    IDeconvolutionLayer* upsample115 = network -> addDeconvolutionNd(*l114 -> getOutput(0), 256, DimsHW{2, 2}, upsamplewts115, emptywts);\n    assert(upsample115);\n    upsample115 -> setStrideNd(DimsHW{2, 2});\n    upsample115 -> setNbGroups(256);\n    weightMap[\"upsample115\"] = upsamplewts115;\n\n    auto l116 = l79;\n    auto l117 = convBnMish(network, weightMap, *l116 -> getOutput(0), 256, 1, 1, 0, 117);\n\n    ITensor* inputTensors118[] = {l117 -> getOutput(0), upsample115 -> getOutput(0)};\n    auto cat118 = network -> addConcatenation(inputTensors118, 2);\n\n    auto l119 = convBnMish(network, weightMap, *cat118 -> getOutput(0), 256, 1, 1, 0, 119);\n    auto l120 = convBnMish(network, weightMap, *l119 -> getOutput(0), 256, 1, 1, 0, 120);\n    auto l121 = l119;\n    auto l122 = convBnMish(network, weightMap, *l121 -> getOutput(0), 256, 1, 1, 0, 122);\n    auto l123 = convBnMish(network, weightMap, *l122 -> getOutput(0), 256, 3, 1, 1, 123);\n    auto l124 = convBnMish(network, weightMap, *l123 -> getOutput(0), 256, 1, 1, 0, 124);\n    auto l125 = convBnMish(network, weightMap, *l124 -> getOutput(0), 256, 3, 1, 1, 125);\n    \n    ITensor* inputTensors126[] 
= {l125 -> getOutput(0), l120 -> getOutput(0)};\n    auto cat126 = network -> addConcatenation(inputTensors126, 2);\n\n    auto l127 = convBnMish(network, weightMap, *cat126 -> getOutput(0), 256, 1, 1, 0, 127);\n    auto l128 = convBnMish(network, weightMap, *l127 -> getOutput(0), 128, 1, 1, 0, 128);\n    \n    Weights upsamplewts129{DataType::kFLOAT, deval, 128 * 2 * 2};\n    IDeconvolutionLayer* upsample129 = network -> addDeconvolutionNd(*l128 -> getOutput(0), 128, DimsHW{2, 2}, upsamplewts129, emptywts);\n    assert(upsample129);\n    upsample129 -> setStrideNd(DimsHW{2, 2});\n    upsample129 -> setNbGroups(128);\n\n    auto l130 = l48;\n    auto l131 = convBnMish(network, weightMap, *l130 -> getOutput(0), 128, 1, 1, 0, 131);\n\n    ITensor* inputTensors132[] = {l131 -> getOutput(0), upsample129 -> getOutput(0)};\n    auto cat132 = network -> addConcatenation(inputTensors132, 2);\n\n    auto l133 = convBnMish(network, weightMap, *cat132 -> getOutput(0), 128, 1, 1, 0, 133);\n    auto l134 = convBnMish(network, weightMap, *l133 -> getOutput(0), 128, 1, 1, 0, 134);\n    auto l135 = l133;\n    auto l136 = convBnMish(network, weightMap, *l135 -> getOutput(0), 128, 1, 1, 0, 136);\n    auto l137 = convBnMish(network, weightMap, *l136 -> getOutput(0), 128, 3, 1, 1, 137);\n    auto l138 = convBnMish(network, weightMap, *l137 -> getOutput(0), 128, 1, 1, 0, 138);\n    auto l139 = convBnMish(network, weightMap, *l138 -> getOutput(0), 128, 3, 1, 1, 139);\n\n    ITensor* inputTensors140[] = {l139 -> getOutput(0), l134 -> getOutput(0)};\n    auto cat140 = network -> addConcatenation(inputTensors140, 2);\n\n    auto l141 = convBnMish(network, weightMap, *cat140 -> getOutput(0), 128, 1, 1, 0, 141);\n\n    // ---\n    auto l142 = convBnMish(network, weightMap, *l141 -> getOutput(0), 256, 3, 1, 1, 142);\n    IConvolutionLayer* conv143 = network -> addConvolutionNd(*l142 -> getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.143.Conv2d.weight\"], 
weightMap[\"module_list.143.Conv2d.bias\"]);\n    assert(conv143);\n\n    // 144 is yolo layer\n    auto l145 = l141;\n    auto l146 = convBnMish(network, weightMap, *l145 -> getOutput(0), 256, 3, 2, 1, 146);\n\n    ITensor* inputTensors147[] = {l146 -> getOutput(0), l127 -> getOutput(0)};\n    auto cat147 = network -> addConcatenation(inputTensors147, 2);\n\n    auto l148 = convBnMish(network, weightMap, *cat147 -> getOutput(0), 256, 1, 1, 0, 148);\n    auto l149 = convBnMish(network, weightMap, *l148 -> getOutput(0), 256, 1, 1, 0, 149);\n    auto l150 = l148;\n    auto l151 = convBnMish(network, weightMap, *l150 -> getOutput(0), 256, 1, 1, 0, 151);\n    auto l152 = convBnMish(network, weightMap, *l151 -> getOutput(0), 256, 3, 1, 1, 152);\n    auto l153 = convBnMish(network, weightMap, *l152 -> getOutput(0), 256, 1, 1, 0, 153);\n    auto l154 = convBnMish(network, weightMap, *l153 -> getOutput(0), 256, 3, 1, 1, 154);\n\n    ITensor* inputTensors155[] = {l154 -> getOutput(0), l149 -> getOutput(0)};\n    auto cat155 = network -> addConcatenation(inputTensors155, 2);\n\n    auto l156 = convBnMish(network, weightMap, *cat155 -> getOutput(0), 256, 1, 1, 0, 156);\n    auto l157 = convBnMish(network, weightMap, *l156 -> getOutput(0), 512, 3, 1, 1, 157);   \n    IConvolutionLayer* conv158 = network -> addConvolutionNd(*l157 -> getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.158.Conv2d.weight\"], weightMap[\"module_list.158.Conv2d.bias\"]);\n    assert(conv158);\n    // 159 is yolo layer\n\n    auto l160 = l156;\n    auto l161 = convBnMish(network, weightMap, *l160 -> getOutput(0), 512, 3, 2, 1, 161);\n\n    ITensor* inputTensors162[] = {l161 -> getOutput(0), l113 -> getOutput(0)};\n    auto cat162 = network -> addConcatenation(inputTensors162, 2);\n\n    auto l163 = convBnMish(network, weightMap, *cat162 -> getOutput(0), 512, 1, 1, 0, 163); \n    auto l164 = convBnMish(network, weightMap, *l163 -> getOutput(0), 512, 1, 1, 0, 164); \n    auto 
l165 = l163;\n    auto l166 = convBnMish(network, weightMap, *l165 -> getOutput(0), 512, 1, 1, 0, 166); \n    auto l167 = convBnMish(network, weightMap, *l166 -> getOutput(0), 512, 3, 1, 1, 167);\n    auto l168 = convBnMish(network, weightMap, *l167 -> getOutput(0), 512, 1, 1, 0, 168);\n    auto l169 = convBnMish(network, weightMap, *l168 -> getOutput(0), 512, 3, 1, 1, 169);\n\n    ITensor* inputTensors170[] = {l169 -> getOutput(0), l164 -> getOutput(0)};\n    auto cat170 = network -> addConcatenation(inputTensors170, 2);\n\n    auto l171 = convBnMish(network, weightMap, *cat170 -> getOutput(0), 512, 1, 1, 0, 171);\n    auto l172 = convBnMish(network, weightMap, *l171 -> getOutput(0), 1024, 3, 1, 1, 172);\n\n    IConvolutionLayer* conv173 = network -> addConvolutionNd(*l172 -> getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.173.Conv2d.weight\"], weightMap[\"module_list.173.Conv2d.bias\"]);\n    assert(conv173);\n    // 174 is yolo layer\n\n    // add yolo plugin\n    auto creator = getPluginRegistry() -> getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    const PluginFieldCollection* pluginData = creator -> getFieldNames();\n    IPluginV2* pluginObj = creator -> createPlugin(\"yololayer\", pluginData);\n    ITensor* inputTensorsYolo[] = {conv143 -> getOutput(0), conv158 -> getOutput(0), conv173 -> getOutput(0)};\n    auto yolo = network -> addPluginV2(inputTensorsYolo, 3, *pluginObj);\n\n    yolo -> getOutput(0) -> setName(OUTPUT_BLOB_NAME);\n    network -> markOutput(*yolo -> getOutput(0));\n\n    // Build engine\n    builder -> setMaxBatchSize(maxBatchSize);\n    config -> setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#ifdef USE_FP16\n    config -> setFlag(BuilderFlag::kFP16);\n#endif\n    std::cout << \"Building tensorrt engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder -> buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need 
the network any more\n    network -> destroy();\n\n    \n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n\n    // create builder config\n    IBuilderConfig* config = builder -> createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // serialize the trt engine\n    (*modelStream) = engine -> serialize();\n    \n    // Close everything down\n    engine -> destroy();\n    builder -> destroy();\n    config -> destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CUDA_CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    
CUDA_CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(buffers[inputIndex]));\n    CUDA_CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint read_files_in_dir(const char* p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file -> d_name, \".\") != 0 &&\n            strcmp(p_file -> d_name, \"..\") != 0) {\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n    closedir(p_dir);\n    return 0;\n}\n\nint main(int argc, char** argv){\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(BATCH_SIZE, &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(\"yolov4csp.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    } else if (argc == 3 && std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"yolov4csp.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, 
file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./yolov4 -s  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./yolov4 -d ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(argv[2], file_names) < 0) {\n        std::cout << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // prepare input data ---------------------------\n    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n    //    data[i] = 1.0;\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    int fcount = 0;\n    for (int f = 0; f < (int)file_names.size(); f++) {\n        fcount++;\n        if (fcount < BATCH_SIZE && f + 1 != (int)file_names.size()) continue;\n        for (int b = 0; b < fcount; b++) {\n            cv::Mat img = cv::imread(std::string(argv[2]) + \"/\" + file_names[f - fcount + 1 + b]);\n            if (img.empty()) continue;\n            cv::Mat pr_img = preprocess_img(img);\n            for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n                data[b * 3 * INPUT_H * INPUT_W + i] = pr_img.at<cv::Vec3b>(i)[2] / 255.0;\n                data[b * 3 * INPUT_H * INPUT_W + i + INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[1] / 255.0;\n         
       data[b * 3 * INPUT_H * INPUT_W + i + 2 * INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[0] / 255.0;\n            }\n        }\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, BATCH_SIZE);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n        std::vector<std::vector<Yolo::Detection>> batch_res(fcount);\n        for (int b = 0; b < fcount; b++) {\n            auto& res = batch_res[b];\n            nms(res, &prob[b * OUTPUT_SIZE], BBOX_CONF_THRESH, NMS_THRESH);\n        }\n        for (int b = 0; b < fcount; b++) {\n            auto& res = batch_res[b];\n            //std::cout << res.size() << std::endl;\n            cv::Mat img = cv::imread(std::string(argv[2]) + \"/\" + file_names[f - fcount + 1 + b]);\n            for (size_t j = 0; j < res.size(); j++) {\n                float *p = (float*)&res[j];\n                for (size_t k = 0; k < 7; k++) {\n                   std::cout << p[k] << \", \";\n                }\n                std::cout << std::endl;\n                cv::Rect r = get_rect(img, res[j].bbox);\n                cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n                cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n            }\n            cv::imwrite(\"_\" + file_names[f - fcount + 1 + b], img);\n        }\n        fcount = 0;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    //Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    //}\n    //std::cout << 
std::endl;\n\n    return 0;\n}"
  },
  {
    "path": "senet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(senet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nadd_executable(se_resnet ${PROJECT_SOURCE_DIR}/se_resnet50.cpp)\ntarget_link_libraries(se_resnet nvinfer)\ntarget_link_libraries(se_resnet cudart)\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "senet/README.md",
    "content": "# SENet\n\nAn implementation of SENet, proposed in Squeeze-and-Excitation Networks by Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Enhua Wu\n\n[https://arxiv.org/abs/1709.01507](https://arxiv.org/abs/1709.01507)\n\nFor the Pytorch implementation, you can refer to [wang-xinyu/senet.pytorch](https://github.com/wang-xinyu/senet.pytorch), which is forked from [moskomule/senet.pytorch](https://github.com/moskomule/senet.pytorch).\n\n\n```\n// 1. generate se_resnet50.wts from [wang-xinyu/senet.pytorch](https://github.com/wang-xinyu/senet.pytorch)\n\n// 2. put se_resnet50.wts into tensorrtx/senet\n\n// 3. build and run\n\ncd tensorrtx/senet\n\nmkdir build\n\ncd build\n\ncmake ..\n\nmake\n\nsudo ./se_resnet -s   // serialize model to plan file i.e. 'se_resnet50.engine'\n\nsudo ./se_resnet -d   // deserialize plan file and run inference\n\n// 4. see if the output is same as [wang-xinyu/senet.pytorch]\n```\n\n"
  },
  {
    "path": "senet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "senet/se_resnet50.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <cmath>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* 
addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n    std::cout << \"len \" << len << std::endl;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* seLayer(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c, int w, std::string lname) {\n    IPoolingLayer* l1 = network->addPoolingNd(input, PoolingType::kAVERAGE, DimsHW(w, w));\n    assert(l1);\n    l1->setStrideNd(DimsHW{w, w});\n    IFullyConnectedLayer* l2 = network->addFullyConnected(*l1->getOutput(0), c / 16, weightMap[lname + \"fc.0.weight\"], weightMap[lname+\"fc.0.bias\"]);\n    IActivationLayer* relu1 = network->addActivation(*l2->getOutput(0), ActivationType::kRELU);\n    
IFullyConnectedLayer* l4 = network->addFullyConnected(*relu1->getOutput(0), c, weightMap[lname+\"fc.2.weight\"],weightMap[lname+\"fc.2.bias\"]);\n    IActivationLayer* l5 = network->addActivation(*l4->getOutput(0), ActivationType::kSIGMOID);\n    ILayer* se = network->addElementWise(input, *l5->getOutput(0), ElementWiseOperation::kPROD);\n    assert(se);\n    return se;\n}\n\nIActivationLayer* bottleneck(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname, int w) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{1, 1}, weightMap[lname + \"conv1.weight\"], emptywts);\n    assert(conv1);\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"bn1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{3, 3}, weightMap[lname + \"conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStrideNd(DimsHW{stride, stride});\n    conv2->setPaddingNd(DimsHW{1, 1});\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n    IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    IConvolutionLayer* conv3 = network->addConvolutionNd(*relu2->getOutput(0), outch * 4, DimsHW{1, 1}, weightMap[lname + \"conv3.weight\"], emptywts);\n    assert(conv3);\n\n    IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"bn3\", 1e-5);\n\n    ILayer *se = seLayer(network, weightMap, *bn3->getOutput(0), outch * 4, w, lname + \"se.\");\n\n    IElementWiseLayer* ew1;\n    if (stride != 1 || inch != outch * 4) {\n        IConvolutionLayer* conv4 = 
network->addConvolutionNd(input, outch * 4, DimsHW{1, 1}, weightMap[lname + \"downsample.0.weight\"], emptywts);\n        assert(conv4);\n        conv4->setStrideNd(DimsHW{stride, stride});\n\n        IScaleLayer* bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \"downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn4->getOutput(0), *se->getOutput(0), ElementWiseOperation::kSUM);\n    } else {\n        ew1 = network->addElementWise(input, *se->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer* relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)\n{\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../se_resnet50.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{7, 7}, weightMap[\"conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{2, 2});\n    conv1->setPaddingNd(DimsHW{3, 3});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\n\n    // Add activation layer using the ReLU algorithm.\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    pool1->setPaddingNd(DimsHW{1, 1});\n\n    IActivationLayer* x = 
bottleneck(network, weightMap, *pool1->getOutput(0), 64, 64, 1, \"layer1.0.\", 56);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 64, 1, \"layer1.1.\", 56);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 64, 1, \"layer1.2.\", 56);\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 128, 2, \"layer2.0.\", 28);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.1.\", 28);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.2.\", 28);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.3.\", 28);\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 256, 2, \"layer3.0.\", 14);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.1.\", 14);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.2.\", 14);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.3.\", 14);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.4.\", 14);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.5.\", 14);\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 512, 2, \"layer4.0.\", 7);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"layer4.1.\", 7);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"layer4.2.\", 7);\n\n    IPoolingLayer* pool2 = network->addPoolingNd(*x->getOutput(0), PoolingType::kAVERAGE, DimsHW{7, 7});\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{1, 1});\n\n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 1000, weightMap[\"fc.weight\"], weightMap[\"fc.bias\"]);\n    assert(fc1);\n\n    fc1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*fc1->getOutput(0));\n\n    // Build engine\n    
builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)\n{\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize)\n{\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    
CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    CHECK(cudaStreamSynchronize(stream));\n\n    // Release stream and buffers\n    CHECK(cudaStreamDestroy(stream));\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./se_resnet -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./se_resnet -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"se_resnet50.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;  // serialization succeeded\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"se_resnet50.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n          
  file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        return -1;\n    }\n\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 10; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print histogram of the output distribution\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\n    {\n        std::cout << prob[i] << \", \";\n        if (i % 10 == 0) std::cout << std::endl;\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "shufflenetv2/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.14)\n\nproject(\n  shufflenetv2\n  VERSION 0.1\n  LANGUAGES C CXX CUDA)\n\nif(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)\n  set(CMAKE_CUDA_ARCHITECTURES\n      60\n      70\n      72\n      75\n      80\n      86\n      89)\nendif()\n\nset(CMAKE_CXX_STANDARD 17)\nset(CMAKE_CXX_STANDARD_REQUIRED ON)\nset(CMAKE_CUDA_STANDARD 17)\nset(CMAKE_CUDA_STANDARD_REQUIRED ON)\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\nset(CMAKE_INCLUDE_CURRENT_DIR TRUE)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use static cudaruntime library\" OFF)\n\nfind_package(Threads REQUIRED)\nfind_package(CUDAToolkit REQUIRED)\nfind_package(OpenCV REQUIRED)\n\nif(NOT TARGET TensorRT::TensorRT)\n  include(FindTensorRT.cmake)\nendif()\n\nadd_executable(${PROJECT_NAME} ${PROJECT_NAME}.cpp)\n\ntarget_include_directories(${PROJECT_NAME} PRIVATE ${OpenCV_INCLUDE_DIRS})\n\ntarget_link_libraries(${PROJECT_NAME} PRIVATE Threads::Threads CUDA::cudart\n                                              TensorRT::TensorRT ${OpenCV_LIBS})\n"
  },
  {
    "path": "shufflenetv2/FindTensorRT.cmake",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nfunction(_guess_path var_name required_files)\n  set(_result \"\")\n\n  foreach(path_entry IN LISTS ARGN)\n    if(NOT EXISTS \"${path_entry}\")\n      message(DEBUG \"skip non-existing path '${path_entry}'\")\n      continue()\n    endif()\n\n    set(_ok TRUE)\n    foreach(required_file IN LISTS required_files)\n      if(NOT EXISTS \"${path_entry}/${required_file}\")\n        set(_ok FALSE)\n        message(DEBUG \"'${path_entry}' missing '${required_file}'\")\n        break()\n      endif()\n    endforeach()\n\n    if(_ok)\n      list(APPEND _result \"${path_entry}\")\n      message(DEBUG \"accept '${path_entry}'\")\n    else()\n      message(DEBUG \"reject '${path_entry}'\")\n    endif()\n  endforeach()\n\n  if(_result STREQUAL \"\")\n    message(\n      FATAL_ERROR\n        \"_guess_path(${var_name}) failed: no valid path found. required_files='${required_files}' candidates='${ARGN}'\"\n    )\n  endif()\n\n  set(${var_name}\n      \"${_result}\"\n      PARENT_SCOPE)\nendfunction()\n\n# add library\nadd_library(TensorRT IMPORTED INTERFACE)\nadd_library(TensorRT::TensorRT ALIAS TensorRT)\n\nset(TRT_VERSION\n    CACHE\n      STRING\n      \"TensorRT version, e.g. 
\\\"8.6.1.6\\\" or \\\"8.6.1.6+cuda12.0.1.011\\\", \\\"8.6.1.6.Windows10.x86_64.cuda-12.0\\\" etc\"\n)\n\nif(NOT TRT_VERSION STREQUAL \"\" AND NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  message(\n    WARNING\n      \"TRT_VERSION defined by cmake and environment variable both, using the later one\"\n  )\nendif()\n\nif(NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  set(TRT_VERSION $ENV{TRT_VERSION})\nendif()\n\nstring(REGEX MATCH \"([0-9]+)\" _match ${TRT_VERSION})\nset(TRT_MAJOR_VERSION \"${_match}\")\nunset(_match)\n\nif(WIN32)\n  set(TensorRT_DIR \"C:/Program Files/TensorRT-${TRT_VERSION}\")\n  if(NOT EXISTS \"${TensorRT_DIR}\")\n    message(\n      FATAL_ERROR\n        \"TensorRT_DIR=${TensorRT_DIR} does not exist!\"\n    )\n  endif()\n\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 10)\n    set(_modules nvinfer_10 nvinfer_plugin_10 nvinfer_vc_plugin_10\n                 nvinfer_dispatch_10 nvinfer_lean_10)\n    message(DEBUG \"Using ${_modules}\")\n  else()\n    set(_modules nvinfer nvinfer_plugin nvinfer_vc_plugin nvinfer_dispatch\n                 nvinfer_lean)\n  endif()\n\n  set(TensorRT_LIBRARY_DIR \"${TensorRT_DIR}/lib\")\n  set(TensorRT_INCLUDE_DIR \"${TensorRT_DIR}/include\")\nelseif(UNIX)\n  string(TOLOWER \"${CMAKE_SYSTEM_PROCESSOR}\" _trt_arch)\n  set(_trt_include_candidates)\n  if(_trt_arch MATCHES \"^(aarch64|arm64|arch64)$\")\n    set(_trt_include_candidates \"/usr/include/aarch64-linux-gnu\" \"/usr/include\"\n                                \"/usr/local/cuda/targets/aarch64-linux/include\")\n    set(_trt_library_candidates\n        \"/usr/local/tensorrt/targets/aarch64-linux-gnu/lib\"\n        \"/usr/lib/aarch64-linux-gnu\" \"/usr/lib/aarch64-linux-gnu/tegra\"\n        \"/usr/lib\")\n  elseif(_trt_arch MATCHES \"^(x86_64|amd64)$\")\n    set(_trt_include_candidates\n        \"/usr/local/tensorrt/targets/x86_64-linux-gnu/include\"\n        \"/usr/include/x86_64-linux-gnu\" \"/usr/include\")\n    set(_trt_library_candidates\n        
\"/usr/local/tensorrt/targets/x86_64-linux-gnu/lib\"\n        \"/usr/lib/x86_64-linux-gnu\" \"/usr/lib\")\n  else()\n    message(FATAL_ERROR \"Unknown architecture\")\n  endif()\n\n  set(_modules nvinfer nvinfer_plugin)\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 8)\n    list(APPEND _modules nvinfer_vc_plugin nvinfer_dispatch nvinfer_lean)\n  endif()\n\n  _guess_path(TensorRT_LIBRARY_DIR \"libnvinfer.so;libnvinfer_plugin.so\"\n              ${_trt_library_candidates})\n  message(STATUS \"TensorRT libraries: ${TensorRT_LIBRARY_DIR}\")\n  _guess_path(TensorRT_INCLUDE_DIR \"NvInfer.h\" ${_trt_include_candidates})\n  message(STATUS \"TensorRT includes: ${TensorRT_INCLUDE_DIR}\")\nendif()\n\nforeach(lib IN LISTS _modules)\n  find_library(\n    TensorRT_${lib}_LIBRARY\n    NAMES ${lib}\n    HINTS ${TensorRT_LIBRARY_DIR})\n  list(APPEND TensorRT_LIBRARIES ${TensorRT_${lib}_LIBRARY})\nendforeach()\n\ntarget_link_libraries(TensorRT INTERFACE ${TensorRT_LIBRARIES})\n\nmessage(STATUS \"Found TensorRT libs: ${TensorRT_LIBRARIES}\")\n\nset_target_properties(\n  TensorRT\n  PROPERTIES C_STANDARD 17\n             CXX_STANDARD 17\n             POSITION_INDEPENDENT_CODE ON\n             SKIP_BUILD_RPATH TRUE\n             BUILD_WITH_INSTALL_RPATH TRUE\n             INSTALL_RPATH \"$ORIGIN\"\n             INTERFACE_INCLUDE_DIRECTORIES \"${TensorRT_INCLUDE_DIR}\")\n\nunset(TRT_MAJOR_VERSION)\nunset(_modules)\nunset(_trt_include_candidates)\nunset(_trt_library_candidates)\nunset(_trt_arch)\n"
  },
  {
    "path": "shufflenetv2/README.md",
    "content": "# shufflenet v2\n\nShuffleNetV2 with 0.5x output channels, as described in: [ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design](https://arxiv.org/abs/1807.11164)\n\nFollowing tricks are used in this demo:\n\n- `torch.chunk` is used in shufflenet v2. We implemented the `chunk(2, dim=C)` by tensorrt plugin. Which is the simplest plugin in this tensorrtx project. You can learn the basic procedures of build tensorrt plugin.\n- shuffle layer is used, the `channel_shuffle()` in `pytorchx/shufflenet` can be implemented by two shuffle layers in tensorrt.\n- Batchnorm layer, implemented by scale layer.\n\n## Usage\n\n1. use `gen_wts.py` to generate wts file.\n\n```bash\npython3 gen_wts.py\n```\n\n2. build C++ code\n\n```bash\npushd tensorrtx/shufflenetv2\ncmake -S . -B build -G Ninja --fresh\ncmake --build build\n```\n\n3. serialize wts model to engine file.\n\n```bash\n./build/shufflenetv2 -s\n```\n\n4. run inference\n\n```bash\n./build/shufflenetv2 -i\n```\n\nThe inference output looks like:\n\n```bash\n...\n328us\n-5.481, -0.1151, 4.004, -1.47, 1.007, -5.943, -2.311, 1.708, 1.569, 0.3112, 1.589, 0.1816, -2.253, -3.261, -3.269, -0.9116, -2.132, -1.159, -2.108, -0.3869, -4.653,\n====\n...\nprediction result:\nTop: 0 idx: 285, logits: 10.44, label: Egyptian cat\nTop: 1 idx: 309, logits: 10.19, label: bee\nTop: 2 idx: 94, logits: 9.399, label: hummingbird\n```\n"
  },
  {
    "path": "shufflenetv2/gen_wts.py",
    "content": "import struct\n\nimport cv2\nimport numpy as np\nimport torch\nfrom torchvision.models.shufflenetv2 import (\n    shufflenet_v2_x0_5,\n    shufflenet_v2_x1_0,\n    shufflenet_v2_x1_5,\n    shufflenet_v2_x2_0,\n)\n\n\ndef read_imagenet_labels() -> dict[int, str]:\n    \"\"\"\n    read ImageNet 1000 labels\n\n    Returns:\n        dict[int, str]: labels dict\n    \"\"\"\n    clsid2label = {}\n    with open(\"../assets/imagenet1000_clsidx_to_labels.txt\", \"r\") as f:\n        for i in f.readlines():\n            k, v = i.split(\": \")\n            clsid2label.setdefault(int(k), v[1:-3])\n    return clsid2label\n\n\ndef preprocess(img: np.array) -> torch.Tensor:\n    \"\"\"\n    a preprocess method align with ImageNet dataset\n\n    Args:\n        img (np.array): input image\n\n    Returns:\n        torch.Tensor: preprocessed image in `NCHW` layout\n    \"\"\"\n    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0\n    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)\n    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)\n    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)\n    img = (img - mean) / std\n    img = img.transpose(2, 0, 1)[None, ...]\n    return torch.from_numpy(img)\n\n\nif __name__ == \"__main__\":\n    labels = read_imagenet_labels()\n    img = cv2.imread(\"../assets/cats.jpg\", cv2.IMREAD_COLOR)\n    img = preprocess(img)\n\n    \"\"\"\n    NOTE: comment out the model you don't want\n    \"\"\"\n    models = [\n        (\"shufflenet_v2_x0_5\", shufflenet_v2_x0_5(pretrained=True)),\n        (\"shufflenet_v2_x1_0\", shufflenet_v2_x1_0(pretrained=True)),\n        (\"shufflenet_v2_x1_5\", shufflenet_v2_x1_5(pretrained=True)),\n        (\"shufflenet_v2_x2_0\", shufflenet_v2_x2_0(pretrained=True)),\n    ]\n\n    for name, model in models:\n        model.eval()\n        with torch.inference_mode():\n            output = model(img)\n        print(f\"{name} result:\")\n        for i, batch in 
enumerate(torch.topk(output, k=3).indices):\n            for j, idx in enumerate(batch):\n                print(f\"\\tBatch: {i}, Top: {j}, logits: {output[i][idx]:.4f}, label: {labels[int(idx)]}\")\n        print(f\"{'=' * 32}\")\n\n        with open(f\"../models/{name}.wts\", \"w\") as f:\n            f.write(\"{}\\n\".format(len(model.state_dict().keys())))\n            for k, v in model.state_dict().items():\n                print(\"key: \", k)\n                print(\"value: \", v.shape)\n                vr = v.reshape(-1).cpu().numpy()\n                f.write(\"{} {}\".format(k, len(vr)))\n                for vv in vr:\n                    f.write(\" \")\n                    f.write(struct.pack(\">f\", float(vv)).hex())\n                f.write(\"\\n\")\n"
  },
  {
    "path": "shufflenetv2/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntime.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    
virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer1::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "shufflenetv2/macros.h",
    "content": "#pragma once\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#define TRT_VERSION \\\n    ((NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + (NV_TENSORRT_PATCH * 10) + NV_TENSORRT_BUILD)\n\n#if TRT_VERSION < 7220\n#error \"TensorRT >= 7.2.2 is required for this demo.\"\n#endif\n\n#if TRT_VERSION >= 8000\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n"
  },
  {
    "path": "shufflenetv2/shufflenetv2.cpp",
    "content": "#include <NvInfer.h>\n#include <chrono>\n#include <cmath>\n#include <iostream>\n#include <map>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"logging.h\"\n#include \"utils.h\"\n\nstruct ShuffleNetV2Params {\n    std::array<int32_t, 3> repeat;\n    std::array<int32_t, 5> output_chn;\n};\n\n/**\n * @brief choose one below as the model to be built\n * @param v2_x0_5\n * @param v2_x1_0\n * @param v2_x1_5\n * @param v2_x2_0\n */\n[[maybe_unused]] static constexpr ShuffleNetV2Params v2_x0_5 = {{4, 8, 4}, {24, 48, 96, 192, 1024}};\n[[maybe_unused]] static constexpr ShuffleNetV2Params v2_x1_0 = {{4, 8, 4}, {24, 116, 232, 464, 1024}};\n[[maybe_unused]] static constexpr ShuffleNetV2Params v2_x1_5 = {{4, 8, 4}, {24, 176, 352, 704, 1024}};\n[[maybe_unused]] static constexpr ShuffleNetV2Params v2_x2_0 = {{4, 8, 4}, {24, 244, 488, 976, 2048}};\n\nconstexpr const std::size_t WORKSPACE_SIZE = 16 << 20;\n\n// stuff we know about shufflenet-v2\nconstexpr const int64_t N = 1;\nconstexpr const int32_t INPUT_H = 224;\nconstexpr const int32_t INPUT_W = 224;\nconstexpr const std::array<int32_t, 2> SIZES = {3 * INPUT_H * INPUT_W, 1000};\nconstexpr const std::array<const char*, 2> NAMES = {\"data\", \"logits\"};\nstatic constexpr const bool TRT_PREPROCESS = TRT_VERSION >= 8510 ? 
true : false;\nstatic constexpr const std::array<const float, 3> mean = {0.485f, 0.456f, 0.406f};\nstatic constexpr const std::array<const float, 3> stdv = {0.229f, 0.224f, 0.225f};\n\nstatic constexpr const char* WTS_PATH = \"../models/shufflenet_v2_x0_5.wts\";\nstatic constexpr const char* ENGINE_PATH = \"../models/shufflenet.engine\";\nstatic constexpr const char* LABELS_PATH = \"../assets/imagenet1000_clsidx_to_labels.txt\";\n\nusing namespace nvinfer1;\nusing WeightMap = std::map<std::string, Weights>;\nusing M = MatrixOperation;\nusing NDCF = nvinfer1::NetworkDefinitionCreationFlag;\n\nstatic Logger gLogger;\n\nDims debug_shape(const ILayer* l) {\n    Dims dims = l->getOutput(0)->getDimensions();\n    std::cout << l->getOutput(0)->getName() << \":\\t[\";\n    for (int i = 0; i < dims.nbDims; i++) {\n        std::cout << dims.d[i] << \", \";\n    }\n    std::cout << \"]\\n\";\n    return dims;\n}\n\nILayer* addBatchNorm2d(INetworkDefinition* network, WeightMap& weightMap, ITensor& input, const std::string& lname,\n                       float eps = 1e-3f) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    auto len = weightMap[lname + \".running_var\"].count;\n    std::cout << lname << \" running_var len: \" << len << \"\\n\";\n\n    auto* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n\n    auto* shval = static_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n    static const Weights power{DataType::kFLOAT, nullptr, 
0ll};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\n/**\n * @brief a basic convolution+bn layer with an optional relu layer\n *\n * @param network network definition\n * @param m weight map\n * @param input input tensor\n * @param lname layer name\n * @param ch output channels\n * @param k kernel\n * @param s stride\n * @param p padding\n * @param g groups\n * @param with_relu true if with relu\n * @param start_index index appended to lname to name the conv layer; the bn and relu layers use the following indices\n * @return ILayer*\n */\nILayer* CBR(INetworkDefinition* network, WeightMap& m, ITensor& input, const std::string& lname, int ch, int k,\n            int s = 1, int p = 0, int g = 1, bool with_relu = true, int start_index = 0) {\n    static const Weights emptywts{DataType::kFLOAT, nullptr, 0ll};\n    auto conv_name = lname + \".\" + std::to_string(start_index++);\n    auto* conv = network->addConvolutionNd(input, ch, DimsHW{k, k}, m[conv_name + \".weight\"], emptywts);\n\n    assert(conv);\n    conv->setStrideNd(DimsHW{s, s});\n    conv->setPaddingNd(DimsHW{p, p});\n    conv->setNbGroups(g);\n    conv->setName(conv_name.c_str());\n\n    auto bn_name = lname + \".\" + std::to_string(start_index++);\n    auto* bn = addBatchNorm2d(network, m, *conv->getOutput(0), bn_name, 1e-5f);\n    bn->setName((bn_name + \".bn\").c_str());\n\n    if (with_relu) {\n        auto* relu = network->addActivation(*bn->getOutput(0), ActivationType::kRELU);\n        auto relu_name = lname + \".\" + std::to_string(start_index) + \".relu\";\n        assert(relu);\n        relu->setName(relu_name.c_str());\n        return relu;\n    }\n    return bn;\n}\n\n/**\n * @brief inverted residual block\n *\n * @param net network definition\n * @param m weight map\n * @param input input tensor\n * @param lname layer name\n * @param inch input channels\n * @param outch output 
channels\n * @param s stride\n * @return ILayer*\n */\nILayer* invertedRes(INetworkDefinition* net, WeightMap& m, ITensor& input, const std::string& lname, int inch,\n                    int outch, int s) {\n    if (s < 1 || s > 3) {\n        std::cerr << \"stride must be in [1, 3]\\n\";\n        std::abort();\n    }\n    int32_t bf /* branch features */ = outch / 2;\n    ITensor *x1{nullptr}, *x2{nullptr};\n\n    if (s == 1) {\n        auto d = input.getDimensions();\n        Dims4 stride{1, 1, 1, 1};\n        Dims4 half{d.d[0], d.d[1] / 2, d.d[2], d.d[3]};\n        auto* s1 = net->addSlice(input, Dims4{0, 0, 0, 0}, half, stride);\n        auto* s2 = net->addSlice(input, Dims4{0, d.d[1] / 2, 0, 0}, half, stride);\n        debug_shape(s2);\n        x1 = s1->getOutput(0);\n        x2 = s2->getOutput(0);\n    } else {\n        if (s > 1) {\n            auto* b1 = CBR(net, m, input, lname + \".branch1\", inch, 3, s, 1, inch, false, 0);\n            b1 = CBR(net, m, *b1->getOutput(0), lname + \".branch1\", inch, 1, 1, 0, 1, true, 2);\n            x1 = b1->getOutput(0);\n            debug_shape(b1);\n        } else {\n            x1 = &input;\n        }\n        x2 = &input;\n    }\n\n    auto* b2 = CBR(net, m, *x2, lname + \".branch2\", bf, 1, 1, 0, 1, true, 0);\n    b2 = CBR(net, m, *b2->getOutput(0), lname + \".branch2\", bf, 3, s, 1, bf, false, 3);\n    b2 = CBR(net, m, *b2->getOutput(0), lname + \".branch2\", bf, 1, 1, 0, 1, true, 5);\n    debug_shape(b2);\n\n    std::array<ITensor*, 2> cat_tensors = {x1, b2->getOutput(0)};\n    auto* cat = net->addConcatenation(cat_tensors.data(), 2);\n    auto cat_name = lname + \".cat\";\n    assert(cat);\n    cat->setName(cat_name.c_str());\n    cat->setAxis(1);\n    static_cast<void>(debug_shape(cat));\n\n    auto* sf1 = net->addShuffle(*cat->getOutput(0));\n    assert(sf1);\n    sf1->setName((lname + \".shuffle.1\").c_str());\n    auto d = cat->getOutput(0)->getDimensions();\n    auto dim_sf1 = Dims{5, {d.d[0], 2, d.d[1] / 2, 
d.d[2], d.d[3]}};\n    sf1->setReshapeDimensions(dim_sf1);\n    sf1->setSecondTranspose({0, 2, 1, 3, 4});\n\n    auto* sf2 = net->addShuffle(*sf1->getOutput(0));\n    assert(sf2);\n    sf2->setName((lname + \".shuffle.2\").c_str());\n    sf2->setReshapeDimensions(d);\n\n    return sf2;\n}\n\n/**\n * @brief Create a Engine object\n * \n * @param N max batch size\n * @param runtime runtime\n * @param builder builder\n * @param config config\n * @param dt data type\n * @param param the type of model to be built\n * @return ICudaEngine* \n */\nICudaEngine* createEngine(int32_t N, IRuntime* runtime, IBuilder* builder, IBuilderConfig* config, DataType dt,\n                          ShuffleNetV2Params param = v2_x0_5) {\n    WeightMap m = loadWeights(WTS_PATH);\n\n#if TRT_VERSION >= 11200\n    auto flag = 1U << static_cast<int>(NDCF::kSTRONGLY_TYPED);\n#elif TRT_VERSION >= 10000\n    auto flag = 0U;\n#else\n    auto flag = 1U << static_cast<int>(NDCF::kEXPLICIT_BATCH);\n#endif\n    auto* net = builder->createNetworkV2(flag);\n\n    int32_t in_ch = 3;\n    ITensor* input{nullptr};\n    if constexpr (TRT_PREPROCESS) {\n        // for simplicity, resize image on cpu side\n        dt = DataType::kUINT8;\n        input = net->addInput(NAMES[0], dt, Dims4{N, INPUT_H, INPUT_W, in_ch});\n        auto* trans = addTransformLayer(net, *input, true, mean, stdv);\n        input = trans->getOutput(0);\n    } else {\n        input = net->addInput(NAMES[0], dt, Dims4{N, in_ch, INPUT_H, INPUT_W});\n    }\n    assert(input);\n\n    /** conv1 and maxpool */\n    auto* cbr1 = CBR(net, m, *input, \"conv1\", param.output_chn[0], 3, 2, 1);\n    auto* pool1 = net->addPoolingNd(*cbr1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    pool1->setPaddingNd(DimsHW{1, 1});\n    debug_shape(pool1);\n\n    /** stage 2, 3, 4 */\n    ILayer* _layer = pool1;\n    in_ch = param.output_chn[0];\n    for (int stage = 2; stage < 5; ++stage) {\n       
 int32_t out_ch = param.output_chn[stage - 1];\n        std::string lname = \"stage\" + std::to_string(stage);\n        std::cout << \"================ \" << lname << \" ================\\n\";\n        _layer = invertedRes(net, m, *_layer->getOutput(0), lname + \".0\", in_ch, out_ch, 2);\n        debug_shape(_layer);\n        for (int j = 1; j < param.repeat[stage - 2]; ++j) {\n            _layer = invertedRes(net, m, *_layer->getOutput(0), lname + \".\" + std::to_string(j), out_ch, out_ch, 1);\n        }\n        in_ch = out_ch;\n    }\n\n    /** conv5, mean and fully connected layer */\n    auto* conv5 = CBR(net, m, *_layer->getOutput(0), \"conv5\", param.output_chn[4], 1, 1, 0);\n    auto* mean = net->addReduce(*conv5->getOutput(0), ReduceOperation::kAVG, 0xc, false);\n    mean->setName(\"global_pool(mean)\");\n    auto* fcw = net->addConstant(DimsHW{1000, 1024}, m[\"fc.weight\"]);\n    auto* fcb = net->addConstant(DimsHW{1, 1000}, m[\"fc.bias\"]);\n    auto* _fc = net->addMatrixMultiply(*mean->getOutput(0), M::kNONE, *fcw->getOutput(0), M::kTRANSPOSE);\n    auto* fc = net->addElementWise(*_fc->getOutput(0), *fcb->getOutput(0), ElementWiseOperation::kSUM);\n    fc->getOutput(0)->setName(NAMES[1]);\n    debug_shape(fc);\n\n    net->markOutput(*fc->getOutput(0));\n\n#if TRT_VERSION >= 8000\n    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, WORKSPACE_SIZE);\n    IHostMemory* mem = builder->buildSerializedNetwork(*net, *config);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(mem->data(), mem->size());\n    delete net;\n#else\n    builder->setMaxBatchSize(N);\n    config->setMaxWorkspaceSize(WORKSPACE_SIZE);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*net, *config);\n    net->destroy();\n#endif\n    std::cout << \"build finished\\n\";\n\n    // Release host memory\n    for (auto& mem : m) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(int32_t N, IRuntime* runtime, IHostMemory** 
modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(N, runtime, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n#if TRT_VERSION >= 8000\n    delete engine;\n    delete config;\n    delete builder;\n#else\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n#endif\n}\n\nauto doInference(IExecutionContext& context, void* input, int64_t batchSize) -> std::vector<std::vector<float>> {\n    ICudaEngine const& engine = context.getEngine();\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n    std::vector<void*> buffers;\n\n#if TRT_VERSION >= 8000\n    const int32_t nIO = engine.getNbIOTensors();\n#else\n    const int32_t nIO = engine.getNbBindings();\n#endif\n\n    buffers.resize(nIO);\n    for (auto i = 0; i < nIO; ++i) {\n        std::size_t size = 0;\n#if TRT_VERSION >= 8000\n        auto* tensor_name = engine.getIOTensorName(i);\n        auto s = getSize(engine.getTensorDataType(tensor_name));\n        size = s * batchSize * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n        context.setTensorAddress(tensor_name, buffers[i]);\n#else\n        const int32_t idx = engine.getBindingIndex(NAMES[i]);\n        auto s = getSize(engine.getBindingDataType(idx));\n        assert(idx == i);\n        size = s * batchSize * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n#endif\n    }\n\n#if TRT_VERSION >= 8000\n    
// enqueue outside of assert() so that inference still runs when NDEBUG is defined\n    if (!context.enqueueV3(stream)) {\n        std::cerr << \"enqueueV3 failed\\n\";\n        std::abort();\n    }\n#else\n    if (!context.enqueueV2(buffers.data(), stream, nullptr)) {\n        std::cerr << \"enqueueV2 failed\\n\";\n        std::abort();\n    }\n#endif\n\n    std::vector<std::vector<float>> prob;\n    for (int i = 1; i < nIO; ++i) {\n        std::vector<float> tmp(batchSize * SIZES[i], std::nanf(\"\"));\n        std::size_t size = batchSize * SIZES[i] * sizeof(float);\n        CHECK(cudaMemcpyAsync(tmp.data(), buffers[i], size, cudaMemcpyDeviceToHost, stream));\n        prob.emplace_back(tmp);\n    }\n    CHECK(cudaStreamSynchronize(stream));\n    // Release stream and buffers\n    CHECK(cudaStreamDestroy(stream));\n    for (auto& buffer : buffers) {\n        CHECK(cudaFree(buffer));\n    }\n    return prob;\n}\n\nint main(int argc, char** argv) {\n    checkTrtEnv();\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\\n\";\n        std::cerr << \"./shufflenet -s   // serialize model to plan file\\n\";\n        std::cerr << \"./shufflenet -d   // deserialize plan file and run inference\\n\";\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    char* trtModelStream{nullptr};\n    std::streamsize size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, runtime, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(ENGINE_PATH, std::ios::binary | std::ios::trunc);\n        if (!p) {\n            std::cerr << \"could not open plan output file\\n\";\n            return -1;\n        }\n        if (modelStream->size() > static_cast<std::size_t>(std::numeric_limits<std::streamsize>::max())) {\n            std::cerr << \"this model is too large to serialize\\n\";\n            return -1;\n        }\n        const auto* data_ptr = reinterpret_cast<const char*>(modelStream->data());\n        auto data_size = static_cast<std::streamsize>(modelStream->size());\n       
 p.write(data_ptr, data_size);\n#if TRT_VERSION >= 8000\n        delete modelStream;\n#else\n        modelStream->destroy();\n#endif\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(ENGINE_PATH, std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        return -1;\n    }\n\n#if TRT_VERSION >= 8000\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n#else\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n#endif\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference\n    void* input = nullptr;\n    std::vector<float> flat_img;\n    cv::Mat img;\n    if constexpr (TRT_PREPROCESS) {\n        // for simplicity, resize image on cpu side\n        img = cv::imread(\"../assets/cats.jpg\", cv::IMREAD_COLOR);\n        cv::resize(img, img, cv::Size(INPUT_W, INPUT_H), 0, 0, cv::INTER_LINEAR);\n        input = static_cast<void*>(img.data);\n    } else {\n        img = cv::imread(\"../assets/cats.jpg\", cv::IMREAD_COLOR);\n        flat_img = preprocess_img(img, true, mean, stdv, N, INPUT_H, INPUT_W);\n        input = flat_img.data();\n    }\n    for (int i = 0; i < 100; ++i) {\n        auto start = std::chrono::system_clock::now();\n        auto prob = doInference(*context, input, N);\n        auto end = std::chrono::system_clock::now();\n        auto period = std::chrono::duration_cast<std::chrono::microseconds>(end - start);\n        std::cout << period.count() << \"us\\n\";\n\n        for (auto& vector : prob) {\n            int idx = 0;\n          
  for (auto& v : vector) {\n                std::cout << std::setprecision(4) << v << \", \" << std::flush;\n                if (++idx > 20) {\n                    std::cout << \"\\n====\\n\";\n                    break;\n                }\n            }\n        }\n\n        if (i == 99) {\n            std::cout << \"prediction result:\\n\";\n            auto labels = loadImagenetLabelMap(LABELS_PATH);\n            int _top = 0;\n            for (auto& [idx, logits] : topk(prob[0], 3)) {\n                std::cout << \"Top: \" << _top++ << \" idx: \" << idx << \", logits: \" << logits\n                          << \", label: \" << labels[idx] << \"\\n\";\n            }\n        }\n    }\n#if TRT_VERSION >= 8000\n    delete context;\n    delete engine;\n    delete runtime;\n#else\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n#endif\n\n    return 0;\n}\n"
  },
  {
    "path": "shufflenetv2/utils.h",
    "content": "#pragma once\n#include <cuda_runtime_api.h>\n#include <algorithm>\n#include <cassert>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\nusing namespace nvinfer1;\n\n#define CHECK(status)                                     \\\n    do {                                                  \\\n        auto ret = (status);                              \\\n        if (ret != cudaSuccess) {                         \\\n            std::cerr << \"Cuda failure: \" << ret << \"\\n\"; \\\n            std::abort();                                 \\\n        }                                                 \\\n    } while (0)\n\nstatic void checkTrtEnv(int device = 0) {\n#if TRT_VERSION < 8000\n    CHECK(cudaGetDevice(&device));\n    cudaDeviceProp prop{};\n    CHECK(cudaGetDeviceProperties(&prop, device));\n    const int sm = prop.major * 10 + prop.minor;\n    if (sm > 86) {\n        std::cerr << \"TensorRT < 8 does not support SM > 86 on this GPU.\";\n        std::abort();\n    }\n#endif\n}\n\n/**\n * @brief TensorRT weight files have a simple space delimited format:\n * [type] [size] <data x size in hex>\n * \n * @param file input weight file path\n * @return std::map<std::string, nvinfer1::Weights> \n */\nstatic std::map<std::string, nvinfer1::Weights> loadWeights(const std::string& file) {\n    std::cout << \"Loading weights: \" << file << \"\\n\";\n    std::map<std::string, nvinfer1::Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n\n        // Read name and type of blob\n        std::string name;\n        input >> 
name >> std::dec >> wt.count;\n\n        // Load blob (allocated with malloc because createEngine() releases the weights with free())\n        auto* val = static_cast<uint32_t*>(malloc(sizeof(uint32_t) * wt.count));\n        input >> std::hex;\n        for (auto x = 0ll; x < wt.count; ++x) {\n            input >> val[x];\n        }\n        wt.values = val;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\n/**\n * @brief a preprocess function aligned with the ImageNet preprocessing in torchvision; supports only 3-channel images\n * \n * @param img opencv image with BGR layout\n * @param bgr2rgb whether to convert BGR to RGB\n * @param mean subtract mean\n * @param std divide std\n * @param n batch size\n * @param h resize height\n * @param w resize width\n * @return std::vector<float> contiguous flattened image data in float32 type\n */\nstatic std::vector<float> preprocess_img(cv::Mat& img, bool bgr2rgb, const std::array<const float, 3>& mean,\n                                         const std::array<const float, 3>& std, int n, int h, int w) {\n    const auto c = img.channels();\n    const auto size = c * h * w;\n    if (c != 3) {\n        std::cerr << \"this demo only supports 3 channel input image.\\n\";\n        std::abort();\n    }\n    if (bgr2rgb) {\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    }\n    cv::resize(img, img, cv::Size(w, h), 0, 0, cv::INTER_LINEAR);\n    img.convertTo(img, CV_32FC3, 1.f / 255);\n    img = (img - cv::Scalar(mean[0], mean[1], mean[2])) / cv::Scalar(std[0], std[1], std[2]);\n    std::vector<float> chw(static_cast<std::size_t>(n) * c * h * w, 0.f);\n\n    // fill all batch with the same input image\n    for (int i = 0; i < n; ++i) {\n        for (int y = 0; y < h; ++y) {\n            for (int x = 0; x < w; ++x) {\n                const cv::Vec3f v = img.at<cv::Vec3f>(y, x);\n                chw[i * size + 0 * h * w + y * w + x] = v[0];\n                chw[i * size + 1 * h * w + y * w + x] = v[1];\n                chw[i * size + 2 * h * w + y * w + x] = v[2];\n            }\n        }\n    }\n    return chw;\n}\n\nstatic auto 
topk(const std::vector<float>& v, int k) -> std::vector<std::pair<int, float>> {\n    if (k <= 0)\n        return {};\n    auto stride = std::min<std::ptrdiff_t>(k, static_cast<int64_t>(v.size()));\n\n    std::vector<int> idx(v.size());\n    std::iota(idx.begin(), idx.end(), 0);\n\n    // sort only the first `stride` entries; using `k` here would run past idx.end() when k > v.size()\n    std::partial_sort(idx.begin(), idx.begin() + stride, idx.end(), [&](int a, int b) { return v[a] > v[b]; });\n\n    std::vector<std::pair<int, float>> out;\n    out.reserve(stride);\n    for (auto i = 0; i < stride; ++i)\n        out.emplace_back(idx[i], v[idx[i]]);\n    return out;\n}\n\nstatic std::map<int, std::string> loadImagenetLabelMap(const std::string& path) {\n    std::map<int, std::string> labels;\n    std::ifstream in(path);\n    if (!in.is_open()) {\n        return labels;\n    }\n    std::string line;\n    while (std::getline(in, line)) {\n        auto colon = line.find(':');\n        if (colon == std::string::npos) {\n            continue;\n        }\n        auto first_quote = line.find('\\'', colon);\n        if (first_quote == std::string::npos) {\n            continue;\n        }\n        auto second_quote = line.find('\\'', first_quote + 1);\n        if (second_quote == std::string::npos) {\n            continue;\n        }\n        int idx = std::stoi(line.substr(0, colon));\n        labels[idx] = line.substr(first_quote + 1, second_quote - first_quote - 1);\n    }\n    return labels;\n}\n\nstatic ILayer* addTransformLayer(INetworkDefinition* network, ITensor& input, bool bgr2rgb,\n                                 const std::array<const float, 3>& mean, const std::array<const float, 3>& std) {\n    struct ScaleParams {\n        std::array<float, 3> shift;\n        std::array<float, 3> scale;\n    };\n    static std::vector<std::unique_ptr<ScaleParams>> gScaleParams;\n    auto params = std::make_unique<ScaleParams>();\n    params->shift = {-mean[0] / std[0], -mean[1] / std[1], -mean[2] / std[2]};\n    params->scale = {1.f / (std[0] * 255.f), 1.f / (std[1] * 255.f), 1.f 
/ (std[2] * 255.f)};\n\n    static const Weights empty{DataType::kFLOAT, nullptr, 0ll};\n    const Weights shift{DataType::kFLOAT, params->shift.data(), 3ll};\n    const Weights scale{DataType::kFLOAT, params->scale.data(), 3ll};\n\n    gScaleParams.emplace_back(std::move(params));\n\n    ITensor* in = &input;\n    if (input.getType() != DataType::kFLOAT) {\n#if TRT_VERSION >= 8000\n        auto* cast = network->addCast(input, DataType::kFLOAT);\n        assert(cast);\n        cast->setName(\"Cast to FP32\");\n        in = cast->getOutput(0);\n#else\n        auto* identity = network->addIdentity(input);\n        assert(identity);\n        identity->setName(\"Convert to FP32\");\n        identity->setOutputType(0, DataType::kFLOAT);\n        in = identity->getOutput(0);\n#endif\n    }\n    // Convert from NHWC to NCHW\n    auto* perm = network->addShuffle(*in);\n    assert(perm);\n    perm->setName(\"NHWC -> NCHW\");\n    perm->setFirstTranspose(Permutation{0, 3, 1, 2});\n\n    // Convert from BGR to RGB (optional)\n    ITensor* data{nullptr};\n    if (bgr2rgb) {\n        auto add_slice = [&](int c, const char* name) -> ITensor* {\n            auto dims = perm->getOutput(0)->getDimensions();\n            Dims4 start = {0, c, 0, 0}, stride = {1, 1, 1, 1};\n            Dims4 size = {dims.d[0], 1, dims.d[2], dims.d[3]};\n            auto* _slice = network->addSlice(*perm->getOutput(0), start, size, stride);\n            _slice->setName(name);\n            assert(_slice && _slice->getNbOutputs() == 1);\n            return _slice->getOutput(0);\n        };\n        std::array<ITensor*, 3> channels = {add_slice(2, \"R\"), add_slice(1, \"G\"), add_slice(0, \"B\")};\n        auto* cat = network->addConcatenation(channels.data(), 3);\n        assert(cat);\n        cat->setName(\"RGB\");\n        cat->setAxis(1);\n        data = cat->getOutput(0);\n    } else {\n        data = perm->getOutput(0);\n    }\n\n    // Normalize\n    auto* trans = network->addScale(*data, 
ScaleMode::kCHANNEL, shift, scale, empty);\n    assert(trans);\n    trans->setName(\"mean & std\");\n#if TRT_VERSION >= 8000\n    trans->setChannelAxis(1);\n#endif\n    return trans;\n}\n\nstatic size_t getSize(DataType dt) {\n    switch (dt) {\n#if TRT_VERSION >= 8510\n        case DataType::kUINT8:\n#endif\n        case DataType::kINT8:\n            return sizeof(int8_t);\n        case DataType::kFLOAT:\n            return sizeof(float);\n        case DataType::kHALF:\n            return sizeof(int16_t);\n        case DataType::kINT32:\n            return sizeof(int32_t);\n        default: {\n            std::cerr << \"Unsupported data type\\n\";\n            std::abort();\n        }\n    }\n}\n"
  },
  {
    "path": "squeezenet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.14)\n\nproject(\n  squeezenet\n  VERSION 0.1\n  LANGUAGES C CXX CUDA)\n\nif(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)\n  set(CMAKE_CUDA_ARCHITECTURES\n      60\n      70\n      72\n      75\n      80\n      86\n      89)\nendif()\n\nset(CMAKE_CXX_STANDARD 17)\nset(CMAKE_CXX_STANDARD_REQUIRED ON)\nset(CMAKE_CUDA_STANDARD 17)\nset(CMAKE_CUDA_STANDARD_REQUIRED ON)\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\nset(CMAKE_INCLUDE_CURRENT_DIR TRUE)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use static cudaruntime library\" OFF)\n\nfind_package(Threads REQUIRED)\nfind_package(CUDAToolkit REQUIRED)\nfind_package(OpenCV REQUIRED)\n\nif(NOT TARGET TensorRT::TensorRT)\n  include(FindTensorRT.cmake)\nelse()\n  message(\"TensorRT has been found, skipping for ${PROJECT_NAME}\")\nendif()\n\nadd_executable(${PROJECT_NAME} \"${PROJECT_NAME}.cpp\")\ntarget_include_directories(${PROJECT_NAME} PRIVATE ${OpenCV_INCLUDE_DIRS})\ntarget_link_libraries(${PROJECT_NAME} PUBLIC Threads::Threads CUDA::cudart\n                                             TensorRT::TensorRT ${OpenCV_LIBS})\n"
  },
  {
    "path": "squeezenet/FindTensorRT.cmake",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nset(TRT_VERSION\n    $ENV{TRT_VERSION}\n    CACHE\n      STRING\n      \"TensorRT version, e.g. \\\"8.6.1.6\\\" or \\\"8.6.1.6+cuda12.0.1.011\\\", etc\")\n\nfunction(_guess_path var_name required_files)\n  set(_result \"\")\n\n  foreach(path_entry IN LISTS ARGN)\n    if(NOT EXISTS \"${path_entry}\")\n      message(DEBUG \"skip non-existing path '${path_entry}'\")\n      continue()\n    endif()\n\n    set(_ok TRUE)\n    foreach(required_file IN LISTS required_files)\n      if(NOT EXISTS \"${path_entry}/${required_file}\")\n        set(_ok FALSE)\n        message(DEBUG \"'${path_entry}' missing '${required_file}'\")\n        break()\n      endif()\n    endforeach()\n\n    if(_ok)\n      list(APPEND _result \"${path_entry}\")\n      message(DEBUG \"accept '${path_entry}'\")\n    else()\n      message(DEBUG \"reject '${path_entry}'\")\n    endif()\n  endforeach()\n\n  if(_result STREQUAL \"\")\n    message(\n      FATAL_ERROR\n        \"_guess_path(${var_name}) failed: no valid path found. 
required_files='${required_files}' candidates='${ARGN}'\"\n    )\n  endif()\n\n  set(${var_name}\n      \"${_result}\"\n      PARENT_SCOPE)\nendfunction()\n\n# find TensorRT include folder\nif(NOT DEFINED TensorRT_INCLUDE_DIR)\n  if(CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n    _guess_path(\n      TensorRT_INCLUDE_DIR \"NvInfer.h\" \"/usr/include/aarch64-linux-gnu\"\n      \"/usr/include\" \"/usr/local/cuda/targets/aarch64-linux/include\")\n  else()\n    _guess_path(\n      TensorRT_INCLUDE_DIR \"NvInfer.h\"\n      \"/usr/local/tensorrt/targets/x86_64-linux-gnu/include\"\n      \"/usr/include/x86_64-linux-gnu\" \"/usr/include\")\n  endif()\n  message(STATUS \"TensorRT includes: ${TensorRT_INCLUDE_DIR}\")\nendif()\n\n# find TensorRT library folder\nif(NOT TensorRT_LIBRARY_DIR)\n  if(CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n    _guess_path(\n      TensorRT_LIBRARY_DIR \"libnvinfer.so;libnvinfer_plugin.so\"\n      \"/usr/lib/aarch64-linux-gnu;/usr/lib/aarch64-linux-gnu/tegra\" \"/usr/lib\")\n  else()\n    _guess_path(\n      TensorRT_LIBRARY_DIR\n      \"libnvinfer.so;libnvinfer_plugin.so\"\n      \"/usr/lib/x86_64-linux-gnu;/usr/local/tensorrt/targets/x86_64-linux-gnu/lib;/usr/lib\"\n    )\n  endif()\n  message(STATUS \"TensorRT libraries: ${TensorRT_LIBRARY_DIR}\")\nendif()\n\nset(TensorRT_LIBRARIES)\n\n# process for different TensorRT version\nif(DEFINED TRT_VERSION AND NOT TRT_VERSION STREQUAL \"\")\n  string(REGEX MATCH \"([0-9]+)\" _match ${TRT_VERSION})\n  set(TRT_MAJOR_VERSION \"${_match}\")\n  set(_modules nvinfer nvinfer_plugin)\n  unset(_match)\n\n  if(TRT_MAJOR_VERSION GREATER_EQUAL 8)\n    list(APPEND _modules nvinfer_vc_plugin nvinfer_dispatch nvinfer_lean)\n  endif()\nelse()\n  message(FATAL_ERROR \"Please set an environment variable \\\"TRT_VERSION\\\"\")\nendif()\n\n# find and add all modules of TensorRT into list\nforeach(lib IN LISTS _modules)\n  find_library(\n    
TensorRT_${lib}_LIBRARY\n    NAMES ${lib}\n    HINTS ${TensorRT_LIBRARY_DIR})\n  list(APPEND TensorRT_LIBRARIES ${TensorRT_${lib}_LIBRARY})\nendforeach()\n\n# report the libraries only after they have been collected\nmessage(STATUS \"Found TensorRT lib: ${TensorRT_LIBRARIES}\")\n\n# make the \"TensorRT target\"\nadd_library(TensorRT IMPORTED INTERFACE)\nadd_library(TensorRT::TensorRT ALIAS TensorRT)\ntarget_link_libraries(TensorRT INTERFACE ${TensorRT_LIBRARIES})\n\nset_target_properties(\n  TensorRT\n  PROPERTIES C_STANDARD 17\n             CXX_STANDARD 17\n             POSITION_INDEPENDENT_CODE ON\n             SKIP_BUILD_RPATH TRUE\n             BUILD_WITH_INSTALL_RPATH TRUE\n             INSTALL_RPATH \"$ORIGIN\"\n             INTERFACE_INCLUDE_DIRECTORIES \"${TensorRT_INCLUDE_DIR}\")\n\nunset(TRT_MAJOR_VERSION)\nunset(_modules)\n"
  },
  {
    "path": "squeezenet/README.md",
    "content": "# squeezenet v1.1\n\nSqueezeNet 1.1 model from the official SqueezeNet repo\n<https://github.com/DeepScale/SqueezeNet/tree/master/SqueezeNet_v1.1>\n\nSqueezeNet 1.1 has 2.4x less computation and slightly fewer parameters\nthan SqueezeNet 1.0, without sacrificing accuracy.\n\nFor the PyTorch implementation, you can refer to [pytorchx/squeezenet](https://github.com/wang-xinyu/pytorchx/tree/master/squeezenet)\n\n## Usage\n\n1. Use `gen_wts.py` to generate the wts file\n\n```bash\npython3 gen_wts.py\n```\n\n2. Build the C++ code\n\n```bash\npushd tensorrtx/squeezenet\ncmake -S . -B build -G Ninja --fresh\ncmake --build build\n```\n\n3. Serialize the wts model to an engine file\n\n```bash\n./build/squeezenet -s\n```\n\n4. Run inference\n\n```bash\n./build/squeezenet -d\n```\n\nThe output looks like:\n\n```bash\n...\n====\nExecution time: 183us\n3.481, 3.901, 4.438, 4.346, 3.3, 6.519, 6.03, 10.89, 10.45, 10.39, 8.874, 5.889, 9.529, 3.703, 5.865, 6.982, 8.894, 7.76, 4.599, 7.89, 4.795,\n====\nprediction result:\nTop: 0 idx: 281, logits: 25.18, label: tabby, tabby cat\nTop: 1 idx: 282, logits: 23.2, label: tiger cat\nTop: 2 idx: 309, logits: 22.72, label: bee\n```\n"
  },
  {
    "path": "squeezenet/gen_wts.py",
    "content": "import struct\n\nimport cv2\nimport numpy as np\nimport torch\nimport torchvision\n\n\ndef read_imagenet_labels() -> dict[int, str]:\n    \"\"\"\n    read ImageNet 1000 labels\n\n    Returns:\n        dict[int, str]: labels dict\n    \"\"\"\n    clsid2label = {}\n    with open(\"../assets/imagenet1000_clsidx_to_labels.txt\", \"r\") as f:\n        for i in f.readlines():\n            k, v = i.split(\": \")\n            clsid2label.setdefault(int(k), v[1:-3])\n    return clsid2label\n\n\ndef preprocess(img: np.array) -> torch.Tensor:\n    \"\"\"\n    a preprocess method align with ImageNet dataset\n\n    Args:\n        img (np.array): input image\n\n    Returns:\n        torch.Tensor: preprocessed image in `NCHW` layout\n    \"\"\"\n    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0\n    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)\n    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)\n    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)\n    img = (img - mean) / std\n    img = img.transpose(2, 0, 1)[None, ...]\n    return torch.from_numpy(img)\n\n\ndef main():\n    labels = read_imagenet_labels()\n\n    model = torchvision.models.squeezenet1_1(pretrained=True)\n    model = model.eval()\n\n    img = cv2.imread(\"../assets/cats.jpg\", cv2.IMREAD_COLOR)\n    img = preprocess(img)\n\n    with torch.inference_mode():\n        output = model(img)\n        for i, batch in enumerate(torch.topk(output, k=3).indices):\n            for j, idx in enumerate(batch):\n                print(f\"\\tBatch: {i}, Top: {j}, logits: {output[i][idx]:.4f}, label: {labels[int(idx)]}\")\n        print(f\"{'=' * 32}\")\n\n    with open(\"../models/squeezenet.wts\", \"w\") as f:\n        f.write(\"{}\\n\".format(len(model.state_dict().keys())))\n        for k, v in model.state_dict().items():\n            vr = v.reshape(-1).cpu().numpy()\n            f.write(\"{} {} \".format(k, len(vr)))\n            print(k, v.shape)\n 
           for vv in vr:\n                f.write(\" \")\n                f.write(struct.pack(\">f\", float(vv)).hex())\n            f.write(\"\\n\")\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "squeezenet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntime.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    
virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "squeezenet/macros.h",
    "content": "#pragma once\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#define TRT_VERSION \\\n    ((NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + (NV_TENSORRT_PATCH * 10) + NV_TENSORRT_BUILD)\n\n#if TRT_VERSION >= 8000\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n"
  },
  {
    "path": "squeezenet/squeezenet.cpp",
    "content": "#include <NvInfer.h>\n#include <chrono>\n#include <cmath>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <opencv2/opencv.hpp>\n#include <vector>\n#include \"logging.h\"\n#include \"utils.h\"\n\n// stuff we know about squeezenet\nstatic constexpr const int N = 1;\nstatic constexpr const int INPUT_H = 224;\nstatic constexpr const int INPUT_W = 224;\nstatic constexpr const int SIZES[] = {3 * INPUT_H * INPUT_W, N * 1000};\nstatic constexpr const char* NAMES[] = {\"data\", \"prob\"};\nstatic constexpr const bool TRT_PREPROCESS = TRT_VERSION >= 8510 ? true : false;\nstatic constexpr const float mean[3] = {0.485f, 0.456f, 0.406f};\nstatic constexpr const float stdv[3] = {0.229f, 0.224f, 0.225f};\n\nstatic constexpr const char* WTS_PATH = \"../models/squeezenet.wts\";\nstatic constexpr const char* ENGINE_PATH = \"../models/squeezenet.engine\";\nstatic constexpr const char* LABELS_PATH = \"../assets/imagenet1000_clsidx_to_labels.txt\";\n\nusing namespace nvinfer1;\nusing WeightMap = std::map<std::string, Weights>;\n\nstatic Logger gLogger;\n\nILayer* fire(INetworkDefinition* network, WeightMap& m, ITensor& input, const std::string& lname,\n             int32_t squeeze_planes, int32_t e1x1_planes, int32_t e3x3_planes) {\n    auto* conv1 = network->addConvolutionNd(input, squeeze_planes, DimsHW{1, 1}, m[lname + \"squeeze.weight\"],\n                                            m[lname + \"squeeze.bias\"]);\n    assert(conv1);\n    auto* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU)->getOutput(0);\n\n    std::string _c = lname + \"expand1x1\";\n    auto* conv2 = network->addConvolutionNd(*relu1, e1x1_planes, DimsHW{1, 1}, m[_c + \".weight\"], m[_c + \".bias\"]);\n    assert(conv2);\n    auto* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    _c = lname + \"expand3x3\";\n    auto* conv3 = network->addConvolutionNd(*relu1, e3x3_planes, DimsHW{3, 3}, m[_c + 
\".weight\"], m[_c + \".bias\"]);\n    assert(conv3);\n    conv3->setPaddingNd(DimsHW{1, 1});\n    auto* relu3 = network->addActivation(*conv3->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n\n    ITensor* inputTensors[] = {relu2->getOutput(0), relu3->getOutput(0)};\n    auto* concat = network->addConcatenation(inputTensors, 2);\n    assert(concat);\n    return concat;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(int32_t N, IRuntime* runtime, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    auto weightMap = loadWeights(WTS_PATH);\n#if TRT_VERSION >= 10000\n    auto* network = builder->createNetworkV2(0);\n#else\n    auto* network = builder->createNetworkV2(1u << static_cast<int>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n#endif\n\n    ITensor* data{nullptr};\n    if constexpr (TRT_PREPROCESS) {\n#if TRT_VERSION > 8510\n        dt = DataType::kUINT8;\n#else\n        dt = DataType::kINT8;\n#endif\n        data = network->addInput(NAMES[0], dt, Dims4{N, INPUT_H, INPUT_W, 3});\n        auto* trans = addTransformLayer(network, *data, true, mean, stdv);\n        data = trans->getOutput(0);\n    } else {\n        data = network->addInput(NAMES[0], dt, Dims4{N, 3, INPUT_H, INPUT_W});\n    }\n    assert(data);\n\n    auto* conv1 = network->addConvolutionNd(*data, 64, DimsHW{3, 3}, weightMap[\"features.0.weight\"],\n                                            weightMap[\"features.0.bias\"]);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{2, 2});\n    auto* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    auto* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n    pool1->setPaddingMode(PaddingMode::kEXPLICIT_ROUND_UP);\n\n    auto* cat1 = fire(network, weightMap, *pool1->getOutput(0), \"features.3.\", 16, 64, 64);\n    cat1 = fire(network, 
weightMap, *cat1->getOutput(0), \"features.4.\", 16, 64, 64);\n\n    auto* pool2 = network->addPoolingNd(*cat1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{2, 2});\n    pool2->setPaddingMode(PaddingMode::kEXPLICIT_ROUND_UP);\n    // pool2->setPostPadding(DimsHW{1, 1});\n\n    cat1 = fire(network, weightMap, *pool2->getOutput(0), \"features.6.\", 32, 128, 128);\n    cat1 = fire(network, weightMap, *cat1->getOutput(0), \"features.7.\", 32, 128, 128);\n\n    auto* pool3 = network->addPoolingNd(*cat1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool3);\n    pool3->setStrideNd(DimsHW{2, 2});\n    pool3->setPostPadding(DimsHW{1, 1});\n    pool3->setPaddingMode(PaddingMode::kEXPLICIT_ROUND_UP);\n\n    cat1 = fire(network, weightMap, *pool3->getOutput(0), \"features.9.\", 48, 192, 192);\n    cat1 = fire(network, weightMap, *cat1->getOutput(0), \"features.10.\", 48, 192, 192);\n    cat1 = fire(network, weightMap, *cat1->getOutput(0), \"features.11.\", 64, 256, 256);\n    cat1 = fire(network, weightMap, *cat1->getOutput(0), \"features.12.\", 64, 256, 256);\n\n    // classifier\n    auto* conv2 = network->addConvolutionNd(*cat1->getOutput(0), 1000, DimsHW{1, 1}, weightMap[\"classifier.1.weight\"],\n                                            weightMap[\"classifier.1.bias\"]);\n    assert(conv2);\n    auto* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n    auto* pool4 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kAVERAGE, DimsHW{14, 14});\n    assert(pool4);\n\n    pool4->getOutput(0)->setName(NAMES[1]);\n    network->markOutput(*pool4->getOutput(0));\n\n    // Build engine\n#if TRT_VERSION >= 8000\n    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, WORKSPACE_SIZE);\n    IHostMemory* mem = builder->buildSerializedNetwork(*network, *config);\n    auto* engine = runtime->deserializeCudaEngine(mem->data(), mem->size());\n    delete 
network;\n#else\n    builder->setMaxBatchSize(N);\n    config->setMaxWorkspaceSize(WORKSPACE_SIZE);\n    auto* engine = builder->buildEngineWithConfig(*network, *config);\n    network->destroy();\n#endif\n    std::cout << \"build out\" << std::endl;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(int32_t N, IRuntime* runtime, IHostMemory** modelStream) {\n    // Create builder\n    auto* builder = createInferBuilder(gLogger);\n    auto* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    auto* engine = createEngine(N, runtime, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n#if TRT_VERSION >= 8000\n    delete engine;\n    delete config;\n    delete builder;\n#else\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n#endif\n}\n\nstd::vector<std::vector<float>> doInference(IExecutionContext& context, void* input, int32_t batch_size) {\n    const auto& engine = context.getEngine();\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n    std::vector<void*> buffers;\n\n#if TRT_VERSION >= 8000\n    const int32_t nIO = engine.getNbIOTensors();\n#else\n    const int32_t nIO = engine.getNbBindings();\n#endif\n\n    buffers.resize(nIO);\n    for (auto i = 0; i < nIO; ++i) {\n        std::size_t size = 0;\n#if TRT_VERSION >= 8000\n        const auto* tensor_name = engine.getIOTensorName(i);\n        auto s = getSize(engine.getTensorDataType(tensor_name));\n        size = s * batch_size * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n        context.setTensorAddress(tensor_name, buffers[i]);\n#else\n        const 
int32_t idx = engine.getBindingIndex(NAMES[i]);\n        auto s = getSize(engine.getBindingDataType(idx));\n        assert(idx == i);\n        size = s * batch_size * SIZES[i];\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n#endif\n    }\n\n    // keep the enqueue call outside assert() so it still runs when NDEBUG is defined\n#if TRT_VERSION >= 8000\n    const bool enqueued = context.enqueueV3(stream);\n#else\n    const bool enqueued = context.enqueueV2(buffers.data(), stream, nullptr);\n#endif\n    if (!enqueued) {\n        throw std::runtime_error(\"failed to enqueue inference\");\n    }\n\n    std::vector<std::vector<float>> prob;\n    for (int i = 1; i < nIO; ++i) {\n        std::vector<float> tmp(batch_size * SIZES[i], std::nan(\"\"));\n        std::size_t size = batch_size * SIZES[i] * sizeof(float);\n        CHECK(cudaMemcpyAsync(tmp.data(), buffers[i], size, cudaMemcpyDeviceToHost, stream));\n        // move instead of copy so the buffer targeted by the async copy is the one returned\n        prob.emplace_back(std::move(tmp));\n    }\n    CHECK(cudaStreamSynchronize(stream));\n\n    CHECK(cudaStreamDestroy(stream));\n    for (auto i = 0; i < nIO; ++i) {\n        CHECK(cudaFree(buffers[i]));\n    }\n    return prob;\n}\n\nint main(int argc, char** argv) {\n    checkTrtEnv();\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./squeezenet -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./squeezenet -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    auto* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    char* trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, runtime, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(ENGINE_PATH, std::ios::binary | std::ios::trunc);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return 
-1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n#if TRT_VERSION >= 8000\n        delete modelStream;\n#else\n        modelStream->destroy();\n#endif\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(ENGINE_PATH, std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        return -1;\n    }\n\n#if TRT_VERSION >= 8000\n    auto* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n#else\n    auto* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n#endif\n    assert(engine != nullptr);\n    auto* context = engine->createExecutionContext();\n    assert(context != nullptr);\n\n    void* input = nullptr;\n    std::vector<float> flat_img;\n    cv::Mat img;\n    if constexpr (TRT_PREPROCESS) {\n        // for simplicity, resize image on cpu side\n        img = cv::imread(\"../assets/cats.jpg\", cv::IMREAD_COLOR);\n        cv::resize(img, img, cv::Size(INPUT_W, INPUT_H), 0, 0, cv::INTER_LINEAR);\n        input = static_cast<void*>(img.data);\n    } else {\n        img = cv::imread(\"../assets/cats.jpg\", cv::IMREAD_COLOR);\n        flat_img = preprocess_img(img, true, mean, stdv, N, INPUT_H, INPUT_W);\n        input = flat_img.data();\n    }\n\n    for (int32_t i = 0; i < 100; ++i) {\n        auto _start = std::chrono::system_clock::now();\n        auto prob = doInference(*context, input, N);\n        auto _end = std::chrono::system_clock::now();\n        auto _time = std::chrono::duration_cast<std::chrono::microseconds>(_end - _start).count();\n        std::cout << \"Execution time: \" << _time << \"us\" << std::endl;\n\n        for (auto vector : prob) 
{\n            int idx = 0;\n            for (auto v : vector) {\n                std::cout << std::setprecision(4) << v << \", \" << std::flush;\n                if (++idx > 20) {\n                    std::cout << \"\\n====\" << std::endl;\n                    break;\n                }\n            }\n        }\n\n        if (i == 99) {\n            std::cout << \"prediction result: \" << std::endl;\n            auto labels = loadImagenetLabelMap(LABELS_PATH);\n            int _top = 0;\n            for (auto& [idx, logits] : topk(prob[0], 3)) {\n                std::cout << \"Top: \" << _top++ << \" idx: \" << idx << \", logits: \" << logits\n                          << \", label: \" << labels[idx] << std::endl;\n            }\n        }\n    }\n\n    delete[] trtModelStream;\n    // Destroy the engine\n#if TRT_VERSION >= 8000\n    delete context;\n    delete engine;\n    delete runtime;\n#else\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n#endif\n    return 0;\n}\n"
  },
  {
    "path": "squeezenet/utils.h",
    "content": "#pragma once\n#include <NvInfer.h>\n#include <cuda_runtime_api.h>\n#include <algorithm>\n#include <cassert>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n#include <stdexcept>\n#include <string>\n#include <vector>\n\nusing namespace nvinfer1;\n\n#define WORKSPACE_SIZE (16 << 20)\n\n#define CHECK(status)                                          \\\n    do {                                                       \\\n        auto ret = (status);                                   \\\n        if (ret != cudaSuccess) {                              \\\n            std::cerr << \"Cuda failure: \" << ret << std::endl; \\\n            abort();                                           \\\n        }                                                      \\\n    } while (0)\n\nstatic void checkTrtEnv(int device = 0) {\n#if TRT_VERSION < 7220\n#error \"TensorRT >= 7.2.2 is required for this demo.\"\n#endif\n#if TRT_VERSION < 8000\n    CHECK(cudaGetDevice(&device));\n    cudaDeviceProp prop{};\n    CHECK(cudaGetDeviceProperties(&prop, device));\n    const int sm = prop.major * 10 + prop.minor;\n    if (sm > 86) {\n        throw std::runtime_error(\"TensorRT < 8 does not support SM > 86 on this GPU.\");\n    }\n#endif\n}\n\n/**\n * @brief TensorRT weight files have a simple space delimited format:\n * [type] [size] <data x size in hex>\n * \n * @param file input weight file path\n * @return std::map<std::string, nvinfer1::Weights> \n */\nstatic std::map<std::string, nvinfer1::Weights> loadWeights(const std::string& file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, nvinfer1::Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    
while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> wt.count;\n\n        // Load blob (each value is one 32-bit word stored as hex)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(*val) * wt.count));\n        for (uint32_t x = 0; x < wt.count; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\n/**\n * @brief a preprocess function aligned with the ImageNet preprocessing in torchvision; only supports 3-channel images\n * \n * @param img opencv image with BGR layout\n * @param bgr2rgb whether to convert BGR to RGB\n * @param mean subtract mean\n * @param std divide std\n * @param n batch size\n * @param h resize height\n * @param w resize width\n * @return std::vector<float> contiguous flattened image data in float32 type\n */\nstatic std::vector<float> preprocess_img(cv::Mat& img, bool bgr2rgb, const float mean[3], const float std[3], int n,\n                                         int h, int w) {\n    const int c = img.channels();\n    const std::size_t size = c * h * w;\n    if (c != 3) {\n        throw std::runtime_error(\"this demo only supports 3 channel input image.\");\n    }\n    if (bgr2rgb) {\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    }\n    cv::resize(img, img, cv::Size(w, h), 0, 0, cv::INTER_LINEAR);\n    img.convertTo(img, CV_32FC3, 1.f / 255);\n    img = (img - cv::Scalar(mean[0], mean[1], mean[2])) / cv::Scalar(std[0], std[1], std[2]);\n    std::vector<float> chw(n * c * h * w, 0.f);\n\n    // fill all batch with the same input image\n    for (int i = 0; i < n; ++i) {\n        for (int y = 0; y < h; ++y) {\n            for (int x = 0; x < w; ++x) {\n                const cv::Vec3f v = img.at<cv::Vec3f>(y, x);\n                chw[i * size + 0 * h * w + y * w + x] = v[0];\n                chw[i * size + 1 * h * w + y * w + x] = 
v[1];\n                chw[i * size + 2 * h * w + y * w + x] = v[2];\n            }\n        }\n    }\n    return chw;\n}\n\nstatic std::vector<std::pair<int, float>> topk(const std::vector<float>& v, int k) {\n    if (k <= 0)\n        return {};\n    k = std::min<int>(k, v.size());\n\n    std::vector<int> idx(v.size());\n    std::iota(idx.begin(), idx.end(), 0);\n\n    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(), [&](int a, int b) { return v[a] > v[b]; });\n\n    std::vector<std::pair<int, float>> out;\n    out.reserve(k);\n    for (int i = 0; i < k; ++i)\n        out.emplace_back(idx[i], v[idx[i]]);\n    return out;\n}\n\nstatic std::map<int, std::string> loadImagenetLabelMap(const std::string& path) {\n    std::map<int, std::string> labels;\n    std::ifstream in(path);\n    if (!in.is_open()) {\n        return labels;\n    }\n    std::string line;\n    while (std::getline(in, line)) {\n        auto colon = line.find(':');\n        if (colon == std::string::npos) {\n            continue;\n        }\n        auto first_quote = line.find('\\'', colon);\n        if (first_quote == std::string::npos) {\n            continue;\n        }\n        auto second_quote = line.find('\\'', first_quote + 1);\n        if (second_quote == std::string::npos) {\n            continue;\n        }\n        int idx = std::stoi(line.substr(0, colon));\n        labels[idx] = line.substr(first_quote + 1, second_quote - first_quote - 1);\n    }\n    return labels;\n}\n\nstatic ILayer* addTransformLayer(INetworkDefinition* network, ITensor& input, bool bgr2rgb, const float mean[3],\n                                 const float std[3]) {\n    struct ScaleParams {\n        std::array<float, 3> shift;\n        std::array<float, 3> scale;\n    };\n    static std::vector<std::unique_ptr<ScaleParams>> gScaleParams;\n    auto params = std::make_unique<ScaleParams>();\n    params->shift = {-mean[0] / std[0], -mean[1] / std[1], -mean[2] / std[2]};\n    params->scale = {1.f / (std[0] * 
255.f), 1.f / (std[1] * 255.f), 1.f / (std[2] * 255.f)};\n\n    static const Weights empty{DataType::kFLOAT, nullptr, 0ll};\n    const Weights shift{DataType::kFLOAT, params->shift.data(), 3ll};\n    const Weights scale{DataType::kFLOAT, params->scale.data(), 3ll};\n\n    gScaleParams.emplace_back(std::move(params));\n\n    ITensor* in = &input;\n    if (input.getType() != DataType::kFLOAT) {\n#if TRT_VERSION >= 8000\n        auto* cast = network->addCast(input, DataType::kFLOAT);\n        assert(cast);\n        cast->setName(\"Cast to FP32\");\n        in = cast->getOutput(0);\n#else\n        auto* identity = network->addIdentity(input);\n        assert(identity);\n        identity->setName(\"Convert to FP32\");\n        identity->setOutputType(0, DataType::kFLOAT);\n        in = identity->getOutput(0);\n#endif\n    }\n\n    // Convert from NHWC to NCHW\n    auto* perm = network->addShuffle(*in);\n    assert(perm);\n    perm->setName(\"NHWC -> NCHW\");\n    perm->setFirstTranspose(Permutation{0, 3, 1, 2});\n\n    // Convert from BGR to RGB (optional)\n    ITensor* data{nullptr};\n    if (bgr2rgb) {\n        auto add_slice = [&](int c, const char* name) -> ITensor* {\n            auto dims = perm->getOutput(0)->getDimensions();\n            Dims4 start = {0, c, 0, 0}, stride = {1, 1, 1, 1};\n            Dims4 size = {dims.d[0], 1, dims.d[2], dims.d[3]};\n            auto* _slice = network->addSlice(*perm->getOutput(0), start, size, stride);\n            assert(_slice && _slice->getNbOutputs() == 1);\n            _slice->setName(name);\n            return _slice->getOutput(0);\n        };\n        ITensor* channels[] = {add_slice(2, \"R\"), add_slice(1, \"G\"), add_slice(0, \"B\")};\n        auto* cat = network->addConcatenation(channels, 3);\n        assert(cat);\n        cat->setName(\"RGB\");\n        cat->setAxis(1);\n        data = cat->getOutput(0);\n    } else {\n        data = perm->getOutput(0);\n 
   }\n\n    // Normalize\n    auto* trans = network->addScale(*data, ScaleMode::kCHANNEL, shift, scale, empty);\n    assert(trans);\n    trans->setName(\"mean & std\");\n#if TRT_VERSION >= 8000\n    trans->setChannelAxis(1);\n#endif\n    return trans;\n}\n\nstatic size_t getSize(DataType dt) {\n    switch (dt) {\n#if TRT_VERSION >= 8510\n        case DataType::kUINT8:\n#endif\n        case DataType::kINT8:\n            return sizeof(int8_t);\n        case DataType::kFLOAT:\n            return sizeof(float);\n        case DataType::kHALF:\n            return sizeof(int16_t);\n        case DataType::kINT32:\n            return sizeof(int32_t);\n        default:\n            throw std::runtime_error(\"Unsupported data type\");\n    }\n}\n"
  },
  {
    "path": "superpoint/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(SuperPointNet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -pthread -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(supernet ${PROJECT_SOURCE_DIR}/supernet.cpp ${PROJECT_SOURCE_DIR}/utils.cpp)\ntarget_link_libraries(supernet nvinfer)\ntarget_link_libraries(supernet cudart)\ntarget_link_libraries(supernet ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)"
  },
  {
    "path": "superpoint/README.md",
    "content": "# SuperPoint\n\nThe PyTorch implementation is from [magicleap/SuperPointPretrainedNetwork.](https://github.com/magicleap/SuperPointPretrainedNetwork)\n\nThe pretrained models are from [magicleap/SuperPointPretrainedNetwork.](https://github.com/magicleap/SuperPointPretrainedNetwork)\n\n\n## Config\n\n- FP16/FP32 can be selected by the macro `USE_FP16` in supernet.cpp\n- GPU id and batch size can be selected by the macro `DEVICE` & `BATCH_SIZE` in supernet.cpp\n\n\n## How to Run\n1.Generate .wts file from the baseline pytorch implementation of pretrained model. The following example described how to generate superpoint_v1.wts from pytorch implementation of superpoint_v1. \n```\ngit clone https://github.com/xiang-wuu/SuperPointPretrainedNetwork\ncd SuperPointPretrainedNetwork\ngit checkout deploy\n// copy tensorrtx/superpoint/gen_wts.py to here(SuperPointPretrainedNetwork)\npython gen_wts.py\n// a file 'superpoint_v1.wts' will be generated.\n// before running gen_wts.py python script make sure you cloned private fork and checkout to deploy branch.\n```\n\n2.Put .wts file into tensorrtx/superpoint, build and run\n```\ncd tensorrtx/superpoint\nmkdir build\ncd build\ncmake ..\nmake\n./supernet -s SuperPointPretrainedNetwork/superpoint_v1.wts    // serialize model to plan file i.e. 
'supernet.engine'\n```\n\n## Run Demo using SuperPointPretrainedNetwork Python Script\nThe live demo can be run either with the TensorRT-generated engine file or with the pre-trained PyTorch weight file. The `demo_superpoint.py` script is modified to pick TensorRT or PyTorch inference automatically, based on the weight file provided as input.\n```\ncd SuperPointPretrainedNetwork\npython demo_superpoint.py assets/nyu_snippet.mp4 --cuda --weights_path tensorrtx/superpoint/build/supernet.engine\n// provide the absolute path to supernet.engine as the input weight file\npython demo_superpoint.py assets/nyu_snippet.mp4 --cuda --weights_path superpoint_v1.pth\n// run the command above to infer with the PyTorch pre-trained weight file instead of the TensorRT engine file.\n```\n\n## Output\nAs the results below show, there is no significant difference between the PyTorch and TensorRT outputs.\n<table>\n<th>\nPyTorch\n</th>\n<th>\nTensorRT\n</th>\n<tr>\n<td>\n<img src=\"https://user-images.githubusercontent.com/107029401/177322379-2782ca66-bcac-4cf6-b6d3-e1b4d4a8e171.gif\"/>\n</td>\n<td>\n<img src=\"https://user-images.githubusercontent.com/107029401/177322387-c945b903-f233-4a43-bfd3-530c46f4f4db.gif\"/>\n</td>\n</tr>\n</table>\n\n## TODO\n- [ ] Optimize post-processing using a custom TensorRT layer.\n- [ ] Benchmark validation of the speed/accuracy tradeoff with the [hpatches](https://github.com/hpatches/hpatches-benchmark) dataset\n"
  },
  {
    "path": "superpoint/gen_wts.py",
    "content": "import torch\nimport struct\nfrom model import SuperPointNet\n\nmodel_name = \"superpoint_v1\"\n\nnet = SuperPointNet()\nnet.load_state_dict(torch.load(\"superpoint_v1.pth\"))\nnet = net.cuda()\nnet.eval()\n\nf = open(model_name + \".wts\", \"w\")\nf.write(\"{}\\n\".format(len(net.state_dict().keys())))\nfor k, v in net.state_dict().items():\n    vr = v.reshape(-1).cpu().numpy()\n    f.write(\"{} {}\".format(k, len(vr)))\n    for vv in vr:\n        f.write(\" \")\n        f.write(struct.pack(\">f\", float(vv)).hex())\n    f.write(\"\\n\")"
  },
  {
    "path": "superpoint/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream &stream, const std::string &prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer &&other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and 
flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm *tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream &mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream &stream, const std::string &prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity), std::ostream(&mBuffer) // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity), mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer &&other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog), std::ostream(&mBuffer) // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog), mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream &severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR:\n            return \"[F] \";\n        case Severity::kERROR:\n            return \"[E] \";\n        case Severity::kWARNING:\n            return \"[W] \";\n        case Severity::kINFO:\n            return \"[I] \";\n        case Severity::kVERBOSE:\n            return \"[V] \";\n        default:\n            assert(0);\n            return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger &getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char *msg) noexcept override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! 
This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom &&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string &name, const std::string &cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string &name, const std::string &cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string &name, int argc, char const *const *argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! 
\\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom &testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom &testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom &testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom &testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom &testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom &testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char *severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR:\n            return \"[F] \";\n        case Severity::kERROR:\n            return \"[E] \";\n        case Severity::kWARNING:\n            return \"[W] \";\n        case Severity::kINFO:\n            return \"[I] \";\n        case Severity::kVERBOSE:\n            return \"[V] \";\n        default:\n            assert(0);\n            return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char *testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING:\n            return \"RUNNING\";\n        case TestResult::kPASSED:\n            return \"PASSED\";\n        case TestResult::kFAILED:\n            return \"FAILED\";\n        case TestResult::kWAIVED:\n            return \"WAIVED\";\n        default:\n            assert(0);\n            return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream &severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom &testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const *const *argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n    //!\n    //! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n    //!\n    //! Example usage:\n    //!\n    //!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n    //!\n    inline LogStreamConsumer LOG_VERBOSE(const Logger &logger)\n    {\n        return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n    }\n\n    //!\n    //! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n    //!\n    //! Example usage:\n    //!\n    //!     LOG_INFO(logger) << \"hello world\" << std::endl;\n    //!\n    inline LogStreamConsumer LOG_INFO(const Logger &logger)\n    {\n        return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n    }\n\n    //!\n    //! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n    //!\n    //! Example usage:\n    //!\n    //!     LOG_WARN(logger) << \"hello world\" << std::endl;\n    //!\n    inline LogStreamConsumer LOG_WARN(const Logger &logger)\n    {\n        return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n    }\n\n    //!\n    //! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n    //!\n    //! Example usage:\n    //!\n    //!     
LOG_ERROR(logger) << \"hello world\" << std::endl;\n    //!\n    inline LogStreamConsumer LOG_ERROR(const Logger &logger)\n    {\n        return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n    }\n\n    //!\n    //! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n    //!        (\"fatal\" severity)\n    //!\n    //! Example usage:\n    //!\n    //!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n    //!\n    inline LogStreamConsumer LOG_FATAL(const Logger &logger)\n    {\n        return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n    }\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "superpoint/supernet.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <opencv2/opencv.hpp>\n#include <dirent.h>\n#include \"NvInfer.h\"\n#include \"utils.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n\n//#define USE_FP16  // comment out this if want to use FP32\n#define DEVICE 0     // GPU id\n#define BATCH_SIZE 1 // currently, only support BATCH=1\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 120;\nstatic const int INPUT_W = 160;\nconst char *INPUT_BLOB_NAME = \"data\";\nconst char *OUTPUT_BLOB_NAME_1 = \"semi\";\nconst char *OUTPUT_BLOB_NAME_2 = \"desc\";\n\nstatic Logger gLogger;\n\n// create the engine using only the API and not any parser.\nICudaEngine *createEngine(IBuilder *builder, IBuilderConfig *config, std::string path, DataType dt)\n{\n    INetworkDefinition *network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor *data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(path);\n\n    IConvolutionLayer *conv1a = network->addConvolutionNd(*data, 64, DimsHW{3, 3}, weightMap[\"conv1a.weight\"], weightMap[\"conv1a.bias\"]);\n    assert(conv1a);\n    conv1a->setStrideNd(DimsHW{1, 1});\n    conv1a->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer *relu1 = network->addActivation(*conv1a->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer *conv1b = network->addConvolutionNd(*relu1->getOutput(0), 64, DimsHW{3, 3}, weightMap[\"conv1b.weight\"], weightMap[\"conv1b.bias\"]);\n    assert(conv1b);\n    conv1b->setStrideNd(DimsHW{1, 1});\n    conv1b->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer *relu2 = network->addActivation(*conv1b->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    IPoolingLayer *pool1 
= network->addPoolingNd(*relu2->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    IConvolutionLayer *conv2a = network->addConvolutionNd(*pool1->getOutput(0), 64, DimsHW{3, 3}, weightMap[\"conv2a.weight\"], weightMap[\"conv2a.bias\"]);\n    assert(conv2a);\n    conv2a->setStrideNd(DimsHW{1, 1});\n    conv2a->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer *relu3 = network->addActivation(*conv2a->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n\n    IConvolutionLayer *conv2b = network->addConvolutionNd(*relu3->getOutput(0), 64, DimsHW{3, 3}, weightMap[\"conv2b.weight\"], weightMap[\"conv2b.bias\"]);\n    assert(conv2b);\n    conv2b->setStrideNd(DimsHW{1, 1});\n    conv2b->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer *relu4 = network->addActivation(*conv2b->getOutput(0), ActivationType::kRELU);\n    assert(relu4);\n\n    IPoolingLayer *pool2 = network->addPoolingNd(*relu4->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool2);\n    pool2->setStrideNd(DimsHW{2, 2});\n\n    IConvolutionLayer *conv3a = network->addConvolutionNd(*pool2->getOutput(0), 128, DimsHW{3, 3}, weightMap[\"conv3a.weight\"], weightMap[\"conv3a.bias\"]);\n    assert(conv3a);\n    conv3a->setStrideNd(DimsHW{1, 1});\n    conv3a->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer *relu44 = network->addActivation(*conv3a->getOutput(0), ActivationType::kRELU);\n    assert(relu44);\n\n    IConvolutionLayer *conv3b = network->addConvolutionNd(*relu44->getOutput(0), 128, DimsHW{3, 3}, weightMap[\"conv3b.weight\"], weightMap[\"conv3b.bias\"]);\n    assert(conv3b);\n    conv3b->setStrideNd(DimsHW{1, 1});\n    conv3b->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer *relu5 = network->addActivation(*conv3b->getOutput(0), ActivationType::kRELU);\n    assert(relu5);\n\n    IPoolingLayer *pool3 = network->addPoolingNd(*relu5->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool3);\n    
pool3->setStrideNd(DimsHW{2, 2});\n\n    IConvolutionLayer *conv4a = network->addConvolutionNd(*pool3->getOutput(0), 128, DimsHW{3, 3}, weightMap[\"conv4a.weight\"], weightMap[\"conv4a.bias\"]);\n    assert(conv4a);\n    conv4a->setStrideNd(DimsHW{1, 1});\n    conv4a->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer *relu6 = network->addActivation(*conv4a->getOutput(0), ActivationType::kRELU);\n    assert(relu6);\n\n    IConvolutionLayer *conv4b = network->addConvolutionNd(*relu6->getOutput(0), 128, DimsHW{3, 3}, weightMap[\"conv4b.weight\"], weightMap[\"conv4b.bias\"]);\n    assert(conv4b);\n    conv4b->setStrideNd(DimsHW{1, 1});\n    conv4b->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer *relu7 = network->addActivation(*conv4b->getOutput(0), ActivationType::kRELU);\n    assert(relu7);\n\n    IConvolutionLayer *convPa = network->addConvolutionNd(*relu7->getOutput(0), 256, DimsHW{3, 3}, weightMap[\"convPa.weight\"], weightMap[\"convPa.bias\"]);\n    assert(convPa);\n    convPa->setStrideNd(DimsHW{1, 1});\n    convPa->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer *relu8 = network->addActivation(*convPa->getOutput(0), ActivationType::kRELU);\n    assert(relu8);\n\n    IConvolutionLayer *convPb = network->addConvolutionNd(*relu8->getOutput(0), 65, DimsHW{1, 1}, weightMap[\"convPb.weight\"], weightMap[\"convPb.bias\"]);\n    assert(convPb);\n    convPb->setStrideNd(DimsHW{1, 1});\n\n    IConvolutionLayer *convDa = network->addConvolutionNd(*relu7->getOutput(0), 256, DimsHW{3, 3}, weightMap[\"convDa.weight\"], weightMap[\"convDa.bias\"]);\n    assert(convDa);\n    convDa->setStrideNd(DimsHW{1, 1});\n    convDa->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer *relu9 = network->addActivation(*convDa->getOutput(0), ActivationType::kRELU);\n    assert(relu9);\n\n    IConvolutionLayer *convDb = network->addConvolutionNd(*relu9->getOutput(0), 256, DimsHW{1, 1}, weightMap[\"convDb.weight\"], weightMap[\"convDb.bias\"]);\n    assert(convDb);\n    
convDb->setStrideNd(DimsHW{1, 1});\n\n    convPb->getOutput(0)->setName(OUTPUT_BLOB_NAME_1);\n    std::cout << \"set name out1\" << std::endl;\n    network->markOutput(*convPb->getOutput(0));\n\n    convDb->getOutput(0)->setName(OUTPUT_BLOB_NAME_2);\n    std::cout << \"set name out2\" << std::endl;\n    network->markOutput(*convDb->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(BATCH_SIZE);\n    config->setMaxWorkspaceSize(1 << 20);\n\n#ifdef USE_FP16\n    config->setFlag(BuilderFlag::kFP16);\n#endif\n\n    ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto &mem : weightMap)\n    {\n        free((void *)(mem.second.values));\n    }\n\n    return engine;\n}\n\n// Create the engine using only the API and not any parser.\n\nvoid APIToModel(std::string path, IHostMemory **modelStream)\n{\n    // Create builder\n    IBuilder *builder = createInferBuilder(gLogger);\n    IBuilderConfig *config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine *engine = createEngine(builder, config, path, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nint main(int argc, char **argv)\n{\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (argc == 3 && std::string(argv[1]) == \"-s\")\n    {\n        IHostMemory *modelStream{nullptr};\n        APIToModel(std::string(argv[2]), &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(\"supernet.engine\", std::ios::binary);\n        if (!p)\n        {\n       
     std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char *>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    }\n    else\n    {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./supernet -s <path_to_.wts_file>  // serialize model to plan file\" << std::endl;\n        return -1;\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "superpoint/utils.cpp",
    "content": "#include \"utils.h\"\n#include <dirent.h>\n#include <string.h>\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t *val = reinterpret_cast<uint32_t *>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names)\n{\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr)\n    {\n        return -1;\n    }\n\n    struct dirent *p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr)\n    {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0)\n        {\n            // std::string cur_file_name(p_dir_name);\n            // cur_file_name += \"/\";\n            // cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\nvoid 
tokenize(const std::string &str, std::vector<std::string> &tokens, const std::string &delimiters)\n{\n    // Skip delimiters at beginning.\n    std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);\n\n    // Find the first delimiter after the token start.\n    std::string::size_type pos = str.find_first_of(delimiters, lastPos);\n\n    while (std::string::npos != pos || std::string::npos != lastPos)\n    {\n        // Found a token, add it to the vector.\n        tokens.push_back(str.substr(lastPos, pos - lastPos));\n\n        // Skip delimiters.\n        lastPos = str.find_first_not_of(delimiters, pos);\n\n        // Find the next delimiter.\n        pos = str.find_first_of(delimiters, lastPos);\n    }\n}\n"
  },
  {
    "path": "superpoint/utils.h",
    "content": "#pragma once\n\n#include <map>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"assert.h\"\n#include <fstream>\n#include <iostream>\n#include <memory>\n#include <vector>\n#include <opencv2/opencv.hpp>\n\n\nusing namespace nvinfer1;\n\n#define CHECK(status)                             \\\n    do                                            \\\n    {                                             \\\n        auto ret = (status);                      \\\n        if (ret != 0)                             \\\n        {                                         \\\n            std::cout << \"Cuda failure: \" << ret; \\\n            abort();                              \\\n        }                                         \\\n    } while (0)\n\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names);\nstd::map<std::string, Weights> loadWeights(const std::string file);\nvoid tokenize(const std::string &str, std::vector<std::string> &tokens, const std::string &delimiters = \",\");"
  },
  {
    "path": "swin-transformer/semantic-segmentation/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.4)\r\n\r\nproject(swintransformer)\r\n\r\nset(OpenCV_DIR \"D:\\\\opencv\\\\opencv346\\\\build\")\r\nset(TENSORRT_DIR \"D:\\\\TensorRT-7.0.0.11.Windows10.x86_64.cuda-10.2.cudnn7.6\\\\TensorRT-7.0.0.11\")\r\n\r\nadd_definitions(-std=c++11)\r\nadd_definitions(-DAPI_EXPORTS)\r\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\r\nset(CMAKE_CXX_STANDARD 11)\r\nset(CMAKE_BUILD_TYPE Debug)\r\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -D_MWAITXINTRIN_H_INCLUDED\")\r\nif(WIN32)\r\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\r\nendif(WIN32)\r\n\r\n\r\nfind_package(CUDA REQUIRED)\r\nmessage(STATUS \"    libraries: ${CUDA_LIBRARIES}\")\r\nmessage(STATUS \"    include path: ${CUDA_INCLUDE_DIRS}\")\r\ninclude_directories(${CUDA_INCLUDE_DIRS})\r\nset(CUDA_NVCC_PLAGS ${CUDA_NVCC_PLAGS};-std=c++11; -g; -G;-gencode; arch=compute_75;code=sm_75)\r\nenable_language(CUDA)  # һӺ ͻvsвҪֶcuda \r\ninclude_directories(${TENSORRT_DIR}\\\\include)\r\nlink_directories(${TENSORRT_DIR}\\\\lib)\r\n\r\n# file(GLOB SOURCE_FILES \"*.cu\")\r\n# cuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/yololayer.cu ${PROJECT_SOURCE_DIR}/API.h)\r\n# target_link_libraries(myplugins nvinfer cudart)\r\n\r\n# opencvϢ\r\nfind_package(OpenCV QUIET\r\n    NO_MODULE\r\n    NO_DEFAULT_PATH\r\n    NO_CMAKE_PATH\r\n    NO_CMAKE_ENVIRONMENT_PATH\r\n    NO_SYSTEM_ENVIRONMENT_PATH\r\n    NO_CMAKE_PACKAGE_REGISTRY\r\n    NO_CMAKE_BUILDS_PATH\r\n    NO_CMAKE_SYSTEM_PATH\r\n    NO_CMAKE_SYSTEM_PACKAGE_REGISTRY\r\n)\r\n\r\nmessage(STATUS \"OpenCV library status:\")\r\nmessage(STATUS \"    version: ${OpenCV_VERSION}\")\r\nmessage(STATUS \"    libraries: ${OpenCV_LIBS}\")\r\nmessage(STATUS \"    include path: ${OpenCV_INCLUDE_DIRS}\")\r\n\r\ninclude_directories(${OpenCV_INCLUDE_DIRS})\r\n\r\n\r\nfile(GLOB SOURCE_FILES \"*.h\" \"*.cpp\" \"*.cu\")\r\nadd_executable(swintransformer ${SOURCE_FILES})\r\n\r\ntarget_link_libraries(swintransformer nvinfer 
nvonnxparser)\r\ntarget_link_libraries(swintransformer cudart)\r\ntarget_link_libraries(swintransformer ${OpenCV_LIBS})\r\n\r\n# if (WIN32)\r\n    # message(STATUS \"copy dll......: ${CMAKE_COMMAND} ${TENSORRT_DIR}\")\r\n    # add_custom_command(TARGET swintransformer POST_BUILD\r\n        # COMMAND ${CMAKE_COMMAND} -E copy_if_different ${TENSORRT_DIR}/lib/myelin64_1.dll ./${CMAKE_BUILD_TYPE}/myelin64_1.dll\r\n        # COMMAND ${CMAKE_COMMAND} -E copy_if_different ${TENSORRT_DIR}/lib/nvinfer.dll ./${CMAKE_BUILD_TYPE}/nvinfer.dll\r\n        # COMMAND ${CMAKE_COMMAND} -E copy_if_different ${TENSORRT_DIR}/lib/nvinfer_plugin.dll ./${CMAKE_BUILD_TYPE}/nvinfer_plugin.dll\r\n        # COMMAND ${CMAKE_COMMAND} -E copy_if_different ${TENSORRT_DIR}/lib/nvonnxparser.dll ./${CMAKE_BUILD_TYPE}/nvonnxparser.dll\r\n        # COMMAND ${CMAKE_COMMAND} -E copy_if_different ${TENSORRT_DIR}/lib/nvparsers.dll ./${CMAKE_BUILD_TYPE}/nvparsers.dll\r\n        # COMMAND ${CMAKE_COMMAND} -E copy_if_different ${TENSORRT_DIR}/lib/nvserialize.dll ./${CMAKE_BUILD_TYPE}/nvserialize.dll\r\n        # COMMAND ${CMAKE_COMMAND} -E copy_if_different ${CUDA_TOOLKIT_ROOT_DIR}/bin/cublas64_10.dll ./${CMAKE_BUILD_TYPE}/cublas64_10.dll\r\n    # )\r\n# endif(WIN32)\r\n\r\nif(UNIX)\r\nadd_definitions(-O2 -pthread)\r\nendif(UNIX)"
  },
  {
    "path": "swin-transformer/semantic-segmentation/README.md",
    "content": "# Swin Transform - Semantic Segmentation\r\n\r\nThe Pytorch implementation is [microsoft/Swin-Transformer](https://github.com/microsoft/Swin-Transformer.git).\r\n\r\nOnly support Swin-T, welcome the PR for other backbones.\r\n\r\n## Authors\r\n\r\n<a href=\"https://github.com/wdhao\"><img src=\"https://avatars.githubusercontent.com/u/58798355?v=4?s=48\" width=\"40px;\" alt=\"\"/></a> \r\n<a href=\"https://github.com/wang-xinyu\"><img src=\"https://avatars.githubusercontent.com/u/15235574?s=48&v=4\" width=\"40px;\" alt=\"\"/></a> \r\n\r\n## How to Run\r\n\r\n1. generate .wts from pytorch with .pt, or download .wts from model zoo\r\n\r\n```\r\ngit clone https://github.com/microsoft/Swin-Transformer.git\r\ngit clone https://github.com/wang-xinyu/tensorrtx.git\r\n\r\npython gen_wts.py Swin-Transform.pt\r\n// a file 'Swin-Transform.wts' will be generated.\r\n```\r\n\r\n2. build tensorrtx/swin-transform and run\r\n\r\n```\r\ncd {tensorrtx}/swin-transform/semantic-segmentation/\r\nmkdir build\r\ncd build\r\ncp {microsoft}/Swin-Transformer/Swin-Transform.wts {tensorrtx}/swin-transformer/semantic-segmentation/build\r\ncmake ..\r\nmake\r\nsudo ./swintransformer -s [.wts] [.engine]   // serialize model to plan file\r\nsudo ./swintransformer -d [.engine] [image folder]  // deserialize and run inference, the images in [image folder] will be processed.\r\n\r\n```\r\n\r\n## More Information\r\n\r\nSee the readme in [home page.](https://github.com/wang-xinyu/tensorrtx)\r\n\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/UpsampleKernel.cu",
    "content": "#include \"UpsmapleKernel.h\"\r\n\r\n\r\n/**\r\n * @brief caculate the number of cuda kernel for upsample. (Cite from: 《GPU高性能编程CUDA实战》P46,P47)\r\n * \r\n * @param total_thread_num: the number of cuda thread of you want to used for upsample\r\n * @param max_thread_num: the gpu device property\r\n * @return int  the number of cuda kernel for upsample\r\n */\r\nint get_kernel_num(int total_thread_num, int max_thread_num)\r\n{\r\n    return (total_thread_num + max_thread_num - 1)/max_thread_num;\r\n}\r\n\r\nint get_max_thread_num()\r\n{\r\n    cudaDeviceProp prop;\r\n    cudaGetDeviceProperties(&prop, 0);\r\n    return prop.maxThreadsPerBlock;\r\n}\r\n\r\n__host__ __forceinline__ float linear_upsampling_compute_scale(int input_size, int output_size)\r\n{\r\n    return float(input_size)/float(output_size) ;\r\n}\r\n\r\n__device__ __forceinline__ float linear_upsampling_compute_source_index(float scale, int dst_index, int intput_size)\r\n{\r\n    float src_idx = scale * (dst_index + 0.5)-0.5;\r\n    return (src_idx>=0) ? src_idx : 0;\r\n}\r\n\r\n\r\n__device__ __forceinline__ int get_index(const int batch_idx, const int channel_idx, const int height_idx, const int width_idx, \r\n                const int batch_total, const int channel_total, const int width)\r\n{\r\n    int ret_idx = batch_idx * batch_total\r\n                    + channel_idx * channel_total\r\n                    + height_idx * width\r\n                    + width_idx;\r\n    return ret_idx;\r\n}\r\n\r\n/**\r\n * @brief \r\n * \r\n * @tparam T \r\n * @param n \r\n * @param input_shape: input data shape. 
such as [batch, channel, height, width] \r\n * @param rate_h \r\n * @param rate_w \r\n * @param inputs \r\n * @param outputs \r\n * @return __global__ BilinearKernel \r\n * @TODO: \r\n *  \r\n */\r\n\r\n\r\ntemplate <typename T>\r\n__global__ void BilinearKernel(\r\n        const int n,\r\n        int input_b,\r\n        int input_c,\r\n        int input_h,\r\n        int input_w,\r\n        int output_h,\r\n        int output_w,\r\n        const float rate_h,\r\n        const float rate_w,\r\n        const T* inputs,\r\n        T* outputs)\r\n{\r\n\r\n    int index = threadIdx.x + blockIdx.x * blockDim.x;\r\n    if(index < n)\r\n    {\r\n        const int w2 = index % output_w;\r\n        const int h2 = index / output_w;\r\n\r\n\r\n        const float h1r = linear_upsampling_compute_source_index(rate_h, h2, input_h);\r\n        const int h1 = int(h1r);\r\n        const int h1p = (h1 < input_h - 1) ? 1 : 0;\r\n        const float h1lambda = h1r - h1;\r\n        const float h0lambda = 1 - h1lambda;\r\n\r\n        const float w1r = linear_upsampling_compute_source_index(rate_w, w2, input_w);\r\n        const int w1 = int(w1r);\r\n        const int w1p = (w1 < input_w - 1) ? 
1 : 0;\r\n        const float w1lambda = w1r - w1;\r\n        const float w0lambda = 1 - w1lambda;\r\n\r\n        int s_batch_total_1 = input_c * input_h * input_w;\r\n        int s_channel_total_1 = input_h * input_w;\r\n\r\n        int s_batch_total_2 = input_c * output_h * output_w;\r\n        int s_channel_total_2 = output_h * output_w;\r\n\r\n\r\n        const int batch_size = input_b;\r\n        const int channel_size = input_c;\r\n\r\n        for(int b_idx=0; b_idx<batch_size; b_idx++)\r\n        {\r\n            for(int c=0; c<channel_size; c++)\r\n            {\r\n                const T val = h0lambda * (w0lambda * inputs[get_index(b_idx, c, h1, w1, s_batch_total_1, s_channel_total_1, input_w)]\r\n                                    + w1lambda * inputs[get_index(b_idx, c, h1, w1+w1p, s_batch_total_1, s_channel_total_1, input_w)])\r\n                                    + h1lambda * (w0lambda * inputs[get_index(b_idx, c, h1+h1p, w1, s_batch_total_1, s_channel_total_1, input_w)]\r\n                                    + w1lambda * inputs[get_index(b_idx, c, h1+h1p, w1+w1p, s_batch_total_1, s_channel_total_1, input_w)]);\r\n                outputs[get_index(b_idx, c, h2, w2, s_batch_total_2, s_channel_total_2, output_w)] = val;\r\n                \r\n            }\r\n        }\r\n    }\r\n}\r\n\r\n\r\nint UpsampleInference(\r\n    cudaStream_t stream,\r\n    int n,\r\n    int input_b,\r\n    int input_c,\r\n    int input_h,\r\n    int input_w,\r\n    float scale_h,\r\n    float scale_w,\r\n    const void* inputs,\r\n    void* outputs)\r\n{\r\n    int output_h = int(input_h * scale_h);\r\n    int output_w = int(input_w * scale_w);\r\n    int max_threads = get_max_thread_num();\r\n    int kernel_num = get_kernel_num(n, max_threads);\r\n    float rate_h = linear_upsampling_compute_scale(input_h, output_h);\r\n    float rate_w = linear_upsampling_compute_scale(input_w, output_w);\r\n\r\n    BilinearKernel<float><<< kernel_num, max_threads, 0, 
stream>>>(n,input_b,input_c,input_h,input_w,\r\n                                                                                    output_h, output_w, \r\n                                                                                    rate_h, rate_w,\r\n                                                                                    static_cast<const float*>(inputs),\r\n                                                                                    static_cast<float*>(outputs));\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/UpsamplePlugin.cpp",
    "content": "#include <iostream>\r\n#include \"UpsmapleKernel.h\"\r\n#include \"UpsamplePlugin.h\"\r\n\r\n#include <cassert>\r\n#include <cstring>\r\n\r\nusing namespace nvinfer1;\r\n\r\n// Upsample plugin specific constants\r\nnamespace {\r\n    static const char* UPSAMPLE_PLUGIN_VERSION{\"1\"};\r\n    static const char* UPSAMPLE_PLUGIN_NAME{\"UpsamplePlugin\"};\r\n}\r\n\r\n// Static class fields initialization\r\nPluginFieldCollection UpsamplePluginCreator::mFC{};\r\nstd::vector<PluginField> UpsamplePluginCreator::mPluginAttributes;\r\n\r\nREGISTER_TENSORRT_PLUGIN(UpsamplePluginCreator);\r\n\r\ntemplate<typename T>\r\nvoid writeToBuffer(char*& buffer, const T& val)\r\n{\r\n    *reinterpret_cast<T*>(buffer) = val;\r\n    buffer += sizeof(T);\r\n}\r\n\r\n// Helper function for deserializing plugin\r\ntemplate<typename T>\r\nT readFromBuffer(const char*& buffer)\r\n{\r\n    T val = *reinterpret_cast<const T*>(buffer);\r\n    buffer += sizeof(T);\r\n    return val;\r\n}\r\n\r\nUpsamplePlugin::UpsamplePlugin(const std::string name, float scale_h, float scale_w)\r\n    : mLayerName(name)\r\n    , mScaleFactor_h(scale_h)\r\n    , mScaleFactor_w(scale_w)\r\n{\r\n    mInputShape.c() = -1;\r\n    mInputShape.h() = -1;\r\n    mInputShape.w() = -1;\r\n    mInputVolume = 0;\r\n}\r\n\r\nUpsamplePlugin::UpsamplePlugin(const std::string name, const void* data, size_t length)\r\n    : mLayerName(name)\r\n{\r\n    const char *d = static_cast<const char *>(data);\r\n    const char *a = d;\r\n\r\n    mScaleFactor_h = readFromBuffer<float>(d);\r\n    mScaleFactor_w = readFromBuffer<float>(d);\r\n    mInputVolume = readFromBuffer<size_t>(d);\r\n    mInputShape.c() = readFromBuffer<int>(d);\r\n    mInputShape.h() = readFromBuffer<int>(d);\r\n    mInputShape.w() = readFromBuffer<int>(d);\r\n\r\n    assert(d == (a + length));\r\n\r\n}\r\n\r\nconst char* UpsamplePlugin::getPluginType() const\r\n{\r\n    return UPSAMPLE_PLUGIN_NAME;\r\n}\r\n\r\nconst char* 
UpsamplePlugin::getPluginVersion() const\r\n{\r\n    return UPSAMPLE_PLUGIN_VERSION;\r\n}\r\n\r\nint UpsamplePlugin::getNbOutputs() const\r\n{\r\n    return 1;\r\n}\r\n\r\nDims UpsamplePlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims)\r\n{\r\n    assert(index == 0);\r\n    assert(nbInputDims == 1);\r\n    assert(inputs[0].nbDims == 3);\r\n    return nvinfer1::DimsCHW{inputs[0].d[0],int(inputs[0].d[1]*mScaleFactor_h), int(inputs[0].d[2]*mScaleFactor_w)};\r\n}\r\n\r\nint UpsamplePlugin::initialize()\r\n{\r\n    //printf(\"UpsamplePlugin::initialize\\n\");\r\n    return 0;\r\n}\r\n\r\n\r\nint UpsamplePlugin::enqueue(int batchSize, const void* const* inputs, void** outputs, void*, cudaStream_t stream)\r\n{\r\n    //printf(\"UpsamplePlugin::enqueue\\n\");\r\n    int status = -1;\r\n\r\n    // Our plugin outputs only one tensor\r\n    void* output = outputs[0];\r\n\r\n    // Launch CUDA kernel wrapper and save its return value\r\n    status = UpsampleInference(stream, mInputVolume, \r\n                                batchSize, mInputShape.c(), mInputShape.h(), mInputShape.w(),\r\n                                mScaleFactor_h,mScaleFactor_w,\r\n                                inputs[0], output);\r\n    return status;\r\n}\r\n\r\nsize_t UpsamplePlugin::getSerializationSize() const\r\n{\r\n    //printf(\"UpsamplePlugin::getSerializationSize\\n\");\r\n    return sizeof(mScaleFactor_h)  + sizeof(mScaleFactor_w) +\r\n            sizeof(mInputVolume) + sizeof(mInputShape.c()) + \r\n            sizeof(mInputShape.h()) + sizeof(mInputShape.w());\r\n}\r\n\r\n\r\nvoid UpsamplePlugin::serialize(void* buffer) const \r\n{\r\n    //printf(\"UpsamplePlugin::serialize\\n\");\r\n    char *d = static_cast<char *>(buffer);\r\n    const char *a = d;\r\n\r\n    writeToBuffer(d, mScaleFactor_h);\r\n    writeToBuffer(d, mScaleFactor_w);\r\n    writeToBuffer(d, mInputVolume);\r\n    writeToBuffer(d, mInputShape.c());\r\n    writeToBuffer(d, mInputShape.h());\r\n    
writeToBuffer(d, mInputShape.w());\r\n\r\n    assert(d == a + getSerializationSize());\r\n}\r\n\r\nvoid UpsamplePlugin::configureWithFormat(const Dims* inputs, int nbInputs, const Dims* outputs, int nbOutputs, DataType type, PluginFormat format, int)\r\n{\r\n    assert(nbOutputs == 1);\r\n    assert(type == DataType::kFLOAT);\r\n    assert(format == PluginFormat::kNCHW);\r\n    assert(inputs[0].nbDims == 3);\r\n\r\n    size_t volume = int(inputs[0].d[1]*mScaleFactor_h) * int(inputs[0].d[2]*mScaleFactor_w);\r\n    mInputVolume = volume;\r\n    mInputShape.c() = inputs[0].d[0];\r\n    mInputShape.h() = inputs[0].d[1];\r\n    mInputShape.w() = inputs[0].d[2];\r\n}\r\n\r\nbool UpsamplePlugin::supportsFormat(DataType type, PluginFormat format) const\r\n{\r\n    if (type == DataType::kFLOAT && format == PluginFormat::kNCHW)\r\n        return true;\r\n    else\r\n        return false;\r\n}\r\n\r\nvoid UpsamplePlugin::terminate() {}\r\n\r\nvoid UpsamplePlugin::destroy() {\r\n    // This gets called when the network containing plugin is destroyed\r\n    delete this;\r\n}\r\n\r\nIPluginV2* UpsamplePlugin::clone() const\r\n{\r\n    // Copy the configured state as well, so a clone made after configureWithFormat() stays usable\r\n    auto* plugin = new UpsamplePlugin(mLayerName, mScaleFactor_h, mScaleFactor_w);\r\n    plugin->mInputVolume = mInputVolume;\r\n    plugin->mInputShape = mInputShape;\r\n    plugin->mNamespace = mNamespace;\r\n    return plugin;\r\n}\r\n\r\nvoid UpsamplePlugin::setPluginNamespace(const char* libNamespace) \r\n{\r\n    mNamespace = libNamespace;\r\n}\r\n\r\nconst char* UpsamplePlugin::getPluginNamespace() const\r\n{\r\n    return mNamespace.c_str();\r\n}\r\n\r\nUpsamplePluginCreator::UpsamplePluginCreator()\r\n{\r\n    mPluginAttributes.emplace_back(PluginField(\"scaleFactor\", nullptr, PluginFieldType::kFLOAT32, 2));\r\n\r\n    mFC.nbFields = mPluginAttributes.size();\r\n    mFC.fields = mPluginAttributes.data();\r\n}\r\nconst char* UpsamplePluginCreator::getPluginName() const\r\n{\r\n    return UPSAMPLE_PLUGIN_NAME;\r\n}\r\n\r\nconst char* UpsamplePluginCreator::getPluginVersion() const\r\n{\r\n    return UPSAMPLE_PLUGIN_VERSION;\r\n}\r\n\r\nconst PluginFieldCollection* 
UpsamplePluginCreator::getFieldNames()\r\n{\r\n    return &mFC;\r\n}\r\n\r\nIPluginV2* UpsamplePluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc)\r\n{\r\n    float scaleFactor_h = 0.f;\r\n    float scaleFactor_w = 0.f;\r\n    const PluginField* fields = fc->fields;\r\n\r\n    assert(fc->nbFields == 1);\r\n    for (int i = 0; i < fc->nbFields; i++){\r\n    \r\n        if (strcmp(fields[i].name, \"scaleFactor\") == 0) {\r\n            assert(fields[i].type == PluginFieldType::kFLOAT32);\r\n            scaleFactor_h = *(static_cast<const float*>(fields[i].data));\r\n            scaleFactor_w = *(static_cast<const float*>(fields[i].data)+1);\r\n            //std::cout<<scaleFactor_h<< \" , \"<<scaleFactor_w<<std::endl;\r\n        } \r\n    }\r\n    return new UpsamplePlugin(name, scaleFactor_h, scaleFactor_w);\r\n}\r\n\r\nIPluginV2* UpsamplePluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength)\r\n{\r\n    return new UpsamplePlugin(name, serialData, serialLength);\r\n}\r\n\r\nvoid UpsamplePluginCreator::setPluginNamespace(const char* libNamespace) \r\n{\r\n    mNamespace = libNamespace;\r\n}\r\n\r\nconst char* UpsamplePluginCreator::getPluginNamespace() const\r\n{\r\n    return mNamespace.c_str();\r\n}\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/UpsamplePlugin.h",
    "content": "#ifndef UPSAMPLE_PLUGIN_H\r\n#define UPSAMPLE_PLUGIN_H\r\n\r\n#include \"NvInferPlugin.h\"\r\n#include <string>\r\n#include <vector>\r\n\r\n\r\nusing namespace nvinfer1;\r\n\r\nclass UpsamplePlugin : public IPluginV2\r\n{\r\npublic:\r\n    UpsamplePlugin(const std::string name, float scale_h,float scale_w);\r\n\r\n    UpsamplePlugin(const std::string name, const void* data, size_t length);\r\n\r\n    // It doesn't make sense to make UpsamplePlugin without arguments, so we delete default constructor.\r\n    UpsamplePlugin() = delete;\r\n\r\n    int getNbOutputs() const override;\r\n\r\n    Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override;\r\n\r\n    int initialize() override;\r\n\r\n    void terminate() override;\r\n\r\n    size_t getWorkspaceSize(int) const override { return 0; };\r\n\r\n    int enqueue(int batchSize, const void* const* inputs, void** outputs, void* workspace, cudaStream_t stream) override;\r\n\r\n    size_t getSerializationSize() const override;\r\n\r\n    void serialize(void* buffer) const override;\r\n\r\n    void configureWithFormat(const Dims* inputDims, int nbInputs, const Dims* outputDims, int nbOutputs, DataType type, PluginFormat format, int maxBatchSize) override;\r\n\r\n    bool supportsFormat(DataType type, PluginFormat format) const override;\r\n\r\n    const char* getPluginType() const override;\r\n\r\n    const char* getPluginVersion() const override;\r\n\r\n    void destroy() override;\r\n\r\n    nvinfer1::IPluginV2* clone() const override;\r\n\r\n    void setPluginNamespace(const char* pluginNamespace) override;\r\n\r\n    const char* getPluginNamespace() const override;\r\n\r\nprivate:\r\n    const std::string mLayerName;\r\n    bool mAlignCorners;\r\n    float mScaleFactor_h;\r\n    float mScaleFactor_w;\r\n    size_t mInputVolume;\r\n    DimsCHW mInputShape;\r\n    std::string mNamespace;\r\n};\r\n\r\nclass UpsamplePluginCreator : public IPluginCreator\r\n{\r\npublic:\r\n    
UpsamplePluginCreator();\r\n\r\n    const char* getPluginName() const override;\r\n\r\n    const char* getPluginVersion() const override;\r\n\r\n    const PluginFieldCollection* getFieldNames() override;\r\n\r\n    IPluginV2* createPlugin(const char* name, const PluginFieldCollection* fc) override;\r\n\r\n    IPluginV2* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;\r\n    \r\n    void setPluginNamespace(const char* pluginNamespace) override;\r\n\r\n    const char* getPluginNamespace() const override;\r\n\r\nprivate:\r\n    static PluginFieldCollection mFC;\r\n    static std::vector<PluginField> mPluginAttributes;\r\n    std::string mNamespace;\r\n};\r\n\r\n#endif\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/UpsmapleKernel.h",
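    "content": "#ifndef UPSAMPLE_KERNEL_H\r\n#define UPSAMPLE_KERNEL_H\r\n\r\n#include <iostream>\r\n#include \"NvInfer.h\"\r\n\r\n// Host-side wrapper that launches the upsample CUDA kernel on the given stream.\r\n// n is the element count passed by UpsamplePlugin::enqueue (mInputVolume, i.e. the\r\n// scaled H * scaled W computed in configureWithFormat); input_b/c/h/w describe the\r\n// input tensor and scale_h/scale_w are the height and width scale factors.\r\n// The returned status is what UpsamplePlugin::enqueue hands back to TensorRT.\r\nint UpsampleInference(\r\n    cudaStream_t stream,\r\n    int n,\r\n    int input_b,\r\n    int input_c,\r\n    int input_h,\r\n    int input_w,\r\n    float scale_h,\r\n    float scale_w,\r\n    const void* inputs,\r\n    void* outputs);\r\n\r\n\r\n#endif\r\n"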
  },
  {
    "path": "swin-transformer/semantic-segmentation/common.hpp",
    "content": "#ifndef COMMON_HPP\r\n#define COMMON_HPP\r\n\r\n#include \"layerNorm.h\"\r\n#include \"NvInfer.h\"\r\n#include \"NvInferPlugin.h\"\r\n#include \"cuda_runtime_api.h\"\r\n#include <assert.h>\r\n#include <iostream>\r\n#include <map>\r\n#include <fstream>\r\n#include <string>\r\n#include <vector>\r\n#include<opencv2/core/core.hpp>\r\n#include<opencv2/imgproc/imgproc.hpp>\r\n#include<opencv2/imgcodecs/imgcodecs.hpp>\r\n#include<opencv2/dnn/dnn.hpp>\r\n\r\nusing namespace nvinfer1;\r\n#define CHECK(status) \\\r\n    do\\\r\n    {\\\r\n        auto ret = (status);\\\r\n        if (ret != 0)\\\r\n        {\\\r\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\r\n            abort();\\\r\n        }\\\r\n    } while (0)\r\n\r\nvoid mblobFromImages(cv::InputArrayOfArrays images_, cv::OutputArray blob_,\r\n    cv::Size size, const cv::Scalar& mean_, const cv::Scalar& std_, bool swapRB, bool crop)\r\n{\r\n    //CV_TRACE_FUNCTION();\r\n    std::vector<cv::Mat> images;\r\n    images_.getMatVector(images);\r\n    CV_Assert(!images.empty());\r\n    for (int i = 0; i < images.size(); i++)\r\n    {\r\n        cv::Size imgSize = images[i].size();\r\n        if (size == cv::Size())\r\n            size = imgSize;\r\n        if (size != imgSize)\r\n        {\r\n            if (crop)\r\n            {\r\n                float resizeFactor = std::max(size.width / (float)imgSize.width,\r\n                    size.height / (float)imgSize.height);\r\n                resize(images[i], images[i], cv::Size(), resizeFactor, resizeFactor, cv::INTER_LINEAR);\r\n                cv::Rect crop(cv::Point(0.5 * (images[i].cols - size.width),\r\n                    0.5 * (images[i].rows - size.height)),\r\n                    size);\r\n                images[i] = images[i](crop);\r\n            }\r\n            else\r\n                resize(images[i], images[i], size, 0, 0, cv::INTER_LINEAR);\r\n        }\r\n        if (images[i].depth() == CV_8U)\r\n            images[i].convertTo(images[i], CV_32F);\r\n     
   cv::Scalar mean = mean_;\r\n        cv::Scalar std_num = std_;\r\n        if (swapRB)\r\n        {\r\n            std::swap(mean[0], mean[2]);\r\n            std::swap(std_num[0], std_num[2]);\r\n        }\r\n\r\n        images[i] -= mean;\r\n        images[i] /= std_num;\r\n    }\r\n\r\n    size_t i, nimages = images.size();\r\n    cv::Mat image0 = images[0];\r\n    int nch = image0.channels();\r\n    CV_Assert(image0.dims == 2);\r\n    cv::Mat image;\r\n    if (nch == 3 || nch == 4)\r\n    {\r\n        int sz[] = { (int)nimages, nch, image0.rows, image0.cols };\r\n        blob_.create(4, sz, CV_32F);\r\n        cv::Mat blob = blob_.getMat();\r\n        cv::Mat ch[4];\r\n\r\n        for (i = 0; i < nimages; i++)\r\n        {\r\n            image = images[i];\r\n            CV_Assert(image.depth() == CV_32F);\r\n            nch = image.channels();\r\n            CV_Assert(image.dims == 2 && (nch == 3 || nch == 4));\r\n            CV_Assert(image.size() == image0.size());\r\n\r\n            for (int j = 0; j < nch; j++)\r\n                ch[j] = cv::Mat(image.rows, image.cols, CV_32F, blob.ptr((int)i, j));\r\n            if (swapRB)\r\n                std::swap(ch[0], ch[2]);\r\n            split(image, ch);\r\n        }\r\n    }\r\n    else\r\n    {\r\n        CV_Assert(nch == 1);\r\n        int sz[] = { (int)nimages, 1, image0.rows, image0.cols };\r\n        blob_.create(4, sz, CV_32F);\r\n        cv::Mat blob = blob_.getMat();\r\n\r\n        for (i = 0; i < nimages; i++)\r\n        {\r\n            cv::Mat image = images[i];\r\n            CV_Assert(image.depth() == CV_32F);\r\n            nch = image.channels();\r\n            CV_Assert(image.dims == 2 && (nch == 1));\r\n            CV_Assert(image.size() == image0.size());\r\n\r\n            image.copyTo(cv::Mat(image.rows, image.cols, CV_32F, blob.ptr((int)i, 0)));\r\n        }\r\n    }\r\n}\r\ncv::Mat BlobFromImages(cv::InputArrayOfArrays images, cv::Size size,\r\n    const cv::Scalar& mean, const 
cv::Scalar& std_num, bool swapRB, bool crop)\r\n{\r\n    //CV_TRACE_FUNCTION();\r\n    cv::Mat blob;\r\n    mblobFromImages(images, blob, size, mean, std_num, swapRB, crop);\r\n    return blob;\r\n}\r\nvoid debug_print(ITensor *input_tensor,std::string head)\r\n{\r\n    std::cout << head<< \" : \";\r\n\r\n       for (int i = 0; i < input_tensor->getDimensions().nbDims; i++)\r\n       {\r\n           std::cout << input_tensor->getDimensions().d[i] << \" \";\r\n       }\r\n       std::cout<<std::endl;\r\n\r\n}\r\nstd::map<std::string, Weights> loadWeights(const std::string file) {\r\n    std::cout << \"Loading weights: \" << file << std::endl;\r\n    std::map<std::string, Weights> weightMap;\r\n\r\n    // Open weights file\r\n    std::ifstream input(file);\r\n    assert(input.is_open() && \"Unable to load weight file.\");\r\n\r\n    // Read number of weight blobs\r\n    int32_t count;\r\n    input >> count;\r\n    assert(count > 0 && \"Invalid weight map file.\");\r\n\r\n    while (count--)\r\n    {\r\n        Weights wt{ DataType::kFLOAT, nullptr, 0 };\r\n        uint32_t size;\r\n\r\n        // Read name and type of blob\r\n        std::string name;\r\n        input >> name >> std::dec >> size;\r\n        wt.type = DataType::kFLOAT;\r\n\r\n        // Load blob\r\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\r\n        for (uint32_t x = 0, y = size; x < y; ++x)\r\n        {\r\n            input >> std::hex >> val[x];\r\n        }\r\n        wt.values = val;\r\n\r\n        wt.count = size;\r\n        weightMap[name] = wt;\r\n    }\r\n\r\n    return weightMap;\r\n}\r\n\r\nITensor* m_layerNorm(INetworkDefinition *m_Network,std::map<std::string, Weights> weightMap,ITensor *input, string lname)\r\n{\r\n    auto creator = getPluginRegistry()->getPluginCreator(\"layerNorm_trt\",\"1\");\r\n\r\n    PluginField pluginMultidata[2];\r\n\r\n    const PluginFieldCollection* pluginData = creator->getFieldNames();\r\n    IPluginV2 *pluginObj = 
creator->createPlugin(lname.c_str(), pluginData);\r\n    ITensor* inputTensors[] = {input};\r\n    auto ln_ms = m_Network->addPluginV2(inputTensors, 1, *pluginObj);\r\n    auto ln_m = m_Network->addElementWise(*input,*ln_ms->getOutput(0),ElementWiseOperation::kSUB);\r\n    auto ln = m_Network->addElementWise(*ln_m->getOutput(0),*ln_ms->getOutput(1),ElementWiseOperation::kDIV);\r\n    Weights W = weightMap[lname + \".weight\"];\r\n    int len = W.count;\r\n    Dims wb ;\r\n    wb.nbDims = ln->getOutput(0)->getDimensions().nbDims;\r\n    for (int i = 0 ; i < wb.nbDims; i++)\r\n    {\r\n        if (i != wb.nbDims -1)\r\n            wb.d[i] = 1;\r\n        else{\r\n            wb.d[i] = len;\r\n        }\r\n    }\r\n    auto wgts = m_Network->addConstant(wb,W);\r\n    auto p_w = m_Network->addElementWise(*ln->getOutput(0),*wgts->getOutput(0),ElementWiseOperation::kPROD);\r\n    Weights B = weightMap[lname + \".bias\"];\r\n    auto bias = m_Network->addConstant(wb,B);\r\n    auto sum_bias = m_Network->addElementWise(*p_w->getOutput(0),*bias->getOutput(0),ElementWiseOperation::kSUM);\r\n    debug_print(sum_bias->getOutput(0),lname);\r\n    return sum_bias->getOutput(0);\r\n}\r\nITensor* layerNorm(INetworkDefinition *m_Network,std::map<std::string, Weights> weightMap,ITensor *input, string lname)\r\n{\r\n    auto mean = m_Network->addReduce(*input, ReduceOperation::kAVG, 2, true);\r\n    assert(mean);\r\n\r\n    auto sub_mean = m_Network->addElementWise(*input, *mean->getOutput(0), ElementWiseOperation::kSUB);\r\n    assert(sub_mean);\r\n//    float SCALING_ONE = 1.0;\r\n//    float SHIFT_ZERO = 0.0;\r\n//    float POWER_TWO = 2.0;\r\n//    // implement pow2 with scale\r\n//    Weights scale{ DataType::kFLOAT, &SCALING_ONE, 1 };\r\n//    Weights shift{ DataType::kFLOAT, &SHIFT_ZERO, 1 };\r\n//    Weights power{ DataType::kFLOAT, &POWER_TWO, 1 };\r\n//    auto pow2 = m_Network->addScaleNd(*sub_mean->getOutput(0), ScaleMode::kUNIFORM, shift, scale, power,0);\r\n//    
assert(pow2);\r\n    auto pow2 = m_Network->addElementWise(*sub_mean->getOutput(0), *sub_mean->getOutput(0), ElementWiseOperation::kPROD);\r\n    assert(pow2);\r\n    debug_print(pow2->getOutput(0),\"pow2\");\r\n    auto pow_mean = m_Network->addReduce(*pow2->getOutput(0), ReduceOperation::kAVG, 2, true);\r\n    assert(pow_mean);\r\n    debug_print(pow_mean->getOutput(0),\"pow_mean\");\r\n    // Weights given to addConstant must stay valid until the engine is built,\r\n    // so keep epsilon on the heap instead of pointing at a local variable.\r\n    float* E = new float(1e-5f);\r\n    Weights EPS{DataType::kFLOAT,nullptr,1};\r\n    EPS.values = E;\r\n    auto eps = m_Network->addConstant(Dims2{1,1}, EPS);\r\n    assert(eps);\r\n\r\n    auto add_eps = m_Network->addElementWise(*pow_mean->getOutput(0), *eps->getOutput(0), ElementWiseOperation::kSUM);\r\n    assert(add_eps);\r\n\r\n    auto sqrt = m_Network->addUnary(*add_eps->getOutput(0), UnaryOperation::kSQRT);\r\n    assert(sqrt);\r\n\r\n    auto div = m_Network->addElementWise(*sub_mean->getOutput(0), *sqrt->getOutput(0), ElementWiseOperation::kDIV);\r\n    assert(div);\r\n    debug_print(div->getOutput(0),\"div\");\r\n\r\n    string weightsFile = lname + \".weight\";\r\n    string biasFile = lname + \".bias\";\r\n\r\n    int d_model = input->getDimensions().d[input->getDimensions().nbDims - 1];\r\n    cout<<\"d_model = \"<<d_model<<endl;\r\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * d_model));\r\n    for (int i = 0; i < d_model; i++) {\r\n        pval[i] = 1.0;\r\n    }\r\n    Weights norm1_power{ DataType::kFLOAT, pval, d_model };\r\n    auto affine = m_Network->addScaleNd(\r\n        *div->getOutput(0),\r\n        ScaleMode::kELEMENTWISE,\r\n        weightMap[biasFile],\r\n        weightMap[weightsFile],\r\n        norm1_power,1);\r\n    assert(affine);\r\n    return affine->getOutput(0);\r\n}\r\nITensor* conv(INetworkDefinition *m_Network,std::map<std::string, Weights> weightMap,ITensor *input, string lname,\r\n              int c_out,bool bias = true,int k = 4 , int s = 4, int p = 0)\r\n{\r\n    Weights Bias{ DataType::kFLOAT, nullptr, 0 };\r\n    
if(bias)\r\n        Bias = weightMap[lname + \".bias\"];\r\n    auto out = m_Network->addConvolutionNd(*input,c_out,Dims2{k,k},weightMap[lname + \".weight\"],Bias);\r\n    out->setStrideNd(Dims2{s,s});\r\n    out->setPaddingNd(Dims2{p,p});\r\n    out->setNbGroups(1);\r\n    debug_print(out->getOutput(0),lname);\r\n    return out->getOutput(0);\r\n}\r\nITensor* shuffle_reshape(INetworkDefinition *m_Network,ITensor *input,Dims reshapeDims)\r\n{\r\n    auto out = m_Network->addShuffle(*input);\r\n    out->setReshapeDimensions(reshapeDims);\r\n    debug_print(out->getOutput(0),\"reshape\");\r\n    return out->getOutput(0);\r\n}\r\nITensor* shuffle_permute(INetworkDefinition *m_Network,ITensor *input,Permutation permutation)\r\n{\r\n    auto out = m_Network->addShuffle(*input);\r\n    out->setFirstTranspose(permutation);\r\n    debug_print(out->getOutput(0),\"permute\");\r\n    return out->getOutput(0);\r\n}\r\nITensor* shuffle_reshapeApermute(INetworkDefinition *m_Network,ITensor *input,Dims reshapeDims,\r\n                                 Permutation permutation,bool firstReshape)\r\n{\r\n    auto out = m_Network->addShuffle(*input);\r\n    out->setReshapeDimensions(reshapeDims);\r\n    if(firstReshape)\r\n        out->setSecondTranspose(permutation);\r\n    else\r\n        out->setFirstTranspose(permutation);\r\n    debug_print(out->getOutput(0),\"shuffle\");\r\n    return out->getOutput(0);\r\n}\r\nITensor* trt_transform_imgMask(INetworkDefinition *m_Network,int hw, int window_size, int shift_size)\r\n{\r\n    int Hp = hw;\r\n    int Wp = hw;\r\n    Weights Mask_param{DataType::kFLOAT,nullptr,Hp*Wp};\r\n    float *mask_param = new float[Hp*Wp];\r\n    for(int i = 0; i < Hp ; i++)\r\n    {\r\n        for(int j = 0; j < Wp; j++)\r\n        {\r\n            if(i<Hp-window_size && j<Wp-window_size)\r\n                mask_param[i*Wp + j] = 0.0;\r\n            else if(i<Hp-window_size && j>=Wp-window_size && j < Wp-shift_size)\r\n                mask_param[i*Wp + j] = 
1.0;\r\n            else if(i<Hp-window_size &&  j >= Wp-shift_size)\r\n                mask_param[i*Wp + j] = 2.0;\r\n\r\n            else if(i >= Hp-window_size && i < Hp-shift_size && j<Wp-window_size)\r\n                mask_param[i*Wp + j] = 3.0;\r\n            else if(i >= Hp-window_size && i < Hp-shift_size && j>=Wp-window_size && j < Wp-shift_size)\r\n                mask_param[i*Wp + j] = 4.0;\r\n            else if(i >= Hp-window_size && i < Hp-shift_size && j >= Wp-shift_size)\r\n                mask_param[i*Wp + j] = 5.0;\r\n\r\n            else if(i >=  Hp-shift_size && j<Wp-window_size)\r\n                mask_param[i*Wp + j] = 6.0;\r\n            else if(i >=  Hp-shift_size && j>=Wp-window_size && j < Wp-shift_size)\r\n                mask_param[i*Wp + j] = 7.0;\r\n            else if(i >=  Hp-shift_size && j >= Wp-shift_size)\r\n                mask_param[i*Wp + j] = 8.0;\r\n            else{\r\n                cout<<\" i && j not limit\"<<endl;\r\n                return nullptr;\r\n            }\r\n        }\r\n    }\r\n    Mask_param.values = mask_param;\r\n    auto img_mask = m_Network->addConstant(Dims4{1,Hp,Wp,1},Mask_param);\r\n    auto img_mask_shuffle = m_Network->addShuffle(*img_mask->getOutput(0));\r\n    Dims shuffle1_dims;\r\n    shuffle1_dims.nbDims = 6;\r\n    int dims[] = {1,Hp/window_size,window_size,Wp/window_size,window_size,1};\r\n    for(int i = 0 ; i < 6; i++)\r\n        shuffle1_dims.d[i] = dims[i];\r\n    img_mask_shuffle->setReshapeDimensions(shuffle1_dims);\r\n    img_mask_shuffle->setSecondTranspose(Permutation{0,1,3,2,4,5});\r\n    auto img_mask_shuffle2 = m_Network->addShuffle(*img_mask_shuffle->getOutput(0));\r\n    img_mask_shuffle2->setReshapeDimensions(Dims3{-1,1,window_size*window_size});\r\n    auto img_mask_shuffle3 = m_Network->addShuffle(*img_mask_shuffle->getOutput(0)) ;\r\n    img_mask_shuffle3->setReshapeDimensions(Dims3{-1,window_size*window_size,1});\r\n    auto atten_mask = 
m_Network->addElementWise(*img_mask_shuffle2->getOutput(0),*img_mask_shuffle3->getOutput(0),ElementWiseOperation::kSUB);\r\n\r\n    auto creator = getPluginRegistry()->getPluginCreator(\"fillmaskLayer_TRT\", \"1\");\r\n    const PluginFieldCollection* pluginData = creator->getFieldNames();\r\n    IPluginV2 *pluginObj = creator->createPlugin(\"fillmask\", pluginData);\r\n    ITensor* inputTensors[] = {atten_mask->getOutput(0)};\r\n    auto fillmask = m_Network->addPluginV2(inputTensors, 1, *pluginObj);\r\n\r\n    debug_print(fillmask->getOutput(0),\"imgMask\");\r\n    return fillmask->getOutput(0);\r\n}\r\nITensor* trt_transform_pad(INetworkDefinition *m_Network,ITensor *input,int window_size)\r\n{\r\n    int h = input->getDimensions().d[0];\r\n    int w = input->getDimensions().d[1];\r\n    int c = input->getDimensions().d[2];\r\n    int pad_h = (window_size - h%window_size)%window_size;\r\n    int pad_w = (window_size - w%window_size)%window_size;\r\n\r\n    ITensor* temp = input;\r\n    if(pad_h != 0)\r\n    {\r\n        Weights pad1{DataType::kFLOAT,nullptr,pad_h*w*c};\r\n        cout<<pad_h*w*c<<endl;\r\n        float *p1 = new float[pad_h*w*c];\r\n        for(int i = 0 ; i < pad_h*w*c; i++)\r\n            p1[i] = 0.f;\r\n        pad1.values = p1;\r\n        auto Pad1 = m_Network->addConstant(Dims3{pad_h,w,c},pad1);\r\n        ITensor *cat1[2] = {temp,Pad1->getOutput(0)};\r\n        auto xp1 = m_Network->addConcatenation(cat1,2);\r\n        xp1->setAxis(0);\r\n        temp = xp1->getOutput(0);\r\n    }\r\n    if(pad_w != 0)\r\n    {\r\n        Weights pad2{DataType::kFLOAT,nullptr,pad_w*(h+pad_h)*c};\r\n        cout<<pad_w*(h+pad_h)*c<<endl;\r\n        float *p2 = new float[pad_w*(h+pad_h)*c];\r\n        for(int i = 0 ; i < pad_w*(h+pad_h)*c; i++)\r\n            p2[i] = 0.0f;\r\n        pad2.values = p2;\r\n        auto Pad2 = m_Network->addConstant(Dims3{(h+pad_h),pad_w,c},pad2);\r\n        ITensor *cat2[] = {temp,Pad2->getOutput(0)};\r\n        auto xp2 = 
m_Network->addConcatenation(cat2,2);\r\n        xp2->setAxis(1);\r\n        temp = xp2->getOutput(0);\r\n    }\r\n    debug_print(temp, \"pad\");\r\n    return  temp;\r\n}\r\nITensor* trt_swinRoll(INetworkDefinition *m_Network,ITensor *input,vector<int> shifts, vector<int> dims)\r\n{\r\n    int len = shifts.size();\r\n    Dims input_dim = input->getDimensions();\r\n    int nbdims = input_dim.nbDims;\r\n    ITensor *temp = input;\r\n    for(int i = 0 ; i < len; i++)\r\n    {\r\n        Dims start, size,stride;\r\n        start.nbDims = nbdims;\r\n        size.nbDims = nbdims;\r\n        stride.nbDims = nbdims;\r\n        if(shifts[i] > 0)\r\n        {\r\n            for(int j = 0 ; j < nbdims; j++)\r\n            {\r\n                if(j != (dims[i] -1 ))\r\n                {\r\n                    start.d[j] = 0;\r\n                    size.d[j] = input_dim.d[j];\r\n                    stride.d[j] = 1;\r\n                }\r\n                else{\r\n                    start.d[j] = 0;\r\n                    size.d[j] = input_dim.d[j] - shifts[i];\r\n                    stride.d[j] = 1;\r\n                }\r\n            }\r\n\r\n            auto cat1 = m_Network->addSlice(*temp,start,size,stride);\r\n\r\n            for(int j = 0 ; j < nbdims; j++)\r\n            {\r\n                if(j != (dims[i] - 1))\r\n                {\r\n                    start.d[j] = 0;\r\n                    size.d[j] = input_dim.d[j];\r\n                    stride.d[j] = 1;\r\n                }\r\n                else{\r\n                    start.d[j] = input_dim.d[j] - shifts[i];\r\n                    size.d[j] = shifts[i];\r\n                    stride.d[j] = 1;\r\n                }\r\n            }\r\n            auto cat2 = m_Network->addSlice(*temp,start,size,stride);\r\n            ITensor *cat[] ={cat2->getOutput(0),cat1->getOutput(0)};\r\n            auto Cat = m_Network->addConcatenation(cat,2);\r\n            Cat->setAxis(dims[i] - 1);\r\n            temp = 
Cat->getOutput(0);\r\n        }\r\n        if(shifts[i] < 0)\r\n        {\r\n            for(int j = 0 ; j < nbdims; j++)\r\n            {\r\n                if(j != (dims[i] - 1))\r\n                {\r\n                    start.d[j] = 0;\r\n                    size.d[j] = input_dim.d[j];\r\n                    stride.d[j] = 1;\r\n                }\r\n                else{\r\n                    start.d[j] = 0;\r\n                    size.d[j] = abs(shifts[i]);\r\n                    stride.d[j] = 1;\r\n                }\r\n            }\r\n            auto cat1 = m_Network->addSlice(*temp,start,size,stride);\r\n            debug_print(cat1->getOutput(0), \"cat1 dims : \");\r\n            for(int j = 0 ; j < nbdims; j++)\r\n            {\r\n                if(j != (dims[i] - 1))\r\n                {\r\n                    start.d[j] = 0;\r\n                    size.d[j] = input_dim.d[j];\r\n                    stride.d[j] = 1;\r\n                }\r\n                else{\r\n                    start.d[j] =  abs(shifts[i]);\r\n                    size.d[j] = input_dim.d[j] - abs(shifts[i]);\r\n                    stride.d[j] = 1;\r\n                }\r\n            }\r\n            auto cat2 = m_Network->addSlice(*temp,start,size,stride);\r\n            debug_print(cat2->getOutput(0), \"cat2 dims : \");\r\n            ITensor *cat[] ={cat2->getOutput(0),cat1->getOutput(0)};\r\n            auto Cat = m_Network->addConcatenation(cat,2);\r\n            Cat->setAxis(dims[i] - 1);\r\n            temp = Cat->getOutput(0);\r\n        }\r\n    }\r\n    return temp;\r\n}\r\nITensor* trt_transform_window_partition(INetworkDefinition *m_Network,ITensor *input,int window_size)\r\n{\r\n    auto shuffle1 = m_Network->addShuffle(*input);\r\n    Dims shuffle1_dims;\r\n    shuffle1_dims.nbDims = 5;\r\n    int h = input->getDimensions().d[0];\r\n    int w = input->getDimensions().d[1];\r\n    int c = input->getDimensions().d[2];\r\n\r\n    int dims[] = 
{h/window_size,window_size,w/window_size,window_size,c};\r\n    for(int i = 0 ; i < shuffle1_dims.nbDims; i++)\r\n        shuffle1_dims.d[i] = dims[i];\r\n    shuffle1->setReshapeDimensions(shuffle1_dims);\r\n    shuffle1->setSecondTranspose(Permutation{0,2,1,3,4});\r\n    debug_print(shuffle1->getOutput(0),\" shuffle1 dims : \");\r\n    auto shuffle2 = m_Network->addShuffle(*shuffle1->getOutput(0));\r\n    shuffle2->setReshapeDimensions(Dims3{-1,window_size*window_size,c});\r\n\r\n    debug_print(shuffle2->getOutput(0), \"window partition\");\r\n    return shuffle2->getOutput(0);\r\n}\r\nITensor* trt_swinLinear(INetworkDefinition *m_Network,std::map<std::string, Weights> weightMap,\r\n                        ITensor *input, string lname, bool bias = true)\r\n{\r\n    int c = input->getDimensions().d[input->getDimensions().nbDims-1];\r\n    string fc_wpath = lname + \".weight\";\r\n    Weights fcW = weightMap[fc_wpath];\r\n    int len_fcw = fcW.count;\r\n    if(len_fcw == 0)\r\n    {\r\n        cout<<\"file is not open,please check it's path: \"<<fc_wpath<<endl;\r\n        assert(0);\r\n    }\r\n    Dims fcWdims;\r\n    fcWdims.nbDims = input->getDimensions().nbDims;\r\n    if(fcWdims.nbDims == 2)\r\n    {\r\n        fcWdims.d[0] = len_fcw/c;\r\n        fcWdims.d[1] = c;\r\n    }\r\n    else {\r\n        fcWdims.d[0] = 1;\r\n        fcWdims.d[1] = len_fcw/c;\r\n        fcWdims.d[2] = c;\r\n    }\r\n    auto fc_w_constant = m_Network->addConstant(fcWdims,fcW);\r\n    auto fc_w_mm = m_Network->addMatrixMultiply(*input,MatrixOperation::kNONE,\r\n                                                *fc_w_constant->getOutput(0),MatrixOperation::kTRANSPOSE);\r\n\r\n    string fc_bpath = lname +\".bias\";\r\n    Weights fcB = weightMap[fc_bpath];\r\n    int len_fcb = fcB.count;\r\n    if(!bias)\r\n    {\r\n        cout<<lname<<\" bias is Null!\"<<endl;\r\n        debug_print(fc_w_mm->getOutput(0),lname);\r\n        return fc_w_mm->getOutput(0);\r\n    }\r\n    Dims 
fcBdims;\r\n    fcBdims.nbDims = input->getDimensions().nbDims;\r\n    if(fcBdims.nbDims == 2)\r\n    {\r\n        fcBdims.d[0] = 1;\r\n        fcBdims.d[1] = len_fcb;\r\n    }\r\n    else {\r\n        fcBdims.d[0] = 1;\r\n        fcBdims.d[1] = 1;\r\n        fcBdims.d[2] = len_fcb;\r\n    }\r\n    auto fc_b_constant = m_Network->addConstant(fcBdims,fcB);\r\n    auto fc = m_Network->addElementWise(*fc_w_mm->getOutput(0),*fc_b_constant->getOutput(0),ElementWiseOperation::kSUM);\r\n    debug_print(fc->getOutput(0),lname);\r\n    return fc->getOutput(0);\r\n}\r\nITensor* trt_trainsform_WindowAttention(INetworkDefinition *m_Network,std::map<std::string, Weights> weightMap,ITensor *input,\r\n                                        ITensor* mask,string lname,int dim, int num_heads,int window_size, int shift_size)\r\n{\r\n\r\n    int b = input->getDimensions().d[0];\r\n    int n = input->getDimensions().d[1];\r\n    int c = input->getDimensions().d[2];\r\n\r\n    auto qkv = trt_swinLinear(m_Network,weightMap,input,lname+\".qkv\");\r\n\r\n    Dims qkv_dim;\r\n    qkv_dim.nbDims = 5;\r\n    int d[5] = {b,n,3,num_heads,c/num_heads};\r\n    for(int i = 0; i < 5; i++)\r\n        qkv_dim.d[i] = d[i];\r\n    Permutation qkv_p;\r\n    int p[5] = {2, 0, 3, 1, 4};\r\n    for(int i = 0; i < 5; i++)\r\n        qkv_p.order[i] = p[i];\r\n    auto qkv_shuffle = shuffle_reshapeApermute(m_Network,qkv,qkv_dim,qkv_p,true);\r\n\r\n    Dims qkvDims = qkv_shuffle->getDimensions();\r\n    Dims qstart,kstart,vstart,sizes,stride;\r\n    qstart.nbDims = 5;\r\n    kstart.nbDims = 5;\r\n    vstart.nbDims = 5;\r\n    sizes.nbDims = 5;\r\n    stride.nbDims = 5;\r\n    for(int i = 0; i < 5; i++)\r\n    {\r\n        if(i == 0)\r\n        {\r\n            qstart.d[0] = 0;\r\n            kstart.d[0] = 1;\r\n            vstart.d[0] = 2;\r\n            sizes.d[0] = 1;\r\n            stride.d[0] =1;\r\n        }\r\n        else{\r\n            qstart.d[i] = 0;\r\n            kstart.d[i] = 0;\r\n            
vstart.d[i] = 0;\r\n            sizes.d[i] = qkvDims.d[i];\r\n            stride.d[i] =1;\r\n        }\r\n    }\r\n    auto q = m_Network->addSlice(*qkv_shuffle,qstart,sizes,stride);\r\n    auto k = m_Network->addSlice(*qkv_shuffle,kstart,sizes,stride);\r\n    auto v = m_Network->addSlice(*qkv_shuffle,vstart,sizes,stride);\r\n\r\n    // q * s\r\n    int len = 1;\r\n    Weights scale_w{DataType::kFLOAT,nullptr,len};\r\n    float *scale = new float[len];\r\n    for(int i = 0 ; i < len; i++)\r\n        scale[i] = 1 / sqrt(dim/num_heads);\r\n    scale_w.values = scale;\r\n    Dims scale_dim;\r\n    scale_dim.nbDims = 5;\r\n\r\n    for(int i = 0 ; i < 5; i++)\r\n        scale_dim.d[i] = 1;\r\n    auto Scale = m_Network->addConstant(scale_dim,scale_w);\r\n    auto qs = m_Network->addElementWise(*q->getOutput(0),*Scale->getOutput(0),ElementWiseOperation::kPROD);\r\n    auto qs_ = m_Network->addShuffle(*qs->getOutput(0));\r\n    qs_->setReshapeDimensions(Dims4{qkvDims.d[1],qkvDims.d[2],qkvDims.d[3],qkvDims.d[4]});\r\n    auto k_ = m_Network->addShuffle(*k->getOutput(0));\r\n    k_->setReshapeDimensions(Dims4{qkvDims.d[1],qkvDims.d[2],qkvDims.d[3],qkvDims.d[4]});\r\n    auto attn = m_Network->addMatrixMultiply(*qs_->getOutput(0),MatrixOperation::kNONE,\r\n                                             *k_->getOutput(0),MatrixOperation::kTRANSPOSE);\r\n    auto relatbias = m_Network->addConstant(Dims2{(2*window_size -1)*(2*window_size -1),num_heads},weightMap[lname + \".relative_position_bias_table\"]);\r\n    Dims r_i_dims;\r\n    r_i_dims.nbDims = 1;\r\n    r_i_dims.d[0] = window_size*window_size * window_size*window_size;\r\n    Weights index{DataType::kINT32,nullptr,r_i_dims.d[0]};\r\n    int* idx = new int[r_i_dims.d[0]];\r\n    for (int i = 0; i < r_i_dims.d[0]; i++) {\r\n        idx[i] =(int)((float*)weightMap[lname+\".relative_position_index\"].values)[i];\r\n    }\r\n    //idx = (int*)weightMap[lname+\".relative_position_index\"].values;\r\n    //cout<<\"idx = 
\"<<((float*)weightMap[lname+\".relative_position_index\"].values)[0]<<endl;\r\n    index.values = idx;\r\n    auto relatidx = m_Network->addConstant(r_i_dims,index);\r\n    auto relat = m_Network->addGather(*relatbias->getOutput(0),*relatidx->getOutput(0),0);\r\n    auto relat_view = shuffle_reshapeApermute(m_Network,relat->getOutput(0),\r\n                                              Dims4{1,window_size*window_size,window_size*window_size,-1},\r\n                                              Permutation{0,3,1,2},true);\r\n    auto attn_rv = m_Network->addElementWise(*attn->getOutput(0),*relat_view,ElementWiseOperation::kSUM);\r\n    ITensor *Attn_rv = attn_rv->getOutput(0);\r\n    if (mask != nullptr)\r\n    {\r\n        Dims maskdims;\r\n        maskdims.nbDims = mask->getDimensions().nbDims +1;\r\n        maskdims.d[0] = mask->getDimensions().d[0];\r\n        maskdims.d[1] = 1;\r\n        for(int i = 2; i< maskdims.nbDims; i++)\r\n        {\r\n            maskdims.d[i] = mask->getDimensions().d[i-1];\r\n        }\r\n        auto maskshuffle = m_Network->addShuffle(*mask);\r\n        maskshuffle->setReshapeDimensions(maskdims);\r\n        auto attn_rnM = m_Network->addElementWise(*attn_rv->getOutput(0),*maskshuffle->getOutput(0),ElementWiseOperation::kSUM);\r\n        Attn_rv = attn_rnM->getOutput(0);\r\n    }\r\n    auto attn_rv_s = m_Network->addSoftMax(*Attn_rv);\r\n    attn_rv_s->setAxes(8);\r\n    auto v_ = m_Network->addShuffle(*v->getOutput(0));\r\n    v_->setReshapeDimensions(Dims4{qkvDims.d[1],qkvDims.d[2],qkvDims.d[3],qkvDims.d[4]});\r\n    auto attn_v = m_Network->addMatrixMultiply(*attn_rv_s->getOutput(0),MatrixOperation::kNONE,\r\n                                               *v_->getOutput(0),MatrixOperation::kNONE);\r\n    auto x_reshape = shuffle_reshapeApermute(m_Network,attn_v->getOutput(0),Dims3{b,n,c},Permutation{0,2,1,3},false);\r\n    auto x_linear = trt_swinLinear(m_Network,weightMap,x_reshape,lname+\".proj\");\r\n    return 
x_linear;\r\n}\r\nITensor* trt_window_reverse(INetworkDefinition *m_Network, ITensor *input, int window_size, int H, int W)\r\n{\r\n    Dims viewDims;\r\n    viewDims.nbDims = 5;\r\n    int d[5] = {H/window_size,W/window_size,window_size,window_size,-1};\r\n    for(int i = 0; i < 5; i++)\r\n        viewDims.d[i] = d[i];\r\n    auto x_view = shuffle_reshape(m_Network,input,viewDims);\r\n    auto output = shuffle_reshapeApermute(m_Network,x_view,Dims3{H,W,-1},Permutation{0,2,1,3,4},false);\r\n    return output;\r\n}\r\nITensor* gelu(INetworkDefinition *m_Network,ITensor *input)\r\n{\r\n    auto creator = getPluginRegistry()->getPluginCreator(\"geluLayer_TRT\", \"1\");\r\n    const PluginFieldCollection* pluginData = creator->getFieldNames();\r\n    IPluginV2 *pluginObj = creator->createPlugin(\"gelu\", pluginData);\r\n    ITensor* inputTensors[] = {input};\r\n    auto g = m_Network->addPluginV2(inputTensors, 1, *pluginObj);\r\n    return g->getOutput(0);\r\n}\r\n//ITensor* adaptiveAvgPool2d(INetworkDefinition *m_Network,ITensor *input)\r\n//{\r\n//    auto creator = getPluginRegistry()->getPluginCreator(\"adaptiveAvgPooling_TRT\", \"1\");\r\n//    const PluginFieldCollection* pluginData = creator->getFieldNames();\r\n//    IPluginV2 *pluginObj = creator->createPlugin(\"apAvgPool\", pluginData);\r\n//    ITensor* inputTensors[] = {input};\r\n//    auto g = m_Network->addPluginV2(inputTensors, 1, *pluginObj);\r\n//    return g->getOutput(0);\r\n//}\r\nITensor* trt_transform_mlp(INetworkDefinition *m_Network,std::map<std::string, Weights> weightMap,ITensor *input,\r\n                           string lname,int dim,int mlp_ratio = 4)\r\n{\r\n//    auto fc1 = m_Network->addFullyConnected(*input,dim * mlp_ratio,\r\n//                                            weightMap[lname+\".fc1.weight\"],weightMap[lname+\".fc1.bias\"]);\r\n    auto fc1 = trt_swinLinear(m_Network,weightMap,input,lname+\".fc1\");\r\n    auto act = gelu(m_Network,fc1);\r\n//    auto fc2 = 
m_Network->addFullyConnected(*act,dim ,\r\n//                                            weightMap[lname+\".fc2.weight\"],weightMap[lname+\".fc2.bias\"]);\r\n    auto fc2 = trt_swinLinear(m_Network,weightMap,act,lname+\".fc2\");\r\n    return fc2;\r\n}\r\nITensor* blk(INetworkDefinition *m_Network,std::map<std::string, Weights> weightMap,ITensor *input, ITensor* mask, string lname,\r\n             int hw,int dim, int num_heads,int window_size,int shift_size,int mlp_ratio = 4)\r\n{\r\n    int c = input->getDimensions().d[input->getDimensions().nbDims - 1];\r\n    auto x = input;\r\n    auto norm1 = m_layerNorm(m_Network,weightMap,x,lname+\".norm1\");\r\n    //auto norm1 = x;\r\n    auto view1 = shuffle_reshape(m_Network,norm1,Dims3{hw,hw,c});\r\n    auto pad = trt_transform_pad(m_Network,view1,window_size);\r\n    int hp = pad->getDimensions().d[0];\r\n    int wp = pad->getDimensions().d[1];\r\n    ITensor* shifted_x;\r\n    ITensor* atten_mask = nullptr;\r\n    if(shift_size > 0)\r\n    {\r\n        shifted_x = trt_swinRoll(m_Network,pad,{-3,-3},{1,2});\r\n        atten_mask = mask;\r\n    }\r\n    else\r\n    {\r\n        shifted_x = pad;\r\n    }\r\n    auto x_windows = trt_transform_window_partition(m_Network,shifted_x,window_size);\r\n    auto x_atten_windows = trt_trainsform_WindowAttention(m_Network,weightMap,x_windows,atten_mask,lname+\".attn\",dim,num_heads,\r\n                                                          window_size,shift_size);\r\n    auto x_atten_windows_view = shuffle_reshape(m_Network,x_atten_windows,Dims4{-1,window_size,window_size,c});\r\n\r\n    shifted_x = trt_window_reverse(m_Network,x_atten_windows_view,window_size,hp,wp);\r\n    if(shift_size > 0)\r\n    {\r\n        x = trt_swinRoll(m_Network,shifted_x,{3,3},{1,2});\r\n    }\r\n    else {\r\n        x = shifted_x;\r\n    }\r\n    if(hw < hp){\r\n        auto sss = m_Network->addSlice(*x,Dims3{0,0,0},Dims3{hw,hw,c},Dims3{1,1,1});\r\n        x = sss->getOutput(0);\r\n    }\r\n    x = 
shuffle_reshape(m_Network,x,Dims2{hw*hw,c});\r\n    x = m_Network->addElementWise(*x,*input,ElementWiseOperation::kSUM)->getOutput(0);\r\n    auto norm2 = m_layerNorm(m_Network,weightMap,x,lname+\".norm2\");\r\n    //auto norm2 = x;\r\n    auto mlp = trt_transform_mlp(m_Network,weightMap,norm2,lname+\".mlp\",dim);\r\n    auto out= m_Network->addElementWise(*x,*mlp,ElementWiseOperation::kSUM)->getOutput(0);\r\n    debug_print(out, \"blk\");\r\n    return out;\r\n}\r\nITensor* downsample(INetworkDefinition* m_Network,std::map<std::string, Weights> weightMap,ITensor *input,\r\n                    string lname, int hw)\r\n{\r\n    int c = input->getDimensions().d[input->getDimensions().nbDims - 1];\r\n    auto x = shuffle_reshape(m_Network,input,Dims3{hw,hw,c});\r\n    auto x0 = m_Network->addSlice(*x,Dims3{0,0,0},Dims3{hw/2,hw/2,c},Dims3{2,2,1});\r\n    auto x1 = m_Network->addSlice(*x,Dims3{1,0,0},Dims3{hw/2,hw/2,c},Dims3{2,2,1});\r\n    auto x2 = m_Network->addSlice(*x,Dims3{0,1,0},Dims3{hw/2,hw/2,c},Dims3{2,2,1});\r\n    auto x3 = m_Network->addSlice(*x,Dims3{1,1,0},Dims3{hw/2,hw/2,c},Dims3{2,2,1});\r\n    ITensor* inputTensors[] = { x0->getOutput(0), x1->getOutput(0), x2->getOutput(0), x3->getOutput(0) };\r\n    auto cat = m_Network->addConcatenation(inputTensors, 4);\r\n    cat->setAxis(2);\r\n    auto cat_view = shuffle_reshape(m_Network,cat->getOutput(0),Dims2{-1,4*c});\r\n    auto norm = m_layerNorm(m_Network,weightMap,cat_view,lname+\".norm\");\r\n    //auto norm = cat_view;\r\n    auto reduction = trt_swinLinear(m_Network,weightMap,norm,lname+\".reduction\",false);\r\n    return reduction;\r\n}\r\nITensor* addBatchNorm2d(\r\nINetworkDefinition *network,\r\nstd::map<std::string, Weights> weightMap,\r\nITensor* input,\r\nconst std::string& lname,\r\nfloat eps = 1e-5\r\n) {\r\n    float *gamma = (float*)(weightMap[lname + \".weight\"].values);\r\n    float *beta = (float*)(weightMap[lname + \".bias\"].values);\r\n    float *mean = (float*)(weightMap[lname + 
\".running_mean\"].values);\r\n    float *var = (float*)(weightMap[lname + \".running_var\"].values);\r\n    int len = weightMap[lname + \".running_var\"].count;\r\n\r\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights scale{ DataType::kFLOAT, scval, len };\r\n\r\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\r\n    }\r\n    Weights shift{ DataType::kFLOAT, shval, len };\r\n\r\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\r\n    for (int i = 0; i < len; i++) {\r\n        pval[i] = 1.0;\r\n    }\r\n    Weights power{ DataType::kFLOAT, pval, len };\r\n\r\n    weightMap[lname + \".scale\"] = scale;\r\n    weightMap[lname + \".shift\"] = shift;\r\n    weightMap[lname + \".power\"] = power;\r\n    IScaleLayer* scale_1 = network->addScale(*input, ScaleMode::kCHANNEL, shift, scale, power);\r\n    assert(scale_1);\r\n    return scale_1->getOutput(0);\r\n}\r\nITensor* transform_lateral_conv(INetworkDefinition* m_Network,std::map<std::string, Weights> weightMap,ITensor* input,\r\n                                string lname, int k = 1, int s = 1,int out_features = 512)\r\n{\r\n    Weights empty{DataType::kFLOAT,nullptr,0};\r\n    auto conv = m_Network->addConvolutionNd(*input,out_features,Dims2{k,k},weightMap[lname+\".conv.weight\"],empty);\r\n    conv->setStrideNd(Dims2{s,s});\r\n    conv->setNbGroups(1);\r\n    conv->setPaddingNd(Dims2{k/2,k/2});\r\n    ITensor* bn = addBatchNorm2d(m_Network,weightMap,conv->getOutput(0),lname+\".bn\");\r\n    auto act = m_Network->addActivation(*bn,ActivationType::kRELU);\r\n    return act->getOutput(0);\r\n}\r\nITensor* resize(INetworkDefinition* m_Network, ITensor* input, int grid)\r\n{\r\n    float scale_h = 2.0f;\r\n    float scale_w 
= 2.0f;\r\n\r\n    scale_h = 1.0*grid / input->getDimensions().d[1];\r\n    scale_w = 1.0*grid / input->getDimensions().d[2];\r\n\r\n    auto creator = getPluginRegistry()->getPluginCreator(\"UpsamplePlugin\", \"1\");\r\n    PluginField pField[1];\r\n    float *s = new float[2];\r\n    s[0] = scale_h;\r\n    s[1] = scale_w;\r\n    pField[0].data = s;\r\n    pField[0].length = 2;\r\n    pField[0].type = PluginFieldType::kFLOAT32;\r\n    pField[0].name = \"scaleFactor\";\r\n\r\n    PluginFieldCollection pluginData;\r\n    pluginData.nbFields = 1;\r\n    pluginData.fields = pField;\r\n    IPluginV2 *pluginObj = creator->createPlugin(\"upSample\", &pluginData);\r\n    ITensor* inputTensors[] = {input};\r\n    auto upS = m_Network->addPluginV2(inputTensors, 1, *pluginObj);\r\n    return upS->getOutput(0);\r\n}\r\nITensor* transform_psp(INetworkDefinition* m_Network,std::map<std::string, Weights> weightMap,ITensor* input,\r\n                       string lname, int output_Avg_Size, int out_features = 512)\r\n{\r\n    int inH = input->getDimensions().d[1];\r\n    int inW = input->getDimensions().d[2];\r\n    int kH = inH / output_Avg_Size;\r\n    int kW = inW / output_Avg_Size;\r\n    auto avgPool = m_Network->addPoolingNd(*input,PoolingType::kAVERAGE,Dims2{kH,kW});\r\n    avgPool->setStrideNd(Dims2{kH,kW});\r\n    auto cba = transform_lateral_conv(m_Network,weightMap,avgPool->getOutput(0),lname,1,1,out_features);\r\n    auto out = resize(m_Network,cba,inH);\r\n    return out;\r\n}\r\nITensor* up_Add(INetworkDefinition* m_Network,ITensor* input1,ITensor* input2)\r\n{\r\n    auto in1 = resize(m_Network,input1,input2->getDimensions().d[1]);\r\n    auto out = m_Network->addElementWise(*in1,*input2,ElementWiseOperation::kSUM);\r\n    return out->getOutput(0);\r\n}\r\n\r\n\r\n#endif // COMMON_HPP\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/fillmask.cu",
    "content": "#include \"fillmask.h\"\r\n#include <math.h>\r\nnamespace nvinfer1\r\n{\r\n    fillmask::fillmask()\r\n    {\r\n    }\r\n\r\n    fillmask::~fillmask()\r\n    {\r\n    }\r\n    // create the plugin at runtime from a byte stream\r\n    fillmask::fillmask(const void* data, size_t length)\r\n    {\r\n        const char *d = reinterpret_cast<const char *>(data), *a = d;\r\n        Tn::read(d, mInputSize);\r\n        assert(d == a + length);\r\n    }\r\n\r\n    void fillmask::serialize(void* buffer) const\r\n    {\r\n        char* d = static_cast<char*>(buffer), *a = d;\r\n        Tn::write(d, mInputSize);\r\n        assert(d == a + getSerializationSize());\r\n    }\r\n\r\n    size_t fillmask::getSerializationSize() const\r\n    {\r\n        return sizeof(mInputSize);\r\n    }\r\n\r\n    int fillmask::initialize()\r\n    {\r\n        return 0;\r\n    }\r\n\r\n    Dims fillmask::getOutputDimensions(int index, const Dims* inputs, int nbInputDims)\r\n    {\r\n        assert(nbInputDims == 1);\r\n        Dims outputDims;\r\n        outputDims.nbDims = inputs[0].nbDims;\r\n        for (int i = 0; i < inputs[0].nbDims; i++) {\r\n            outputDims.d[i] = inputs[0].d[i];\r\n        }\r\n        return outputDims;\r\n    }\r\n\r\n    // Set plugin namespace\r\n    void fillmask::setPluginNamespace(const char* pluginNamespace)\r\n    {\r\n        mPluginNamespace = pluginNamespace;\r\n    }\r\n\r\n    const char* fillmask::getPluginNamespace() const\r\n    {\r\n        return mPluginNamespace;\r\n    }\r\n\r\n    // Return the DataType of the plugin output at the requested index\r\n    DataType fillmask::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const\r\n    {\r\n        return DataType::kFLOAT;\r\n    }\r\n\r\n    // Return true if output tensor is broadcast across a batch.\r\n    bool fillmask::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const\r\n    {\r\n        return 
false;\r\n    }\r\n\r\n    // Return true if plugin can use input that is broadcast across batch without replication.\r\n    bool fillmask::canBroadcastInputAcrossBatch(int inputIndex) const\r\n    {\r\n        return false;\r\n    }\r\n\r\n    void fillmask::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput)\r\n    {\r\n\r\n        mInputSize = 1;\r\n        for (int i = 0; i < in[0].dims.nbDims; i++) {\r\n            mInputSize *= in[0].dims.d[i];\r\n        }\r\n    }\r\n\r\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\r\n    void fillmask::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator)\r\n    {\r\n    }\r\n\r\n    // Detach the plugin object from its execution context.\r\n    void fillmask::detachFromContext() {}\r\n\r\n    const char* fillmask::getPluginType() const\r\n    {\r\n        return \"fillmaskLayer_TRT\";\r\n    }\r\n\r\n    const char* fillmask::getPluginVersion() const\r\n    {\r\n        return \"1\";\r\n    }\r\n\r\n    void fillmask::destroy()\r\n    {\r\n        delete this;\r\n    }\r\n\r\n    // Clone the plugin\r\n    IPluginV2IOExt* fillmask::clone() const\r\n    {\r\n        fillmask *p = new fillmask();\r\n        p->setPluginNamespace(mPluginNamespace);\r\n        p->setInputSize(mInputSize);\r\n        return p;\r\n    }\r\n\r\n    __global__ void fillmaskKer(const float *in, float *out, int size) {\r\n        int idx = threadIdx.x + blockIdx.x * blockDim.x;\r\n        if (idx >= size)\r\n            return;\r\n        if (in[idx] != 0.0)\r\n            out[idx] = -100.0;\r\n        else\r\n            out[idx] = 0.0;\r\n    }\r\n    void fillmask::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize) {\r\n\r\n        int numElem = batchSize * mInputSize;\r\n        fillmaskKer<<<(numElem + mThreadCount - 1) / mThreadCount, 
mThreadCount, 0, stream>>>\r\n            (inputs[0], output, numElem);\r\n    }\r\n\r\n    int fillmask::enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream)\r\n    {\r\n        forwardGpu((const float *const *)inputs, (float*)outputs[0], stream, batchSize);\r\n        return 0;\r\n    }\r\n\r\n    PluginFieldCollection fillmaskCreator::mFC{};\r\n    std::vector<PluginField> fillmaskCreator::mPluginAttributes;\r\n\r\n    fillmaskCreator::fillmaskCreator()\r\n    {\r\n        mPluginAttributes.clear();\r\n        mFC.nbFields = mPluginAttributes.size();\r\n        mFC.fields = mPluginAttributes.data();\r\n    }\r\n\r\n    const char* fillmaskCreator::getPluginName() const\r\n    {\r\n            return \"fillmaskLayer_TRT\";\r\n    }\r\n\r\n    const char* fillmaskCreator::getPluginVersion() const\r\n    {\r\n            return \"1\";\r\n    }\r\n\r\n    const PluginFieldCollection* fillmaskCreator::getFieldNames()\r\n    {\r\n            return &mFC;\r\n    }\r\n\r\n    IPluginV2IOExt* fillmaskCreator::createPlugin(const char* name, const PluginFieldCollection* fc)\r\n    {\r\n        fillmask* obj = new fillmask();\r\n        obj->setPluginNamespace(mNamespace.c_str());\r\n        return obj;\r\n    }\r\n\r\n    IPluginV2IOExt* fillmaskCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength)\r\n    {\r\n        // This object will be deleted when the network is destroyed, which will\r\n        // call fillmask::destroy().\r\n        fillmask* obj = new fillmask(serialData, serialLength);\r\n        obj->setPluginNamespace(mNamespace.c_str());\r\n        return obj;\r\n    }\r\n\r\n\r\n}\r\n\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/fillmask.h",
    "content": "#ifndef FILLMASK_H\r\n#define FILLMASK_H\r\n\r\n\r\n#include <vector>\r\n#include <string>\r\n#include \"NvInfer.h\"\r\n#include \"myhpp.h\"\r\n#include <assert.h>\r\n#include \"utilsn.h\"\r\n\r\nnamespace nvinfer1\r\n{\r\n    class fillmask:public IPluginV2IOExt\r\n    {\r\n    public:\r\n        explicit fillmask();\r\n        fillmask(const void* data, size_t length);\r\n        ~fillmask();\r\n        int getNbOutputs() const override\r\n        {\r\n            return 1;\r\n        }\r\n\r\n        Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override;\r\n        int initialize() override;\r\n        virtual void terminate() override {};\r\n        virtual size_t getWorkspaceSize(int maxBatchSize) const override { return 0;}\r\n        virtual int enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream) override;\r\n        virtual size_t getSerializationSize() const override;\r\n        virtual void serialize(void* buffer) const override;\r\n\r\n        bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const override {\r\n            return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\r\n        }\r\n\r\n        const char* getPluginType() const override;\r\n        const char* getPluginVersion() const override;\r\n        void destroy() override;\r\n        IPluginV2IOExt* clone() const override;\r\n        void setPluginNamespace(const char* pluginNamespace) override;\r\n        const char* getPluginNamespace() const override;\r\n        DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const override;\r\n        bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const override;\r\n        bool canBroadcastInputAcrossBatch(int inputIndex) const override;\r\n        void 
attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) override;\r\n        void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) override;\r\n        void detachFromContext() override;\r\n\r\n        void setInputSize(int s) {\r\n            mInputSize = s;\r\n        }\r\n\r\n    private:\r\n        void forwardGpu(const float *const * inputs,float * output, cudaStream_t stream,int batchSize = 1);\r\n        int mThreadCount = 256;\r\n        int mInputSize;\r\n        const char* mPluginNamespace;\r\n    };\r\n\r\n    class fillmaskCreator : public IPluginCreator\r\n    {\r\n        public:\r\n            fillmaskCreator();\r\n            ~fillmaskCreator() override = default;\r\n            const char* getPluginName() const override;\r\n            const char* getPluginVersion() const override;\r\n            const PluginFieldCollection* getFieldNames() override;\r\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) override;\r\n            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;\r\n\r\n            void setPluginNamespace(const char* libNamespace) override\r\n            {\r\n                mNamespace = libNamespace;\r\n            }\r\n\r\n            const char* getPluginNamespace() const override\r\n            {\r\n                return mNamespace.c_str();\r\n            }\r\n\r\n        private:\r\n            std::string mNamespace;\r\n            static PluginFieldCollection mFC;\r\n            static std::vector<PluginField> mPluginAttributes;\r\n    };\r\n    REGISTER_TENSORRT_PLUGIN(fillmaskCreator);\r\n};\r\n\r\n#endif // FILLMASK_H\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/gelu.cu",
    "content": "#include \"gelu.h\"\r\n#include <math.h>\r\nnamespace nvinfer1\r\n{\r\n    gelu::gelu()\r\n    {\r\n    }\r\n\r\n    gelu::~gelu()\r\n    {\r\n    }\r\n    // create the plugin at runtime from a byte stream\r\n    gelu::gelu(const void* data, size_t length)\r\n    {\r\n        const char *d = reinterpret_cast<const char *>(data), *a = d;\r\n        Tn::read(d, mInputSize);\r\n        assert(d == a + length);\r\n    }\r\n\r\n    void gelu::serialize(void* buffer) const\r\n    {\r\n        char* d = static_cast<char*>(buffer), *a = d;\r\n        Tn::write(d, mInputSize);\r\n        assert(d == a + getSerializationSize());\r\n    }\r\n\r\n    size_t gelu::getSerializationSize() const\r\n    {\r\n        return sizeof(mInputSize);\r\n    }\r\n\r\n    int gelu::initialize()\r\n    {\r\n        return 0;\r\n    }\r\n\r\n    Dims gelu::getOutputDimensions(int index, const Dims* inputs, int nbInputDims)\r\n    {\r\n        assert(nbInputDims == 1);\r\n        Dims outputDims;\r\n        outputDims.nbDims = inputs[0].nbDims;\r\n        for (int i = 0; i < inputs[0].nbDims; i++) {\r\n            outputDims.d[i] = inputs[0].d[i];\r\n        }\r\n        return outputDims;\r\n    }\r\n\r\n    // Set plugin namespace\r\n    void gelu::setPluginNamespace(const char* pluginNamespace)\r\n    {\r\n        mPluginNamespace = pluginNamespace;\r\n    }\r\n\r\n    const char* gelu::getPluginNamespace() const\r\n    {\r\n        return mPluginNamespace;\r\n    }\r\n\r\n    // Return the DataType of the plugin output at the requested index\r\n    DataType gelu::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const\r\n    {\r\n        return DataType::kFLOAT;\r\n    }\r\n\r\n    // Return true if output tensor is broadcast across a batch.\r\n    bool gelu::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const\r\n    {\r\n        return false;\r\n    }\r\n\r\n    // Return true if plugin can use 
input that is broadcast across batch without replication.\r\n    bool gelu::canBroadcastInputAcrossBatch(int inputIndex) const\r\n    {\r\n        return false;\r\n    }\r\n\r\n    void gelu::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput)\r\n    {\r\n\r\n        mInputSize = 1;\r\n        for (int i = 0; i < in[0].dims.nbDims; i++) {\r\n            mInputSize *= in[0].dims.d[i];\r\n        }\r\n    }\r\n\r\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\r\n    void gelu::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator)\r\n    {\r\n    }\r\n\r\n    // Detach the plugin object from its execution context.\r\n    void gelu::detachFromContext() {}\r\n\r\n    const char* gelu::getPluginType() const\r\n    {\r\n        return \"geluLayer_TRT\";\r\n    }\r\n\r\n    const char* gelu::getPluginVersion() const\r\n    {\r\n        return \"1\";\r\n    }\r\n\r\n    void gelu::destroy()\r\n    {\r\n        delete this;\r\n    }\r\n\r\n    // Clone the plugin\r\n    IPluginV2IOExt* gelu::clone() const\r\n    {\r\n        gelu *p = new gelu();\r\n        p->setPluginNamespace(mPluginNamespace);\r\n        p->setInputSize(mInputSize);\r\n        return p;\r\n    }\r\n\r\n    __global__ void geluKer(const float *in, float *out, int size) {\r\n        int idx = threadIdx.x + blockIdx.x * blockDim.x;\r\n        if (idx >= size)\r\n            return;\r\n        //x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))\r\n        out[idx] = in[idx] * 0.5 *(1.0 + erf(in[idx]/1.4142135381698608));\r\n    }\r\n    void gelu::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize) {\r\n\r\n        int numElem = batchSize * mInputSize;\r\n        // Launch on the stream supplied by TensorRT rather than the default stream.\r\n        geluKer<<<(numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>\r\n            (inputs[0], output, numElem);\r\n    }\r\n\r\n    int 
gelu::enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream)\r\n    {\r\n        forwardGpu((const float *const *)inputs, (float*)outputs[0], stream, batchSize);\r\n        return 0;\r\n    }\r\n\r\n    PluginFieldCollection geluCreator::mFC{};\r\n    std::vector<PluginField> geluCreator::mPluginAttributes;\r\n\r\n    geluCreator::geluCreator()\r\n    {\r\n        mPluginAttributes.clear();\r\n        mFC.nbFields = mPluginAttributes.size();\r\n        mFC.fields = mPluginAttributes.data();\r\n    }\r\n\r\n    const char* geluCreator::getPluginName() const\r\n    {\r\n            return \"geluLayer_TRT\";\r\n    }\r\n\r\n    const char* geluCreator::getPluginVersion() const\r\n    {\r\n            return \"1\";\r\n    }\r\n\r\n    const PluginFieldCollection* geluCreator::getFieldNames()\r\n    {\r\n            return &mFC;\r\n    }\r\n\r\n    IPluginV2IOExt* geluCreator::createPlugin(const char* name, const PluginFieldCollection* fc)\r\n    {\r\n        gelu* obj = new gelu();\r\n        obj->setPluginNamespace(mNamespace.c_str());\r\n        return obj;\r\n    }\r\n\r\n    IPluginV2IOExt* geluCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength)\r\n    {\r\n        // This object will be deleted when the network is destroyed, which will\r\n        // call gelu::destroy().\r\n        gelu* obj = new gelu(serialData, serialLength);\r\n        obj->setPluginNamespace(mNamespace.c_str());\r\n        return obj;\r\n    }\r\n\r\n\r\n}\r\n\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/gelu.h",
    "content": "#ifndef GELU_H\r\n#define GELU_H\r\n\r\n#include <vector>\r\n#include <string>\r\n#include \"NvInfer.h\"\r\n#include \"myhpp.h\"\r\n#include <assert.h>\r\n#include \"utilsn.h\"\r\n#define M_PI       3.14159265358979323846   // pi\r\nnamespace nvinfer1\r\n{\r\n    class gelu:public IPluginV2IOExt\r\n    {\r\n    public:\r\n        explicit gelu();\r\n        gelu(const void* data, size_t length);\r\n        ~gelu();\r\n        int getNbOutputs() const override\r\n        {\r\n            return 1;\r\n        }\r\n\r\n        Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override;\r\n        int initialize() override;\r\n        virtual void terminate() override {};\r\n        virtual size_t getWorkspaceSize(int maxBatchSize) const override { return 0;}\r\n        virtual int enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream) override;\r\n        virtual size_t getSerializationSize() const override;\r\n        virtual void serialize(void* buffer) const override;\r\n\r\n        bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const override {\r\n            return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\r\n        }\r\n\r\n        const char* getPluginType() const override;\r\n        const char* getPluginVersion() const override;\r\n        void destroy() override;\r\n        IPluginV2IOExt* clone() const override;\r\n        void setPluginNamespace(const char* pluginNamespace) override;\r\n        const char* getPluginNamespace() const override;\r\n        DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const override;\r\n        bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const override;\r\n        bool canBroadcastInputAcrossBatch(int inputIndex) const override;\r\n        void 
attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) override;\r\n        void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) override;\r\n        void detachFromContext() override;\r\n\r\n        void setInputSize(int s) {\r\n            mInputSize = s;\r\n        }\r\n\r\n    private:\r\n        void forwardGpu(const float *const * inputs,float * output, cudaStream_t stream,int batchSize = 1);\r\n        int mThreadCount = 256;\r\n        int mInputSize;\r\n        const char* mPluginNamespace;\r\n    };\r\n\r\n    class geluCreator : public IPluginCreator\r\n    {\r\n        public:\r\n            geluCreator();\r\n            ~geluCreator() override = default;\r\n            const char* getPluginName() const override;\r\n            const char* getPluginVersion() const override;\r\n            const PluginFieldCollection* getFieldNames() override;\r\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) override;\r\n            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;\r\n\r\n            void setPluginNamespace(const char* libNamespace) override\r\n            {\r\n                mNamespace = libNamespace;\r\n            }\r\n\r\n            const char* getPluginNamespace() const override\r\n            {\r\n                return mNamespace.c_str();\r\n            }\r\n\r\n        private:\r\n            std::string mNamespace;\r\n            static PluginFieldCollection mFC;\r\n            static std::vector<PluginField> mPluginAttributes;\r\n    };\r\n    REGISTER_TENSORRT_PLUGIN(geluCreator);\r\n};\r\n#endif // GELU_H\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/gen_wts.py",
    "content": "import torch\r\nimport struct\r\nimport sys\r\n\r\n# Initialize\r\npt_file = sys.argv[1]\r\n# Load model\r\nmodel = torch.load(pt_file, map_location=torch.device('cpu'))['model'].float()  # load to FP32\r\nmodel.to(device).eval()\r\n\r\nwith open(pt_file.split('.')[0] + '.wts', 'w') as f:\r\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\r\n    for k, v in model.state_dict().items():\r\n        vr = v.reshape(-1).cpu().numpy()\r\n        f.write('{} {} '.format(k, len(vr)))\r\n        for vv in vr:\r\n            f.write(' ')\r\n            f.write(struct.pack('>f',float(vv)).hex())\r\n        f.write('\\n')\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/include/dirent.h",
    "content": "/*\r\n * Dirent interface for Microsoft Visual Studio\r\n *\r\n * Copyright (C) 1998-2019 Toni Ronkko\r\n * This file is part of dirent.  Dirent may be freely distributed\r\n * under the MIT license.  For all details and documentation, see\r\n * https://github.com/tronkko/dirent\r\n */\r\n#ifndef DIRENT_H\r\n#define DIRENT_H\r\n\r\n/* Hide warnings about unreferenced local functions */\r\n#if defined(__clang__)\r\n#   pragma clang diagnostic ignored \"-Wunused-function\"\r\n#elif defined(_MSC_VER)\r\n#   pragma warning(disable:4505)\r\n#elif defined(__GNUC__)\r\n#   pragma GCC diagnostic ignored \"-Wunused-function\"\r\n#endif\r\n\r\n/*\r\n * Include windows.h without Windows Sockets 1.1 to prevent conflicts with\r\n * Windows Sockets 2.0.\r\n */\r\n#ifndef WIN32_LEAN_AND_MEAN\r\n#   define WIN32_LEAN_AND_MEAN\r\n#endif\r\n#include <windows.h>\r\n\r\n#include <stdio.h>\r\n#include <stdarg.h>\r\n#include <wchar.h>\r\n#include <string.h>\r\n#include <stdlib.h>\r\n#include <malloc.h>\r\n#include <sys/types.h>\r\n#include <sys/stat.h>\r\n#include <errno.h>\r\n\r\n/* Indicates that d_type field is available in dirent structure */\r\n#define _DIRENT_HAVE_D_TYPE\r\n\r\n/* Indicates that d_namlen field is available in dirent structure */\r\n#define _DIRENT_HAVE_D_NAMLEN\r\n\r\n/* Entries missing from MSVC 6.0 */\r\n#if !defined(FILE_ATTRIBUTE_DEVICE)\r\n#   define FILE_ATTRIBUTE_DEVICE 0x40\r\n#endif\r\n\r\n/* File type and permission flags for stat(), general mask */\r\n#if !defined(S_IFMT)\r\n#   define S_IFMT _S_IFMT\r\n#endif\r\n\r\n/* Directory bit */\r\n#if !defined(S_IFDIR)\r\n#   define S_IFDIR _S_IFDIR\r\n#endif\r\n\r\n/* Character device bit */\r\n#if !defined(S_IFCHR)\r\n#   define S_IFCHR _S_IFCHR\r\n#endif\r\n\r\n/* Pipe bit */\r\n#if !defined(S_IFFIFO)\r\n#   define S_IFFIFO _S_IFFIFO\r\n#endif\r\n\r\n/* Regular file bit */\r\n#if !defined(S_IFREG)\r\n#   define S_IFREG _S_IFREG\r\n#endif\r\n\r\n/* Read permission */\r\n#if 
!defined(S_IREAD)\r\n#   define S_IREAD _S_IREAD\r\n#endif\r\n\r\n/* Write permission */\r\n#if !defined(S_IWRITE)\r\n#   define S_IWRITE _S_IWRITE\r\n#endif\r\n\r\n/* Execute permission */\r\n#if !defined(S_IEXEC)\r\n#   define S_IEXEC _S_IEXEC\r\n#endif\r\n\r\n/* Pipe */\r\n#if !defined(S_IFIFO)\r\n#   define S_IFIFO _S_IFIFO\r\n#endif\r\n\r\n/* Block device */\r\n#if !defined(S_IFBLK)\r\n#   define S_IFBLK 0\r\n#endif\r\n\r\n/* Link */\r\n#if !defined(S_IFLNK)\r\n#   define S_IFLNK 0\r\n#endif\r\n\r\n/* Socket */\r\n#if !defined(S_IFSOCK)\r\n#   define S_IFSOCK 0\r\n#endif\r\n\r\n/* Read user permission */\r\n#if !defined(S_IRUSR)\r\n#   define S_IRUSR S_IREAD\r\n#endif\r\n\r\n/* Write user permission */\r\n#if !defined(S_IWUSR)\r\n#   define S_IWUSR S_IWRITE\r\n#endif\r\n\r\n/* Execute user permission */\r\n#if !defined(S_IXUSR)\r\n#   define S_IXUSR 0\r\n#endif\r\n\r\n/* Read group permission */\r\n#if !defined(S_IRGRP)\r\n#   define S_IRGRP 0\r\n#endif\r\n\r\n/* Write group permission */\r\n#if !defined(S_IWGRP)\r\n#   define S_IWGRP 0\r\n#endif\r\n\r\n/* Execute group permission */\r\n#if !defined(S_IXGRP)\r\n#   define S_IXGRP 0\r\n#endif\r\n\r\n/* Read others permission */\r\n#if !defined(S_IROTH)\r\n#   define S_IROTH 0\r\n#endif\r\n\r\n/* Write others permission */\r\n#if !defined(S_IWOTH)\r\n#   define S_IWOTH 0\r\n#endif\r\n\r\n/* Execute others permission */\r\n#if !defined(S_IXOTH)\r\n#   define S_IXOTH 0\r\n#endif\r\n\r\n/* Maximum length of file name */\r\n#if !defined(PATH_MAX)\r\n#   define PATH_MAX MAX_PATH\r\n#endif\r\n#if !defined(FILENAME_MAX)\r\n#   define FILENAME_MAX MAX_PATH\r\n#endif\r\n#if !defined(NAME_MAX)\r\n#   define NAME_MAX FILENAME_MAX\r\n#endif\r\n\r\n/* File type flags for d_type */\r\n#define DT_UNKNOWN 0\r\n#define DT_REG S_IFREG\r\n#define DT_DIR S_IFDIR\r\n#define DT_FIFO S_IFIFO\r\n#define DT_SOCK S_IFSOCK\r\n#define DT_CHR S_IFCHR\r\n#define DT_BLK S_IFBLK\r\n#define DT_LNK S_IFLNK\r\n\r\n/* Macros for converting between 
st_mode and d_type */\r\n#define IFTODT(mode) ((mode) & S_IFMT)\r\n#define DTTOIF(type) (type)\r\n\r\n/*\r\n * File type macros.  Note that block devices, sockets and links cannot be\r\n * distinguished on Windows and the macros S_ISBLK, S_ISSOCK and S_ISLNK are\r\n * only defined for compatibility.  These macros should always return false\r\n * on Windows.\r\n */\r\n#if !defined(S_ISFIFO)\r\n#   define S_ISFIFO(mode) (((mode) & S_IFMT) == S_IFIFO)\r\n#endif\r\n#if !defined(S_ISDIR)\r\n#   define S_ISDIR(mode) (((mode) & S_IFMT) == S_IFDIR)\r\n#endif\r\n#if !defined(S_ISREG)\r\n#   define S_ISREG(mode) (((mode) & S_IFMT) == S_IFREG)\r\n#endif\r\n#if !defined(S_ISLNK)\r\n#   define S_ISLNK(mode) (((mode) & S_IFMT) == S_IFLNK)\r\n#endif\r\n#if !defined(S_ISSOCK)\r\n#   define S_ISSOCK(mode) (((mode) & S_IFMT) == S_IFSOCK)\r\n#endif\r\n#if !defined(S_ISCHR)\r\n#   define S_ISCHR(mode) (((mode) & S_IFMT) == S_IFCHR)\r\n#endif\r\n#if !defined(S_ISBLK)\r\n#   define S_ISBLK(mode) (((mode) & S_IFMT) == S_IFBLK)\r\n#endif\r\n\r\n/* Return the exact length of the file name without zero terminator */\r\n#define _D_EXACT_NAMLEN(p) ((p)->d_namlen)\r\n\r\n/* Return the maximum size of a file name */\r\n#define _D_ALLOC_NAMLEN(p) ((PATH_MAX)+1)\r\n\r\n\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n\r\n/* Wide-character version */\r\nstruct _wdirent {\r\n    /* Always zero */\r\n    long d_ino;\r\n\r\n    /* File position within stream */\r\n    long d_off;\r\n\r\n    /* Structure size */\r\n    unsigned short d_reclen;\r\n\r\n    /* Length of name without \\0 */\r\n    size_t d_namlen;\r\n\r\n    /* File type */\r\n    int d_type;\r\n\r\n    /* File name */\r\n    wchar_t d_name[PATH_MAX+1];\r\n};\r\ntypedef struct _wdirent _wdirent;\r\n\r\nstruct _WDIR {\r\n    /* Current directory entry */\r\n    struct _wdirent ent;\r\n\r\n    /* Private file data */\r\n    WIN32_FIND_DATAW data;\r\n\r\n    /* True if data is valid */\r\n    int cached;\r\n\r\n    /* Win32 search 
handle */\r\n    HANDLE handle;\r\n\r\n    /* Initial directory name */\r\n    wchar_t *patt;\r\n};\r\ntypedef struct _WDIR _WDIR;\r\n\r\n/* Multi-byte character version */\r\nstruct dirent {\r\n    /* Always zero */\r\n    long d_ino;\r\n\r\n    /* File position within stream */\r\n    long d_off;\r\n\r\n    /* Structure size */\r\n    unsigned short d_reclen;\r\n\r\n    /* Length of name without \\0 */\r\n    size_t d_namlen;\r\n\r\n    /* File type */\r\n    int d_type;\r\n\r\n    /* File name */\r\n    char d_name[PATH_MAX+1];\r\n};\r\ntypedef struct dirent dirent;\r\n\r\nstruct DIR {\r\n    struct dirent ent;\r\n    struct _WDIR *wdirp;\r\n};\r\ntypedef struct DIR DIR;\r\n\r\n\r\n/* Dirent functions */\r\nstatic DIR *opendir (const char *dirname);\r\nstatic _WDIR *_wopendir (const wchar_t *dirname);\r\n\r\nstatic struct dirent *readdir (DIR *dirp);\r\nstatic struct _wdirent *_wreaddir (_WDIR *dirp);\r\n\r\nstatic int readdir_r(\r\n    DIR *dirp, struct dirent *entry, struct dirent **result);\r\nstatic int _wreaddir_r(\r\n    _WDIR *dirp, struct _wdirent *entry, struct _wdirent **result);\r\n\r\nstatic int closedir (DIR *dirp);\r\nstatic int _wclosedir (_WDIR *dirp);\r\n\r\nstatic void rewinddir (DIR* dirp);\r\nstatic void _wrewinddir (_WDIR* dirp);\r\n\r\nstatic int scandir (const char *dirname, struct dirent ***namelist,\r\n    int (*filter)(const struct dirent*),\r\n    int (*compare)(const struct dirent**, const struct dirent**));\r\n\r\nstatic int alphasort (const struct dirent **a, const struct dirent **b);\r\n\r\nstatic int versionsort (const struct dirent **a, const struct dirent **b);\r\n\r\n\r\n/* For compatibility with Symbian */\r\n#define wdirent _wdirent\r\n#define WDIR _WDIR\r\n#define wopendir _wopendir\r\n#define wreaddir _wreaddir\r\n#define wclosedir _wclosedir\r\n#define wrewinddir _wrewinddir\r\n\r\n\r\n/* Internal utility functions */\r\nstatic WIN32_FIND_DATAW *dirent_first (_WDIR *dirp);\r\nstatic WIN32_FIND_DATAW *dirent_next (_WDIR 
*dirp);\r\n\r\nstatic int dirent_mbstowcs_s(\r\n    size_t *pReturnValue,\r\n    wchar_t *wcstr,\r\n    size_t sizeInWords,\r\n    const char *mbstr,\r\n    size_t count);\r\n\r\nstatic int dirent_wcstombs_s(\r\n    size_t *pReturnValue,\r\n    char *mbstr,\r\n    size_t sizeInBytes,\r\n    const wchar_t *wcstr,\r\n    size_t count);\r\n\r\nstatic void dirent_set_errno (int error);\r\n\r\n\r\n/*\r\n * Open directory stream DIRNAME for read and return a pointer to the\r\n * internal working area that is used to retrieve individual directory\r\n * entries.\r\n */\r\nstatic _WDIR*\r\n_wopendir(\r\n    const wchar_t *dirname)\r\n{\r\n    _WDIR *dirp;\r\n    DWORD n;\r\n    wchar_t *p;\r\n\r\n    /* Must have directory name */\r\n    if (dirname == NULL  ||  dirname[0] == '\\0') {\r\n        dirent_set_errno (ENOENT);\r\n        return NULL;\r\n    }\r\n\r\n    /* Allocate new _WDIR structure */\r\n    dirp = (_WDIR*) malloc (sizeof (struct _WDIR));\r\n    if (!dirp) {\r\n        return NULL;\r\n    }\r\n\r\n    /* Reset _WDIR structure */\r\n    dirp->handle = INVALID_HANDLE_VALUE;\r\n    dirp->patt = NULL;\r\n    dirp->cached = 0;\r\n\r\n    /*\r\n     * Compute the length of full path plus zero terminator\r\n     *\r\n     * Note that on WinRT there's no way to convert relative paths\r\n     * into absolute paths, so just assume it is an absolute path.\r\n     */\r\n#if WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_DESKTOP)\r\n    /* Desktop */\r\n    n = GetFullPathNameW (dirname, 0, NULL, NULL);\r\n#else\r\n    /* WinRT */\r\n    n = wcslen (dirname);\r\n#endif\r\n\r\n    /* Allocate room for absolute directory name and search pattern */\r\n    dirp->patt = (wchar_t*) malloc (sizeof (wchar_t) * n + 16);\r\n    if (dirp->patt == NULL) {\r\n        goto exit_closedir;\r\n    }\r\n\r\n    /*\r\n     * Convert relative directory name to an absolute one.  
This\r\n     * allows rewinddir() to function correctly even when current\r\n     * working directory is changed between opendir() and rewinddir().\r\n     *\r\n     * Note that on WinRT there's no way to convert relative paths\r\n     * into absolute paths, so just assume it is an absolute path.\r\n     */\r\n#if WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_DESKTOP)\r\n    /* Desktop */\r\n    n = GetFullPathNameW (dirname, n, dirp->patt, NULL);\r\n    if (n <= 0) {\r\n        goto exit_closedir;\r\n    }\r\n#else\r\n    /* WinRT */\r\n    wcsncpy_s (dirp->patt, n+1, dirname, n);\r\n#endif\r\n\r\n    /* Append search pattern \\* to the directory name */\r\n    p = dirp->patt + n;\r\n    switch (p[-1]) {\r\n    case '\\\\':\r\n    case '/':\r\n    case ':':\r\n        /* Directory ends in path separator, e.g. c:\\temp\\ */\r\n        /*NOP*/;\r\n        break;\r\n\r\n    default:\r\n        /* Directory name doesn't end in path separator */\r\n        *p++ = '\\\\';\r\n    }\r\n    *p++ = '*';\r\n    *p = '\\0';\r\n\r\n    /* Open directory stream and retrieve the first entry */\r\n    if (!dirent_first (dirp)) {\r\n        goto exit_closedir;\r\n    }\r\n\r\n    /* Success */\r\n    return dirp;\r\n\r\n    /* Failure */\r\nexit_closedir:\r\n    _wclosedir (dirp);\r\n    return NULL;\r\n}\r\n\r\n/*\r\n * Read next directory entry.\r\n *\r\n * Returns pointer to static directory entry which may be overwritten by\r\n * subsequent calls to _wreaddir().\r\n */\r\nstatic struct _wdirent*\r\n_wreaddir(\r\n    _WDIR *dirp)\r\n{\r\n    struct _wdirent *entry;\r\n\r\n    /*\r\n     * Read directory entry to buffer.  We can safely ignore the return value\r\n     * as entry will be set to NULL in case of error.\r\n     */\r\n    (void) _wreaddir_r (dirp, &dirp->ent, &entry);\r\n\r\n    /* Return pointer to statically allocated directory entry */\r\n    return entry;\r\n}\r\n\r\n/*\r\n * Read next directory entry.\r\n *\r\n * Returns zero on success.  
If end of directory stream is reached, then sets\r\n * result to NULL and returns zero.\r\n */\r\nstatic int\r\n_wreaddir_r(\r\n    _WDIR *dirp,\r\n    struct _wdirent *entry,\r\n    struct _wdirent **result)\r\n{\r\n    WIN32_FIND_DATAW *datap;\r\n\r\n    /* Read next directory entry */\r\n    datap = dirent_next (dirp);\r\n    if (datap) {\r\n        size_t n;\r\n        DWORD attr;\r\n\r\n        /*\r\n         * Copy file name as wide-character string.  If the file name is too\r\n         * long to fit in to the destination buffer, then truncate file name\r\n         * to PATH_MAX characters and zero-terminate the buffer.\r\n         */\r\n        n = 0;\r\n        while (n < PATH_MAX  &&  datap->cFileName[n] != 0) {\r\n            entry->d_name[n] = datap->cFileName[n];\r\n            n++;\r\n        }\r\n        entry->d_name[n] = 0;\r\n\r\n        /* Length of file name excluding zero terminator */\r\n        entry->d_namlen = n;\r\n\r\n        /* File type */\r\n        attr = datap->dwFileAttributes;\r\n        if ((attr & FILE_ATTRIBUTE_DEVICE) != 0) {\r\n            entry->d_type = DT_CHR;\r\n        } else if ((attr & FILE_ATTRIBUTE_DIRECTORY) != 0) {\r\n            entry->d_type = DT_DIR;\r\n        } else {\r\n            entry->d_type = DT_REG;\r\n        }\r\n\r\n        /* Reset dummy fields */\r\n        entry->d_ino = 0;\r\n        entry->d_off = 0;\r\n        entry->d_reclen = sizeof (struct _wdirent);\r\n\r\n        /* Set result address */\r\n        *result = entry;\r\n\r\n    } else {\r\n\r\n        /* Return NULL to indicate end of directory */\r\n        *result = NULL;\r\n\r\n    }\r\n\r\n    return /*OK*/0;\r\n}\r\n\r\n/*\r\n * Close directory stream opened by opendir() function.  
This invalidates the\r\n * DIR structure as well as any directory entry read previously by\r\n * _wreaddir().\r\n */\r\nstatic int\r\n_wclosedir(\r\n    _WDIR *dirp)\r\n{\r\n    int ok;\r\n    if (dirp) {\r\n\r\n        /* Release search handle */\r\n        if (dirp->handle != INVALID_HANDLE_VALUE) {\r\n            FindClose (dirp->handle);\r\n        }\r\n\r\n        /* Release search pattern */\r\n        free (dirp->patt);\r\n\r\n        /* Release directory structure */\r\n        free (dirp);\r\n        ok = /*success*/0;\r\n\r\n    } else {\r\n\r\n        /* Invalid directory stream */\r\n        dirent_set_errno (EBADF);\r\n        ok = /*failure*/-1;\r\n\r\n    }\r\n    return ok;\r\n}\r\n\r\n/*\r\n * Rewind directory stream such that _wreaddir() returns the very first\r\n * file name again.\r\n */\r\nstatic void\r\n_wrewinddir(\r\n    _WDIR* dirp)\r\n{\r\n    if (dirp) {\r\n        /* Release existing search handle */\r\n        if (dirp->handle != INVALID_HANDLE_VALUE) {\r\n            FindClose (dirp->handle);\r\n        }\r\n\r\n        /* Open new search handle */\r\n        dirent_first (dirp);\r\n    }\r\n}\r\n\r\n/* Get first directory entry (internal) */\r\nstatic WIN32_FIND_DATAW*\r\ndirent_first(\r\n    _WDIR *dirp)\r\n{\r\n    WIN32_FIND_DATAW *datap;\r\n    DWORD error;\r\n\r\n    /* Open directory and retrieve the first entry */\r\n    dirp->handle = FindFirstFileExW(\r\n        dirp->patt, FindExInfoStandard, &dirp->data,\r\n        FindExSearchNameMatch, NULL, 0);\r\n    if (dirp->handle != INVALID_HANDLE_VALUE) {\r\n\r\n        /* a directory entry is now waiting in memory */\r\n        datap = &dirp->data;\r\n        dirp->cached = 1;\r\n\r\n    } else {\r\n\r\n        /* Failed to open directory: no directory entry in memory */\r\n        dirp->cached = 0;\r\n        datap = NULL;\r\n\r\n        /* Set error code */\r\n        error = GetLastError ();\r\n        switch (error) {\r\n        case ERROR_ACCESS_DENIED:\r\n            /* No 
read access to directory */\r\n            dirent_set_errno (EACCES);\r\n            break;\r\n\r\n        case ERROR_DIRECTORY:\r\n            /* Directory name is invalid */\r\n            dirent_set_errno (ENOTDIR);\r\n            break;\r\n\r\n        case ERROR_PATH_NOT_FOUND:\r\n        default:\r\n            /* Cannot find the file */\r\n            dirent_set_errno (ENOENT);\r\n        }\r\n\r\n    }\r\n    return datap;\r\n}\r\n\r\n/*\r\n * Get next directory entry (internal).\r\n *\r\n * Returns a pointer to the entry data, or NULL when the end of the directory\r\n * stream is reached or an error occurs.\r\n */\r\nstatic WIN32_FIND_DATAW*\r\ndirent_next(\r\n    _WDIR *dirp)\r\n{\r\n    WIN32_FIND_DATAW *p;\r\n\r\n    /* Get next directory entry */\r\n    if (dirp->cached != 0) {\r\n\r\n        /* A valid directory entry already in memory */\r\n        p = &dirp->data;\r\n        dirp->cached = 0;\r\n\r\n    } else if (dirp->handle != INVALID_HANDLE_VALUE) {\r\n\r\n        /* Get the next directory entry from stream */\r\n        if (FindNextFileW (dirp->handle, &dirp->data) != FALSE) {\r\n            /* Got a file */\r\n            p = &dirp->data;\r\n        } else {\r\n            /* The very last entry has been processed or an error occurred */\r\n            FindClose (dirp->handle);\r\n            dirp->handle = INVALID_HANDLE_VALUE;\r\n            p = NULL;\r\n        }\r\n\r\n    } else {\r\n\r\n        /* End of directory stream reached */\r\n        p = NULL;\r\n\r\n    }\r\n\r\n    return p;\r\n}\r\n\r\n/*\r\n * Open directory stream using plain old C-string.\r\n */\r\nstatic DIR*\r\nopendir(\r\n    const char *dirname)\r\n{\r\n    struct DIR *dirp;\r\n\r\n    /* Must have directory name */\r\n    if (dirname == NULL  ||  dirname[0] == '\\0') {\r\n        dirent_set_errno (ENOENT);\r\n        return NULL;\r\n    }\r\n\r\n    /* Allocate memory for DIR structure */\r\n    dirp = (DIR*) malloc (sizeof (struct DIR));\r\n    if (!dirp) {\r\n        return NULL;\r\n    }\r\n    {\r\n        int error;\r\n        wchar_t wname[PATH_MAX + 1];\r\n        size_t 
n;\r\n\r\n        /* Convert directory name to wide-character string */\r\n        error = dirent_mbstowcs_s(\r\n            &n, wname, PATH_MAX + 1, dirname, PATH_MAX + 1);\r\n        if (error) {\r\n            /*\r\n             * Cannot convert file name to wide-character string.  This\r\n             * occurs if the string contains invalid multi-byte sequences or\r\n             * the output buffer is too small to contain the resulting\r\n             * string.\r\n             */\r\n            goto exit_free;\r\n        }\r\n\r\n\r\n        /* Open directory stream using wide-character name */\r\n        dirp->wdirp = _wopendir (wname);\r\n        if (!dirp->wdirp) {\r\n            goto exit_free;\r\n        }\r\n\r\n    }\r\n\r\n    /* Success */\r\n    return dirp;\r\n\r\n    /* Failure */\r\nexit_free:\r\n    free (dirp);\r\n    return NULL;\r\n}\r\n\r\n/*\r\n * Read next directory entry.\r\n */\r\nstatic struct dirent*\r\nreaddir(\r\n    DIR *dirp)\r\n{\r\n    struct dirent *entry;\r\n\r\n    /*\r\n     * Read directory entry to buffer.  We can safely ignore the return value\r\n     * as entry will be set to NULL in case of error.\r\n     */\r\n    (void) readdir_r (dirp, &dirp->ent, &entry);\r\n\r\n    /* Return pointer to statically allocated directory entry */\r\n    return entry;\r\n}\r\n\r\n/*\r\n * Read next directory entry into caller-allocated buffer.\r\n *\r\n * Returns zero on success.  
If the end of directory stream is reached, then\r\n * sets result to NULL and returns zero.\r\n */\r\nstatic int\r\nreaddir_r(\r\n    DIR *dirp,\r\n    struct dirent *entry,\r\n    struct dirent **result)\r\n{\r\n    WIN32_FIND_DATAW *datap;\r\n\r\n    /* Read next directory entry */\r\n    datap = dirent_next (dirp->wdirp);\r\n    if (datap) {\r\n        size_t n;\r\n        int error;\r\n\r\n        /* Attempt to convert file name to multi-byte string */\r\n        error = dirent_wcstombs_s(\r\n            &n, entry->d_name, PATH_MAX + 1, datap->cFileName, PATH_MAX + 1);\r\n\r\n        /*\r\n         * If the file name cannot be represented by a multi-byte string,\r\n         * then attempt to use old 8+3 file name.  This allows traditional\r\n         * Unix-code to access some file names despite Unicode\r\n         * characters, although file names may seem unfamiliar to the user.\r\n         *\r\n         * Beware that the code below cannot come up with a short file\r\n         * name unless the file system provides one.  
At least\r\n         * VirtualBox shared folders fail to do this.\r\n         */\r\n        if (error  &&  datap->cAlternateFileName[0] != '\\0') {\r\n            error = dirent_wcstombs_s(\r\n                &n, entry->d_name, PATH_MAX + 1,\r\n                datap->cAlternateFileName, PATH_MAX + 1);\r\n        }\r\n\r\n        if (!error) {\r\n            DWORD attr;\r\n\r\n            /* Length of file name excluding zero terminator */\r\n            entry->d_namlen = n - 1;\r\n\r\n            /* File attributes */\r\n            attr = datap->dwFileAttributes;\r\n            if ((attr & FILE_ATTRIBUTE_DEVICE) != 0) {\r\n                entry->d_type = DT_CHR;\r\n            } else if ((attr & FILE_ATTRIBUTE_DIRECTORY) != 0) {\r\n                entry->d_type = DT_DIR;\r\n            } else {\r\n                entry->d_type = DT_REG;\r\n            }\r\n\r\n            /* Reset dummy fields */\r\n            entry->d_ino = 0;\r\n            entry->d_off = 0;\r\n            entry->d_reclen = sizeof (struct dirent);\r\n\r\n        } else {\r\n\r\n            /*\r\n             * Cannot convert file name to multi-byte string so construct\r\n             * an erroneous directory entry and return that.  
Note that\r\n             * we cannot return NULL as that would stop the processing\r\n             * of directory entries completely.\r\n             */\r\n            entry->d_name[0] = '?';\r\n            entry->d_name[1] = '\\0';\r\n            entry->d_namlen = 1;\r\n            entry->d_type = DT_UNKNOWN;\r\n            entry->d_ino = 0;\r\n            entry->d_off = -1;\r\n            entry->d_reclen = 0;\r\n\r\n        }\r\n\r\n        /* Return pointer to directory entry */\r\n        *result = entry;\r\n\r\n    } else {\r\n\r\n        /* No more directory entries */\r\n        *result = NULL;\r\n\r\n    }\r\n\r\n    return /*OK*/0;\r\n}\r\n\r\n/*\r\n * Close directory stream.\r\n */\r\nstatic int\r\nclosedir(\r\n    DIR *dirp)\r\n{\r\n    int ok;\r\n    if (dirp) {\r\n\r\n        /* Close wide-character directory stream */\r\n        ok = _wclosedir (dirp->wdirp);\r\n        dirp->wdirp = NULL;\r\n\r\n        /* Release multi-byte character version */\r\n        free (dirp);\r\n\r\n    } else {\r\n\r\n        /* Invalid directory stream */\r\n        dirent_set_errno (EBADF);\r\n        ok = /*failure*/-1;\r\n\r\n    }\r\n    return ok;\r\n}\r\n\r\n/*\r\n * Rewind directory stream to beginning.\r\n */\r\nstatic void\r\nrewinddir(\r\n    DIR* dirp)\r\n{\r\n    /* Rewind wide-character string directory stream */\r\n    _wrewinddir (dirp->wdirp);\r\n}\r\n\r\n/*\r\n * Scan directory for entries.\r\n */\r\nstatic int\r\nscandir(\r\n    const char *dirname,\r\n    struct dirent ***namelist,\r\n    int (*filter)(const struct dirent*),\r\n    int (*compare)(const struct dirent**, const struct dirent**))\r\n{\r\n    struct dirent **files = NULL;\r\n    size_t size = 0;\r\n    size_t allocated = 0;\r\n    const size_t init_size = 1;\r\n    DIR *dir = NULL;\r\n    struct dirent *entry;\r\n    struct dirent *tmp = NULL;\r\n    size_t i;\r\n    int result = 0;\r\n\r\n    /* Open directory stream */\r\n    dir = opendir (dirname);\r\n    if (dir) {\r\n\r\n        /* 
Read directory entries to memory */\r\n        while (1) {\r\n\r\n            /* Enlarge pointer table to make room for another pointer */\r\n            if (size >= allocated) {\r\n                void *p;\r\n                size_t num_entries;\r\n\r\n                /* Compute number of entries in the enlarged pointer table */\r\n                if (size < init_size) {\r\n                    /* Allocate initial pointer table */\r\n                    num_entries = init_size;\r\n                } else {\r\n                    /* Double the size */\r\n                    num_entries = size * 2;\r\n                }\r\n\r\n                /* Allocate first pointer table or enlarge existing table */\r\n                p = realloc (files, sizeof (void*) * num_entries);\r\n                if (p != NULL) {\r\n                    /* Got the memory */\r\n                    files = (dirent**) p;\r\n                    allocated = num_entries;\r\n                } else {\r\n                    /* Out of memory */\r\n                    result = -1;\r\n                    break;\r\n                }\r\n\r\n            }\r\n\r\n            /* Allocate room for temporary directory entry */\r\n            if (tmp == NULL) {\r\n                tmp = (struct dirent*) malloc (sizeof (struct dirent));\r\n                if (tmp == NULL) {\r\n                    /* Cannot allocate temporary directory entry */\r\n                    result = -1;\r\n                    break;\r\n                }\r\n            }\r\n\r\n            /* Read directory entry to temporary area */\r\n            if (readdir_r (dir, tmp, &entry) == /*OK*/0) {\r\n\r\n                /* Did we get an entry? 
*/\r\n                if (entry != NULL) {\r\n                    int pass;\r\n\r\n                    /* Determine whether to include the entry in result */\r\n                    if (filter) {\r\n                        /* Let the filter function decide */\r\n                        pass = filter (tmp);\r\n                    } else {\r\n                        /* No filter function, include everything */\r\n                        pass = 1;\r\n                    }\r\n\r\n                    if (pass) {\r\n                        /* Store the temporary entry to pointer table */\r\n                        files[size++] = tmp;\r\n                        tmp = NULL;\r\n\r\n                        /* Keep up with the number of files */\r\n                        result++;\r\n                    }\r\n\r\n                } else {\r\n\r\n                    /*\r\n                     * End of directory stream reached => sort entries and\r\n                     * exit.\r\n                     */\r\n                    qsort (files, size, sizeof (void*),\r\n                        (int (*) (const void*, const void*)) compare);\r\n                    break;\r\n\r\n                }\r\n\r\n            } else {\r\n                /* Error reading directory entry */\r\n                result = /*Error*/ -1;\r\n                break;\r\n            }\r\n\r\n        }\r\n\r\n    } else {\r\n        /* Cannot open directory */\r\n        result = /*Error*/ -1;\r\n    }\r\n\r\n    /* Release temporary directory entry */\r\n    free (tmp);\r\n\r\n    /* Release allocated memory on error */\r\n    if (result < 0) {\r\n        for (i = 0; i < size; i++) {\r\n            free (files[i]);\r\n        }\r\n        free (files);\r\n        files = NULL;\r\n    }\r\n\r\n    /* Close directory stream */\r\n    if (dir) {\r\n        closedir (dir);\r\n    }\r\n\r\n    /* Pass pointer table to caller */\r\n    if (namelist) {\r\n        *namelist = files;\r\n    }\r\n    return 
result;\r\n}\r\n\r\n/* Alphabetical sorting */\r\nstatic int\r\nalphasort(\r\n    const struct dirent **a, const struct dirent **b)\r\n{\r\n    return strcoll ((*a)->d_name, (*b)->d_name);\r\n}\r\n\r\n/* Sort versions */\r\nstatic int\r\nversionsort(\r\n    const struct dirent **a, const struct dirent **b)\r\n{\r\n    /* FIXME: implement strverscmp and use that */\r\n    return alphasort (a, b);\r\n}\r\n\r\n/* Convert multi-byte string to wide-character string */\r\nstatic int\r\ndirent_mbstowcs_s(\r\n    size_t *pReturnValue,\r\n    wchar_t *wcstr,\r\n    size_t sizeInWords,\r\n    const char *mbstr,\r\n    size_t count)\r\n{\r\n    int error;\r\n\r\n#if defined(_MSC_VER)  &&  _MSC_VER >= 1400\r\n\r\n    /* Microsoft Visual Studio 2005 or later */\r\n    error = mbstowcs_s (pReturnValue, wcstr, sizeInWords, mbstr, count);\r\n\r\n#else\r\n\r\n    /* Older Visual Studio or non-Microsoft compiler */\r\n    size_t n;\r\n\r\n    /* Convert to wide-character string (or count characters) */\r\n    n = mbstowcs (wcstr, mbstr, sizeInWords);\r\n    if (!wcstr  ||  n < count) {\r\n\r\n        /* Zero-terminate output buffer */\r\n        if (wcstr  &&  sizeInWords) {\r\n            if (n >= sizeInWords) {\r\n                n = sizeInWords - 1;\r\n            }\r\n            wcstr[n] = 0;\r\n        }\r\n\r\n        /* Length of resulting wide-character string WITH zero terminator */\r\n        if (pReturnValue) {\r\n            *pReturnValue = n + 1;\r\n        }\r\n\r\n        /* Success */\r\n        error = 0;\r\n\r\n    } else {\r\n\r\n        /* Could not convert string */\r\n        error = 1;\r\n\r\n    }\r\n\r\n#endif\r\n    return error;\r\n}\r\n\r\n/* Convert wide-character string to multi-byte string */\r\nstatic int\r\ndirent_wcstombs_s(\r\n    size_t *pReturnValue,\r\n    char *mbstr,\r\n    size_t sizeInBytes, /* max size of mbstr */\r\n    const wchar_t *wcstr,\r\n    size_t count)\r\n{\r\n    int error;\r\n\r\n#if defined(_MSC_VER)  &&  _MSC_VER >= 1400\r\n\r\n 
   /* Microsoft Visual Studio 2005 or later */\r\n    error = wcstombs_s (pReturnValue, mbstr, sizeInBytes, wcstr, count);\r\n\r\n#else\r\n\r\n    /* Older Visual Studio or non-Microsoft compiler */\r\n    size_t n;\r\n\r\n    /* Convert to multi-byte string (or count the number of bytes needed) */\r\n    n = wcstombs (mbstr, wcstr, sizeInBytes);\r\n    if (!mbstr  ||  n < count) {\r\n\r\n        /* Zero-terminate output buffer */\r\n        if (mbstr  &&  sizeInBytes) {\r\n            if (n >= sizeInBytes) {\r\n                n = sizeInBytes - 1;\r\n            }\r\n            mbstr[n] = '\\0';\r\n        }\r\n\r\n        /* Length of resulting multi-byte string WITH zero terminator */\r\n        if (pReturnValue) {\r\n            *pReturnValue = n + 1;\r\n        }\r\n\r\n        /* Success */\r\n        error = 0;\r\n\r\n    } else {\r\n\r\n        /* Cannot convert string */\r\n        error = 1;\r\n\r\n    }\r\n\r\n#endif\r\n    return error;\r\n}\r\n\r\n/* Set errno variable */\r\nstatic void\r\ndirent_set_errno(\r\n    int error)\r\n{\r\n#if defined(_MSC_VER)  &&  _MSC_VER >= 1400\r\n\r\n    /* Microsoft Visual Studio 2005 and later */\r\n    _set_errno (error);\r\n\r\n#else\r\n\r\n    /* Non-Microsoft compiler or older Microsoft compiler */\r\n    errno = error;\r\n\r\n#endif\r\n}\r\n\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif /*DIRENT_H*/\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/layerNorm.cu",
    "content": "#include <assert.h>\r\n#include \"layerNorm.h\"\r\n#include \"utilsn.h\"\r\n#include <vector>\r\n\r\n\r\n\r\n\r\n\r\nnamespace nvinfer1\r\n{\r\n\r\nlayernorm::layernorm()\r\n{\r\n}\r\nlayernorm::~layernorm()\r\n{\r\n\r\n}\r\nlayernorm::layernorm(const void* data, size_t length)\r\n{\r\n    const char *d = reinterpret_cast<const char *>(data), *a = d;\r\n    Tn::read(d, mInputSize);\r\n    Tn::read(d,Length);\r\n\r\n    assert(d == a + length);\r\n}\r\nint layernorm::initialize()\r\n{\r\n    return 0;\r\n}\r\nvoid layernorm::serialize(void* buffer) const\r\n{\r\n    char* d = static_cast<char*>(buffer), *a = d;\r\n    Tn::write(d, mInputSize);\r\n    Tn::write(d,Length);\r\n    assert(d == a + getSerializationSize());\r\n}\r\nsize_t layernorm::getSerializationSize() const\r\n{\r\n    return sizeof(mInputSize) + sizeof(Length);\r\n}\r\nDims layernorm::getOutputDimensions(int index, const Dims* inputs, int nbInputDims)\r\n{\r\n//    outputDims.nbDims  = inputs[0].nbDims;\r\n//    outputDims.d[0] = inputs[0].d[0];\r\n//    for (int var = 1; var < inputs[0].nbDims; ++var) {\r\n//        outputDims.d[var] = 1;\r\n//    }\r\n    return Dims2{inputs[0].d[0],1};\r\n}\r\nvoid layernorm::setPluginNamespace(const char* pluginNamespace)\r\n{\r\n    mPluginNamespace = pluginNamespace;\r\n}\r\nconst char* layernorm::getPluginNamespace() const\r\n{\r\n    return mPluginNamespace;\r\n}\r\nconst char* layernorm::getPluginType() const\r\n{\r\n    return \"layerNorm_trt\";\r\n}\r\nconst char* layernorm::getPluginVersion() const\r\n{\r\n    return \"1\";\r\n}\r\nDataType layernorm::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const\r\n{\r\n    return inputTypes[0] ;//== nvinfer1::DataType::kFLOAT ? 
nvinfer1::DataType::kFLOAT : nvinfer1::DataType::kHALF;\r\n}\r\nvoid layernorm::destroy()\r\n{\r\n    delete this;\r\n}\r\nIPluginV2IOExt* layernorm::clone() const\r\n{\r\n    layernorm *ln = new layernorm();\r\n    ln->setPluginNamespace(mPluginNamespace);\r\n    ln->setInputSize(mInputSize,Length);\r\n    return ln;\r\n}\r\nbool layernorm::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const\r\n{\r\n    return false;\r\n}\r\nbool layernorm::canBroadcastInputAcrossBatch(int inputIndex) const\r\n{\r\n    return false;\r\n}\r\nvoid layernorm::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator)\r\n{}\r\nvoid layernorm::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput)\r\n{\r\n\r\n    int size = 1;\r\n    for(int i = 0 ; i < in[0].dims.nbDims ; i++)\r\n    {\r\n        size *= in[0].dims.d[i];\r\n    }\r\n    mInputSize = size;\r\n    Length = in[0].dims.d[in[0].dims.nbDims - 1];\r\n}\r\nvoid layernorm::detachFromContext()\r\n{}\r\n\r\n__device__ welford welford_update(welford a, const float *currValue, int length)\r\n{\r\n    #pragma unroll\r\n    for(int i = 0; i < length; i++){\r\n        a.count += 1;\r\n        float delta = currValue[i] - a.mean;\r\n        a.mean += delta / a.count;\r\n        float delta2 = currValue[i] - a.mean;\r\n        a.M2 += delta * delta2;\r\n    }\r\n    return a;\r\n}\r\n__device__ void mean_std(float* mean, float *std, const float *currValue,int l,int count = 0, float m = 0.0, float s = 0.0)\r\n{\r\n    #pragma unroll\r\n    for(int i = 0; i < l; i++){\r\n        count += 1;\r\n        float delta = currValue[i] - m;\r\n        m += delta / count;\r\n        float delta2 = currValue[i] - m;\r\n        s += delta * delta2;\r\n    }\r\n    *mean = m;\r\n    *std = sqrt((s / count) + 1e-5);\r\n}\r\n__global__ void lnCudaKer(const float *in, float *mean, float *std, int size,int l)\r\n{\r\n    
int idx = threadIdx.x + blockIdx.x * blockDim.x;\r\n    if (idx >= size)\r\n        return;\r\n    mean_std(&mean[idx],&std[idx],in+idx*l,l);\r\n    //printf(\"idx = %d,mean = %f, std = %f\\n\",idx,mean[idx],std[idx]);\r\n}\r\nvoid layernorm::forwardGpu(const float *const *inputs, float *mean, float *std, cudaStream_t stream, int batchSize)\r\n{\r\n    int numElem = batchSize * mInputSize/Length;\r\n\r\n    // Launch on the stream passed to enqueue() rather than the default stream.\r\n    lnCudaKer<<<(numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>\r\n        (inputs[0], mean,std, numElem,Length);\r\n}\r\nint layernorm::enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream)\r\n{\r\n    forwardGpu((const float *const *)inputs, (float*)outputs[0], (float*)outputs[1], stream, batchSize);\r\n    return 0;\r\n}\r\n\r\nPluginFieldCollection layernormCreator::mFC{};\r\nstd::vector<PluginField> layernormCreator::mPluginAttributes;\r\nlayernormCreator::layernormCreator()\r\n{\r\n    mPluginAttributes.clear();\r\n\r\n    mFC.nbFields = mPluginAttributes.size();\r\n    mFC.fields = mPluginAttributes.data();\r\n}\r\n\r\nconst char* layernormCreator::getPluginName() const\r\n{\r\n    return \"layerNorm_trt\";\r\n}\r\nconst char* layernormCreator::getPluginVersion() const\r\n{\r\n    return \"1\";\r\n}\r\nconst PluginFieldCollection* layernormCreator::getFieldNames()\r\n{\r\n    return &mFC;\r\n}\r\nIPluginV2IOExt* layernormCreator::createPlugin(const char* name, const PluginFieldCollection* fc)\r\n{\r\n    layernorm* obj = new layernorm();\r\n    obj->setPluginNamespace(mNamespace.c_str());\r\n\r\n    return obj;\r\n}\r\nIPluginV2IOExt* layernormCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength)\r\n{\r\n    layernorm* obj = new layernorm(serialData, serialLength);\r\n    obj->setPluginNamespace(mNamespace.c_str());\r\n    return obj;\r\n}\r\n\r\n\r\n\r\n\r\n\r\n\r\n}\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/layerNorm.h",
    "content": "#ifndef LAYERNORM_H\r\n#define LAYERNORM_H\r\n\r\n#include <vector>\r\n#include <string>\r\n#include <iostream>\r\n#include <NvInfer.h>\r\n#include <memory>\r\n#include <string.h>\r\n#include <cstdint>\r\n#include <stdlib.h>\r\n\r\nusing namespace std;\r\n\r\nstruct welford\r\n{\r\n    int count = 0;\r\n    double mean = 0.f;\r\n    double M2 = 0.f;\r\n};\r\n\r\nnamespace nvinfer1{\r\nclass layernorm : public IPluginV2IOExt\r\n{\r\npublic:\r\n    layernorm();\r\n    layernorm(const void* data, size_t length);\r\n    ~layernorm();\r\n    int getNbOutputs() const override\r\n    {\r\n        return 2;\r\n    }\r\n\r\n    Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override;\r\n\r\n    int initialize() override;\r\n\r\n    virtual void terminate() override {};\r\n\r\n    virtual size_t getWorkspaceSize(int maxBatchSize) const override { return 0; }\r\n\r\n    virtual int enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream) override;\r\n\r\n    virtual size_t getSerializationSize() const override;\r\n\r\n    virtual void serialize(void* buffer) const override;\r\n\r\n    bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const override {\r\n        return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\r\n    }\r\n\r\n    void setPluginNamespace(const char* pluginNamespace) override;\r\n\r\n    const char* getPluginNamespace() const override;\r\n\r\n    const char* getPluginType() const override;\r\n\r\n    const char* getPluginVersion() const override;\r\n\r\n    void destroy() override;\r\n\r\n    IPluginV2IOExt* clone() const override;\r\n\r\n\r\n\r\n    DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const override;\r\n\r\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const override;\r\n\r\n   
 bool canBroadcastInputAcrossBatch(int inputIndex) const override;\r\n\r\n    void attachToContext(\r\n        cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) override;\r\n\r\n    void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) override;\r\n\r\n    void detachFromContext() override;\r\n\r\n    void setInputSize(int s, int l) {\r\n        mInputSize = s;\r\n        Length = l;\r\n    }\r\n\r\n\r\nprivate:\r\n    void forwardGpu(const float *const * inputs, float *mean, float *std, cudaStream_t stream, int batchSize = 1);\r\n    int mThreadCount = 256;\r\n    int mInputSize;\r\n    int Length;\r\n    Dims outputDims ;\r\n    const char* mPluginNamespace;\r\n};\r\nclass layernormCreator : public IPluginCreator\r\n{\r\n    public:\r\n        layernormCreator();\r\n\r\n        ~layernormCreator() override = default;\r\n\r\n        const char* getPluginName() const override;\r\n\r\n        const char* getPluginVersion() const override;\r\n\r\n        const PluginFieldCollection* getFieldNames() override;\r\n\r\n        IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) override;\r\n\r\n        IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;\r\n\r\n        void setPluginNamespace(const char* libNamespace) override\r\n        {\r\n            mNamespace = libNamespace;\r\n        }\r\n\r\n        const char* getPluginNamespace() const override\r\n        {\r\n            return mNamespace.c_str();\r\n        }\r\n\r\n    private:\r\n        std::string mNamespace;\r\n        static PluginFieldCollection mFC;\r\n        static std::vector<PluginField> mPluginAttributes;\r\n\r\n};\r\n\r\nREGISTER_TENSORRT_PLUGIN(layernormCreator);\r\n};\r\n\r\n#endif // LAYERNORM_H\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/logging.h",
    "content": "/*\r\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\r\n *\r\n * Licensed under the Apache License, Version 2.0 (the \"License\");\r\n * you may not use this file except in compliance with the License.\r\n * You may obtain a copy of the License at\r\n *\r\n *     http://www.apache.org/licenses/LICENSE-2.0\r\n *\r\n * Unless required by applicable law or agreed to in writing, software\r\n * distributed under the License is distributed on an \"AS IS\" BASIS,\r\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\r\n * See the License for the specific language governing permissions and\r\n * limitations under the License.\r\n */\r\n\r\n#ifndef TENSORRT_LOGGING_H\r\n#define TENSORRT_LOGGING_H\r\n\r\n#include \"NvInferRuntimeCommon.h\"\r\n#include <cassert>\r\n#include <ctime>\r\n#include <iomanip>\r\n#include <iostream>\r\n#include <ostream>\r\n#include <sstream>\r\n#include <string>\r\n\r\nusing Severity = nvinfer1::ILogger::Severity;\r\n\r\nclass LogStreamConsumerBuffer : public std::stringbuf\r\n{\r\npublic:\r\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\r\n        : mOutput(stream)\r\n        , mPrefix(prefix)\r\n        , mShouldLog(shouldLog)\r\n    {\r\n    }\r\n\r\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\r\n        : mOutput(other.mOutput)\r\n    {\r\n    }\r\n\r\n    ~LogStreamConsumerBuffer()\r\n    {\r\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\r\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\r\n        // if the pointer to the beginning is not equal to the pointer to the current position,\r\n        // call putOutput() to log the output to the stream\r\n        if (pbase() != pptr())\r\n        {\r\n            putOutput();\r\n        }\r\n    }\r\n\r\n    // synchronizes the stream buffer and returns 0 on 
success\r\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\r\n    // resetting the buffer and flushing the stream\r\n    virtual int sync()\r\n    {\r\n        putOutput();\r\n        return 0;\r\n    }\r\n\r\n    void putOutput()\r\n    {\r\n        if (mShouldLog)\r\n        {\r\n            // prepend timestamp\r\n            std::time_t timestamp = std::time(nullptr);\r\n            tm* tm_local = std::localtime(&timestamp);\r\n            std::cout << \"[\";\r\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\r\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\r\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\r\n            // std::stringbuf::str() gets the string contents of the buffer\r\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\r\n            mOutput << mPrefix << str();\r\n            // set the buffer to empty\r\n            str(\"\");\r\n            // flush the stream\r\n            mOutput.flush();\r\n        }\r\n    }\r\n\r\n    void setShouldLog(bool shouldLog)\r\n    {\r\n        mShouldLog = shouldLog;\r\n    }\r\n\r\nprivate:\r\n    std::ostream& mOutput;\r\n    std::string mPrefix;\r\n    bool mShouldLog;\r\n};\r\n\r\n//!\r\n//! \\class LogStreamConsumerBase\r\n//! 
\\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\r\n//!\r\nclass LogStreamConsumerBase\r\n{\r\npublic:\r\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\r\n        : mBuffer(stream, prefix, shouldLog)\r\n    {\r\n    }\r\n\r\nprotected:\r\n    LogStreamConsumerBuffer mBuffer;\r\n};\r\n\r\n//!\r\n//! \\class LogStreamConsumer\r\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\r\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\r\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\r\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\r\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\r\n//!  Please do not change the order of the parent classes.\r\n//!\r\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\r\n{\r\npublic:\r\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\r\n    //!  
Reportable severity determines if the messages are severe enough to be logged.\r\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\r\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\r\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\r\n        , mShouldLog(severity <= reportableSeverity)\r\n        , mSeverity(severity)\r\n    {\r\n    }\r\n\r\n    LogStreamConsumer(LogStreamConsumer&& other)\r\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\r\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\r\n        , mShouldLog(other.mShouldLog)\r\n        , mSeverity(other.mSeverity)\r\n    {\r\n    }\r\n\r\n    void setReportableSeverity(Severity reportableSeverity)\r\n    {\r\n        mShouldLog = mSeverity <= reportableSeverity;\r\n        mBuffer.setShouldLog(mShouldLog);\r\n    }\r\n\r\nprivate:\r\n    static std::ostream& severityOstream(Severity severity)\r\n    {\r\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\r\n    }\r\n\r\n    static std::string severityPrefix(Severity severity)\r\n    {\r\n        switch (severity)\r\n        {\r\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\r\n        case Severity::kERROR: return \"[E] \";\r\n        case Severity::kWARNING: return \"[W] \";\r\n        case Severity::kINFO: return \"[I] \";\r\n        case Severity::kVERBOSE: return \"[V] \";\r\n        default: assert(0); return \"\";\r\n        }\r\n    }\r\n\r\n    bool mShouldLog;\r\n    Severity mSeverity;\r\n};\r\n\r\n//! \\class Logger\r\n//!\r\n//! \\brief Class which manages logging of TensorRT tools and samples\r\n//!\r\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\r\n//! and supports logging two types of messages:\r\n//!\r\n//! 
- Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\r\n//! - Test pass/fail messages\r\n//!\r\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\r\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\r\n//!\r\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\r\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\r\n//!\r\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\r\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\r\n//! library and messages coming from the sample.\r\n//!\r\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\r\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\r\n//! object.\r\n\r\nclass Logger : public nvinfer1::ILogger\r\n{\r\npublic:\r\n    Logger(Severity severity = Severity::kWARNING)\r\n        : mReportableSeverity(severity)\r\n    {\r\n    }\r\n\r\n    //!\r\n    //! \\enum TestResult\r\n    //! \\brief Represents the state of a given test\r\n    //!\r\n    enum class TestResult\r\n    {\r\n        kRUNNING, //!< The test is running\r\n        kPASSED,  //!< The test passed\r\n        kFAILED,  //!< The test failed\r\n        kWAIVED   //!< The test was waived\r\n    };\r\n\r\n    //!\r\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\r\n    //! \\return The nvinfer1::ILogger associated with this Logger\r\n    //!\r\n    //! 
TODO Once all samples are updated to use this method to register the logger with TensorRT,\r\n    //! we can eliminate the inheritance of Logger from ILogger\r\n    //!\r\n    nvinfer1::ILogger& getTRTLogger()\r\n    {\r\n        return *this;\r\n    }\r\n\r\n    //!\r\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\r\n    //!\r\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\r\n    //! inheritance from nvinfer1::ILogger\r\n    //!\r\n    void log(Severity severity, const char* msg) override\r\n    {\r\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\r\n    }\r\n\r\n    //!\r\n    //! \\brief Method for controlling the verbosity of logging output\r\n    //!\r\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\r\n    //!\r\n    void setReportableSeverity(Severity severity)\r\n    {\r\n        mReportableSeverity = severity;\r\n    }\r\n\r\n    //!\r\n    //! \\brief Opaque handle that holds logging information for a particular test\r\n    //!\r\n    //! This object is an opaque handle to information used by the Logger to print test results.\r\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\r\n    //! with Logger::reportTest{Start,End}().\r\n    //!\r\n    class TestAtom\r\n    {\r\n    public:\r\n        TestAtom(TestAtom&&) = default;\r\n\r\n    private:\r\n        friend class Logger;\r\n\r\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\r\n            : mStarted(started)\r\n            , mName(name)\r\n            , mCmdline(cmdline)\r\n        {\r\n        }\r\n\r\n        bool mStarted;\r\n        std::string mName;\r\n        std::string mCmdline;\r\n    };\r\n\r\n    //!\r\n    //! \\brief Define a test for logging\r\n    //!\r\n    //! 
\\param[in] name The name of the test.  This should be a string starting with\r\n    //!                  \"TensorRT\" and containing dot-separated strings containing\r\n    //!                  the characters [A-Za-z0-9_].\r\n    //!                  For example, \"TensorRT.sample_googlenet\"\r\n    //! \\param[in] cmdline The command line used to reproduce the test\r\n    //\r\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\r\n    //!\r\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\r\n    {\r\n        return TestAtom(false, name, cmdline);\r\n    }\r\n\r\n    //!\r\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\r\n    //!        as input\r\n    //!\r\n    //! \\param[in] name The name of the test\r\n    //! \\param[in] argc The number of command-line arguments\r\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\r\n    //!\r\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\r\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\r\n    {\r\n        auto cmdline = genCmdlineString(argc, argv);\r\n        return defineTest(name, cmdline);\r\n    }\r\n\r\n    //!\r\n    //! \\brief Report that a test has started.\r\n    //!\r\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\r\n    //!\r\n    //! \\param[in] testAtom The handle to the test that has started\r\n    //!\r\n    static void reportTestStart(TestAtom& testAtom)\r\n    {\r\n        reportTestResult(testAtom, TestResult::kRUNNING);\r\n        assert(!testAtom.mStarted);\r\n        testAtom.mStarted = true;\r\n    }\r\n\r\n    //!\r\n    //! \\brief Report that a test has ended.\r\n    //!\r\n    //! \\pre reportTestStart() has been called for the given testAtom\r\n    //!\r\n    //! 
\\param[in] testAtom The handle to the test that has ended\r\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\r\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\r\n    //!\r\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\r\n    {\r\n        assert(result != TestResult::kRUNNING);\r\n        assert(testAtom.mStarted);\r\n        reportTestResult(testAtom, result);\r\n    }\r\n\r\n    static int reportPass(const TestAtom& testAtom)\r\n    {\r\n        reportTestEnd(testAtom, TestResult::kPASSED);\r\n        return EXIT_SUCCESS;\r\n    }\r\n\r\n    static int reportFail(const TestAtom& testAtom)\r\n    {\r\n        reportTestEnd(testAtom, TestResult::kFAILED);\r\n        return EXIT_FAILURE;\r\n    }\r\n\r\n    static int reportWaive(const TestAtom& testAtom)\r\n    {\r\n        reportTestEnd(testAtom, TestResult::kWAIVED);\r\n        return EXIT_SUCCESS;\r\n    }\r\n\r\n    static int reportTest(const TestAtom& testAtom, bool pass)\r\n    {\r\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\r\n    }\r\n\r\n    Severity getReportableSeverity() const\r\n    {\r\n        return mReportableSeverity;\r\n    }\r\n\r\nprivate:\r\n    //!\r\n    //! \\brief returns an appropriate string for prefixing a log message with the given severity\r\n    //!\r\n    static const char* severityPrefix(Severity severity)\r\n    {\r\n        switch (severity)\r\n        {\r\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\r\n        case Severity::kERROR: return \"[E] \";\r\n        case Severity::kWARNING: return \"[W] \";\r\n        case Severity::kINFO: return \"[I] \";\r\n        case Severity::kVERBOSE: return \"[V] \";\r\n        default: assert(0); return \"\";\r\n        }\r\n    }\r\n\r\n    //!\r\n    //! 
\\brief returns an appropriate string for prefixing a test result message with the given result\r\n    //!\r\n    static const char* testResultString(TestResult result)\r\n    {\r\n        switch (result)\r\n        {\r\n        case TestResult::kRUNNING: return \"RUNNING\";\r\n        case TestResult::kPASSED: return \"PASSED\";\r\n        case TestResult::kFAILED: return \"FAILED\";\r\n        case TestResult::kWAIVED: return \"WAIVED\";\r\n        default: assert(0); return \"\";\r\n        }\r\n    }\r\n\r\n    //!\r\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\r\n    //!\r\n    static std::ostream& severityOstream(Severity severity)\r\n    {\r\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\r\n    }\r\n\r\n    //!\r\n    //! \\brief method that implements logging test results\r\n    //!\r\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\r\n    {\r\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\r\n                                         << testAtom.mCmdline << std::endl;\r\n    }\r\n\r\n    //!\r\n    //! \\brief generate a command line string from the given (argc, argv) values\r\n    //!\r\n    static std::string genCmdlineString(int argc, char const* const* argv)\r\n    {\r\n        std::stringstream ss;\r\n        for (int i = 0; i < argc; i++)\r\n        {\r\n            if (i > 0)\r\n                ss << \" \";\r\n            ss << argv[i];\r\n        }\r\n        return ss.str();\r\n    }\r\n\r\n    Severity mReportableSeverity;\r\n};\r\n\r\nnamespace\r\n{\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     
LOG_VERBOSE(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\r\n{\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\r\n}\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\r\n{\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\r\n}\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\r\n{\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\r\n}\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\r\n{\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\r\n}\r\n\r\n//!\r\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\r\n//         (\"fatal\" severity)\r\n//!\r\n//! Example usage:\r\n//!\r\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\r\n//!\r\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\r\n{\r\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\r\n}\r\n\r\n} // anonymous namespace\r\n\r\n#endif // TENSORRT_LOGGING_H\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/main.cpp",
    "content": "#include <iostream>\r\n\r\n\r\nusing namespace std;\r\n\r\n\r\n\r\n\r\n\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/myhpp.h",
    "content": "#ifndef MYHPP_H\r\n#define MYHPP_H\r\n\r\n#include <assert.h>\r\n#include <iostream>\r\n#include<vector>\r\n#include<map>\r\n#define _USE_MATH_DEFINES\r\n#include <math.h>\r\n#include <cmath>\r\n#include<string>\r\n#include<fstream>\r\n#include<streambuf>\r\n#include<ctime>\r\n#include<chrono>\r\n#include<iomanip>\r\n#include<cuda_runtime.h>\r\n#include<opencv2/core/core.hpp>\r\n#include<opencv2/imgproc/imgproc.hpp>\r\n#include<opencv2/imgcodecs/imgcodecs.hpp>\r\n#include<opencv2/dnn/dnn.hpp>\r\n//#include <opencv2/highgui/highgui.hpp>\r\n#include<stdio.h>\r\n#include<cuda.h>\r\n//#include <cudnn.h>\r\n#include <cublas_v2.h>\r\n#include<driver_types.h>\r\n#include<NvInfer.h>\r\n#include<NvInferPlugin.h>\r\n#include<NvOnnxParser.h>\r\n#include<NvOnnxConfig.h>\r\n#include<cstdint>\r\n\r\n\r\n\r\n#endif // MYHPP_H\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/trainsform.cpp",
    "content": "#include \"common.hpp\"\r\n#include \"logging.h\"\r\n#include <fstream>\r\n#include <iostream>\r\n#include <map>\r\n#include <sstream>\r\n#include <vector>\r\n#include <chrono>\r\n\r\n#define USE_FP32\r\n\r\nstatic Logger gLogger;\r\n\r\nconst char *INPUT_BLOB_NAME = \"data\";\r\nconst char *OUTPUT_BLOB_NAME = \"output\";\r\nstatic const int bs = 1;\r\nstatic const int channels = 96;\r\nstatic const int ch = 3;\r\nstatic const int INPUT_H = 576;\r\nstatic const int INPUT_W = 576;\r\nstatic const int NUM_CLASSES = 15;\r\nstatic const int outputSize = 576 * 576;\r\ncudaStream_t m_cudaStream;\r\nvector<void *> m_bindings;\r\nIExecutionContext *m_context;\r\n\r\nICudaEngine *createEngine(unsigned int maxBatchSize, IBuilder *builder, IBuilderConfig *config, DataType dt,std::string wtsPath)\r\n{\r\n    INetworkDefinition *network = builder->createNetworkV2(0U);\r\n    ITensor *data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{ch, INPUT_H, INPUT_W});\r\n    assert(data);\r\n    std::map<std::string, Weights> weightMap = loadWeights(wtsPath);\r\n    ITensor* conv1 = conv(network, weightMap, data, \"backbone.patch_embed.proj\", channels);\r\n    ITensor* shuffle1 = shuffle_reshapeApermute(network, conv1, Dims2{channels, -1}, Permutation{1, 0}, true);\r\n    ITensor *ln = m_layerNorm(network, weightMap, shuffle1, \"backbone.patch_embed.norm\");\r\n    debug_print(ln, \"ln\");\r\n    //layer0\r\n\r\n    ITensor *mask0 = trt_transform_imgMask(network, 147, 7, 3);\r\n    ITensor *blk00 = blk(network, weightMap, ln, mask0, \"backbone.layers.0.blocks.0\", INPUT_H / 4, channels, 3, 7, 0);\r\n    debug_print(blk00, \"blk00\");\r\n    ITensor *blk01 = blk(network, weightMap, blk00, mask0, \"backbone.layers.0.blocks.1\", INPUT_H / 4, channels, 3, 7, 3);\r\n    debug_print(blk01, \"blk01\");\r\n    ITensor* out0 = m_layerNorm(network, weightMap, blk01, \"backbone.norm0\");\r\n    out0 = shuffle_reshapeApermute(network, out0, Dims3{INPUT_H / 4, INPUT_H / 4, channels}, 
Permutation{2, 0, 1}, true);\r\n    ITensor *down_layer0 = downsample(network, weightMap, blk01, \"backbone.layers.0.downsample\", INPUT_H / 4);\r\n    debug_print(down_layer0, \"down_blk1\");\r\n    //layer1\r\n    ITensor *mask1 = trt_transform_imgMask(network, 77, 7, 3);\r\n    ITensor *blk10 = blk(network, weightMap, down_layer0, mask1, \"backbone.layers.1.blocks.0\", INPUT_H / 8, channels * 2, 6, 7, 0);\r\n    debug_print(blk10, \"blk10\");\r\n    ITensor *blk11 = blk(network, weightMap, blk10, mask1, \"backbone.layers.1.blocks.1\", INPUT_H / 8, channels * 2, 6, 7, 3);\r\n    debug_print(blk11, \"blk11\");\r\n    ITensor* out1 = m_layerNorm(network, weightMap, blk11, \"backbone.norm1\");\r\n    out1 = shuffle_reshapeApermute(network, out1, Dims3{INPUT_H / 8, INPUT_H / 8, channels * 2}, Permutation{2, 0, 1}, true);\r\n    ITensor *down_layer1 = downsample(network, weightMap, blk11, \"backbone.layers.1.downsample\", INPUT_H / 8);\r\n    debug_print(down_layer1, \"down_layer1\");\r\n    //layer2\r\n    ITensor *mask2 = trt_transform_imgMask(network, 42, 7, 3);\r\n    ITensor *blk20 = blk(network, weightMap, down_layer1, mask2, \"backbone.layers.2.blocks.0\", INPUT_H / 16, channels * 4, 12, 7, 0);\r\n    debug_print(blk20, \"blk20\");\r\n    ITensor *blk21 = blk(network, weightMap, blk20, mask2, \"backbone.layers.2.blocks.1\", INPUT_H / 16, channels * 4, 12, 7, 3);\r\n    debug_print(blk21, \"blk21\");\r\n    ITensor *blk22 = blk(network, weightMap, blk21, mask2, \"backbone.layers.2.blocks.2\", INPUT_H / 16,channels * 4, 12, 7, 0);\r\n    debug_print(blk22, \"blk22\");\r\n    ITensor *blk23 = blk(network, weightMap, blk22, mask2, \"backbone.layers.2.blocks.3\", INPUT_H / 16, channels * 4, 12, 7, 3);\r\n    debug_print(blk23, \"blk23\");\r\n    ITensor *blk24 = blk(network, weightMap, blk23, mask2, \"backbone.layers.2.blocks.4\", INPUT_H / 16, channels * 4, 12, 7, 0);\r\n    debug_print(blk24, \"blk24\");\r\n    ITensor *blk25 = blk(network, weightMap, blk24, 
mask2, \"backbone.layers.2.blocks.5\", INPUT_H / 16, channels * 4, 12, 7, 3);\r\n    debug_print(blk25, \"blk25\");\r\n    ITensor* out2 = m_layerNorm(network, weightMap, blk25, \"backbone.norm2\");\r\n    out2 = shuffle_reshapeApermute(network, out2, Dims3{INPUT_H / 16, INPUT_H / 16, channels * 4}, Permutation{2, 0, 1}, true);\r\n    ITensor *down_layer2 = downsample(network, weightMap, blk25, \"backbone.layers.2.downsample\", INPUT_H / 16);\r\n    debug_print(down_layer2, \"down_layer2\");\r\n    //layer3\r\n    ITensor *mask3 = trt_transform_imgMask(network, 21, 7, 3);\r\n    ITensor *blk30 = blk(network, weightMap, down_layer2, mask3, \"backbone.layers.3.blocks.0\", INPUT_H / 32, channels * 8, 24, 7, 0);\r\n    debug_print(blk30, \"blk30\");\r\n    ITensor *blk31 = blk(network, weightMap, blk30, mask3, \"backbone.layers.3.blocks.1\", INPUT_H / 32, channels * 8, 24, 7, 3);\r\n    debug_print(blk31, \"blk31\");\r\n    ITensor* out3 = m_layerNorm(network, weightMap, blk31, \"backbone.norm3\");\r\n    out3 = shuffle_reshapeApermute(network, out3, Dims3{INPUT_H / 32, INPUT_H / 32, channels * 8}, Permutation{2, 0, 1}, true);\r\n    ITensor* out[4] = {out0, out1, out2, out3};\r\n    out0 = transform_lateral_conv(network, weightMap, out0, \"decode_head.lateral_convs.0\");  // 512,INPUT_H/4,INPUT_H/4\r\n    out1 = transform_lateral_conv(network, weightMap, out1, \"decode_head.lateral_convs.1\");  // 512,INPUT_H/8,INPUT_H/8\r\n    out2 = transform_lateral_conv(network, weightMap, out2, \"decode_head.lateral_convs.2\");  // 512,INPUT_H/16,INPUT_H/16\r\n    auto psp_out_0 = transform_psp(network, weightMap, out3, \"decode_head.psp_modules.0.1\", 1);\r\n    auto psp_out_1 = transform_psp(network, weightMap, out3, \"decode_head.psp_modules.1.1\", 2);\r\n    auto psp_out_2 = transform_psp(network, weightMap, out3, \"decode_head.psp_modules.2.1\", 3);\r\n    auto psp_out_3 = transform_psp(network, weightMap, out3, \"decode_head.psp_modules.3.1\", 6);\r\n    ITensor* 
psp_outs[5] = {out3, psp_out_0, psp_out_1, psp_out_2, psp_out_3};\r\n    auto PSP_outs = network->addConcatenation(psp_outs, 5);\r\n    PSP_outs->setAxis(0);\r\n    debug_print(PSP_outs->getOutput(0), \"PSP_outs\");\r\n    out3 = transform_lateral_conv(network, weightMap, PSP_outs->getOutput(0), \"decode_head.bottleneck\", 3, 1, 512);  // 512,INPUT_H/32,INPUT_H/32\r\n    debug_print(out3, \"out3\");\r\n    auto laterals2 = up_Add(network, out3, out2);\r\n    auto laterals1 = up_Add(network, laterals2, out1);\r\n    auto laterals0 = up_Add(network, laterals1, out0);\r\n    auto fpn0 = transform_lateral_conv(network, weightMap, laterals0, \"decode_head.fpn_convs.0\", 3, 1, 512);\r\n    auto fpn1 = transform_lateral_conv(network, weightMap, laterals1, \"decode_head.fpn_convs.1\", 3, 1, 512);\r\n    auto fpn2 = transform_lateral_conv(network, weightMap, laterals2, \"decode_head.fpn_convs.2\", 3, 1, 512);\r\n    fpn1 = resize(network, fpn1,fpn0->getDimensions().d[1]);\r\n    fpn2 = resize(network, fpn2,fpn0->getDimensions().d[1]);\r\n    auto fpn3 = resize(network, out3, fpn0->getDimensions().d[1]);\r\n    ITensor* fpn_outs[4] = {fpn0, fpn1, fpn2, fpn3};\r\n    auto FPN_outs = network->addConcatenation(fpn_outs, 4);\r\n    FPN_outs->setAxis(0);\r\n    debug_print(FPN_outs->getOutput(0), \"FPN_outs\");\r\n    auto fpn_output = transform_lateral_conv(network, weightMap, FPN_outs->getOutput(0), \"decode_head.fpn_bottleneck\", 3, 1, 512);\r\n    debug_print(fpn_output, \"fpn_output\");\r\n    auto seg = network->addConvolutionNd(*fpn_output, NUM_CLASSES, Dims2{1, 1}, weightMap[\"decode_head.conv_seg.weight\"], weightMap[\"decode_head.conv_seg.bias\"]);\r\n    seg->setStrideNd(Dims2{1, 1});\r\n    debug_print(seg->getOutput(0), \"seg\");\r\n    auto seg_resize = resize(network, seg->getOutput(0), INPUT_H);\r\n    debug_print(seg_resize, \"seg_resize\");\r\n    auto output = network->addTopK(*seg_resize, TopKOperation::kMAX, 1, 0X01)->getOutput(1);\r\n    debug_print(output, 
\"output\");\r\n\r\n    std::cout << \"set name out\" << std::endl;\r\n    output->setName(OUTPUT_BLOB_NAME);\r\n    network->markOutput(*output);\r\n    builder->setMaxBatchSize(12);\r\n    config->setMaxWorkspaceSize((1 << 30)); // 1G\r\n#ifdef USE_FP16\r\n    std::cout<< \"use fp16\"<<std::endl;\r\n    config->setFlag(BuilderFlag::kFP16);\r\n#endif\r\n    ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);\r\n    std::cout << \"build success!\" << std::endl;\r\n    network->destroy();\r\n\r\n    return engine;\r\n}\r\n\r\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory **modelStream,std::string wtsPath)\r\n{\r\n    IBuilder *builder = createInferBuilder(gLogger);\r\n    IBuilderConfig *config = builder->createBuilderConfig();\r\n    ICudaEngine *engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT, wtsPath);\r\n    assert(engine != nullptr);\r\n    (*modelStream) = engine->serialize();\r\n    engine->destroy();\r\n    builder->destroy();\r\n}\r\n\r\nvoid createEng(std::string wtsPath, std::string engine_name)\r\n{\r\n    char *trtModelStream{nullptr};\r\n    size_t size{0};\r\n\r\n    IHostMemory *modelStream{nullptr};\r\n    APIToModel(bs, &modelStream, wtsPath);\r\n    assert(modelStream != nullptr);\r\n    std::ofstream p(engine_name, std::ios::binary);\r\n    if (!p)\r\n    {\r\n        std::cerr << \"could not open plan output file\" << std::endl;\r\n        return;\r\n    }\r\n    p.write(reinterpret_cast<const char *>(modelStream->data()), modelStream->size());\r\n    modelStream->destroy();\r\n    std::ifstream file(engine_name, std::ios::binary);\r\n    if (file.good())\r\n    {\r\n        file.seekg(0, file.end);\r\n        size = file.tellg();\r\n        file.seekg(0, file.beg);\r\n        trtModelStream = new char[size];\r\n        assert(trtModelStream);\r\n        file.read(trtModelStream, size);\r\n        file.close();\r\n    }\r\n}\r\n\r\nvoid inference_init(string ENGPath,ICudaEngine 
*m_engine)\r\n{\r\n    ifstream cache(ENGPath, ios::binary);\r\n    cache.seekg(0, ios::end);\r\n    const int engSize = cache.tellg();\r\n    cache.seekg(0, ios::beg);\r\n    void *modelMem = malloc(engSize);\r\n    cache.read((char*)modelMem, engSize);\r\n    cache.close();\r\n    IRuntime *runtime = nvinfer1::createInferRuntime(gLogger);\r\n    m_engine = runtime->deserializeCudaEngine(modelMem, engSize);\r\n    runtime->destroy();\r\n    free(modelMem);\r\n    if (!m_engine) {\r\n        cout << \"deserialize eng error!\" << endl;\r\n        return;\r\n    }\r\n    m_context = m_engine->createExecutionContext();\r\n    if (cudaStreamCreate(&m_cudaStream) != 0) return;\r\n    int bindings = m_engine->getNbBindings();\r\n    if (bindings < 2)\r\n    {\r\n        cout << \"Error! the network have one input and one output at least!\" << endl;\r\n        return;\r\n    }\r\n    cout << \"1111111111111\" << endl;\r\n    m_bindings.resize(bindings, nullptr);\r\n    CHECK(cudaMalloc(&m_bindings.at(0), bs * ch * INPUT_H * INPUT_W * sizeof(float)));\r\n    CHECK(cudaMalloc(&m_bindings.at(1), bs * outputSize * 4));\r\n}\r\n\r\nvoid doInference(const float *input, int *output)\r\n{\r\n    cout << \"do infer:\" << endl;\r\n    CHECK(cudaMemcpyAsync(m_bindings.at(0), input, bs * ch * INPUT_H * INPUT_W * sizeof(float),\r\n                          cudaMemcpyHostToDevice, m_cudaStream));\r\n\r\n    m_context->enqueue(bs, m_bindings.data(), m_cudaStream, nullptr);\r\n\r\n    CHECK(cudaMemcpyAsync(output, m_bindings.at(1), bs * outputSize * 4,\r\n                          cudaMemcpyDeviceToHost, m_cudaStream));\r\n\r\n    cudaStreamSynchronize(m_cudaStream);\r\n}\r\n\r\n\r\nint main(int argc, char** argv)\r\n{\r\n    cout << \"begin\" << endl;\r\n    //string wts = \"G:/shaj/trainsform/ktn5n6_29.511.21.8.wts\";\r\n    //string eng = \"G:/shaj/trainsform/trainsform.eng\";\r\n    std::string argv1 = argv[1];\r\n    if (argv1 == \"-s\") {\r\n        string wts = argv[2];\r\n        
string eng = argv[3];\r\n        createEng(wts,eng);\r\n    } else {\r\n        string eng = argv[2];\r\n\r\n        ICudaEngine *m_engine = nullptr;\r\n\r\n        inference_init(eng,m_engine);\r\n\r\n        vector<cv::Mat> testVal;\r\n        map<string,cv::Mat> dataProb;\r\n        vector<string> imgs;\r\n        cv::Mat img;\r\n        //string pattern_dir = \"G:/shaj/trainsform\";\r\n        string pattern_dir = argv[3];\r\n        string pattern = pattern_dir+ \"/*.bmp\";\r\n        vector<cv::String> images_names;\r\n        cv::glob(pattern, images_names, false);\r\n        int i = 0;\r\n        cv::Scalar Mean = cv::Scalar(123.675, 116.28, 103.53);\r\n        cv::Scalar Std = cv::Scalar(58.395, 57.12, 57.375);\r\n        cv::Size size = {INPUT_H,INPUT_W};\r\n\r\n        for (auto image_name: images_names)\r\n        {\r\n            // Load at most bs images so the batch buffers are not overrun\r\n            if (i < bs)\r\n            {\r\n                cv::Mat Img = cv::imread(image_name, 1);\r\n\r\n                testVal.push_back(Img);\r\n                cout << image_name << endl;\r\n                imgs.push_back(image_name);\r\n                i++;\r\n            }\r\n        }\r\n        float *data = new float[bs * ch * INPUT_H * INPUT_W];\r\n        int *output = new int[bs * outputSize];\r\n\r\n        cv::Mat Transed_t = BlobFromImages(testVal, cv::Size{INPUT_H, INPUT_W}, Mean, Std, true, false);\r\n\r\n        memcpy(data, Transed_t.data, bs * ch * INPUT_H * INPUT_W * sizeof(float));\r\n\r\n        //for(int i = 0 ; i< 20; i++){\r\n\r\n        auto start_time = std::chrono::system_clock::now();\r\n        doInference(data, output);\r\n\r\n        auto end_time = std::chrono::system_clock::now();\r\n        float duration;\r\n        duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();\r\n        cout << \"time:\" << duration << endl;\r\n        //}\r\n        //    for(int i = 0; i < 100; i++)\r\n        //        cout<<i<<\":\"<<output[i]<<endl;\r\n\r\n        int n = 0;\r\n        int *out = new 
int[outputSize];\r\n        string outPath = pattern_dir + \"/output\";\r\n        for (int i = 0; i < testVal.size(); i++)\r\n        {\r\n            cv::Mat img = cv::imread(imgs[i], 1);\r\n            cv::Mat dst;\r\n            // cv::Size takes (width, height)\r\n            cv::resize(img,dst,cv::Size{INPUT_W, INPUT_H});\r\n            //string outPath_n = outPath + \"/\"+to_string(n) + \".jpg\";\r\n            n += 1;\r\n            out = output + i * outputSize;\r\n            for (int i = 0; i < outputSize; i++)\r\n            {\r\n                if (out[i] != 0)\r\n                {\r\n                    // Row-major layout: column = index % width, row = index / width\r\n                    int w = i % (INPUT_W);\r\n                    int h = i / (INPUT_W);\r\n                    dst.at<cv::Vec3b>(h, w)[0] =  out[i] * 10;\r\n                    dst.at<cv::Vec3b>(h, w)[1] =  out[i] * 30;\r\n                    dst.at<cv::Vec3b>(h, w)[2] =  out[i] * 40;\r\n                }\r\n            }\r\n            //cout<<outPath_n<<endl;\r\n            string outPath_result = imgs[i].replace(0, pattern_dir.size(), outPath);\r\n            cout << outPath_result << endl;\r\n            cv::imwrite(outPath_result, dst);\r\n        }\r\n        testVal.clear();\r\n        imgs.clear();\r\n    }\r\n\r\n    m_context->destroy();\r\n    m_engine->destroy();\r\n    for (auto bindings: m_bindings) {\r\n        cudaFree(bindings);\r\n    }\r\n    // A CUDA stream is released with cudaStreamDestroy, not cudaFree\r\n    cudaStreamDestroy(m_cudaStream);\r\n\r\n    cout << \"swin_transform\" << endl;\r\n    return 0;\r\n}\r\n\r\n"
  },
  {
    "path": "swin-transformer/semantic-segmentation/utilsn.h",
    "content": "#ifndef UTILSN_H\r\n#define UTILSN_H\r\n\r\n#include <iostream>\r\n#include <vector>\r\n#include <algorithm>\r\n#include <cudnn.h>\r\n#include <NvInfer.h>\r\n#include \"myhpp.h\"\r\nusing namespace std;\r\n\r\n#ifndef CUDA_CHECK\r\n\r\n#define CUDA_CHECK(callstr)                                                                    \\\r\n    {                                                                                          \\\r\n        cudaError_t error_code = callstr;                                                      \\\r\n        if (error_code != cudaSuccess) {                                                       \\\r\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\r\n            assert(0);                                                                         \\\r\n        }                                                                                      \\\r\n    }\r\n\r\n#endif\r\n\r\nnamespace Tn\r\n{\r\n    class Profiler : public nvinfer1::IProfiler\r\n    {\r\n    public:\r\n        void printLayerTimes(int itrationsTimes)\r\n        {\r\n            float totalTime = 0;\r\n            for (size_t i = 0; i < mProfile.size(); i++)\r\n            {\r\n                printf(\"%-40.40s %4.3fms\\n\", mProfile[i].first.c_str(), mProfile[i].second / itrationsTimes);\r\n                totalTime += mProfile[i].second;\r\n            }\r\n            printf(\"Time over all layers: %4.3f\\n\", totalTime / itrationsTimes);\r\n        }\r\n    private:\r\n        typedef std::pair<std::string, float> Record;\r\n        std::vector<Record> mProfile;\r\n\r\n        virtual void reportLayerTime(const char* layerName, float ms)\r\n        {\r\n            auto record = std::find_if(mProfile.begin(), mProfile.end(), [&](const Record& r){ return r.first == layerName; });\r\n            if (record == mProfile.end())\r\n               { mProfile.push_back(std::make_pair(layerName, 
ms));}\r\n            else\r\n                record->second += ms;\r\n        }\r\n    };\r\n\r\n    template<typename T>\r\n    void write(char*& buffer, const T& val)\r\n    {\r\n        *reinterpret_cast<T*>(buffer) = val;\r\n        buffer += sizeof(T);\r\n    }\r\n\r\n    template<typename T>\r\n    void read(const char*& buffer, T& val)\r\n    {\r\n        val = *reinterpret_cast<const T*>(buffer);\r\n        buffer += sizeof(T);\r\n    }\r\n\r\n//    void* copyToDevice(const void* data, size_t count)\r\n//    {\r\n//        void* deviceData;\r\n//        cudaMalloc(&deviceData, count);\r\n//        cudaMemcpy(deviceData, data, count, cudaMemcpyHostToDevice);\r\n//        return deviceData;\r\n//    }\r\n//    void deserializeToDevice(const char*& hostBuffer, void*& deviceWeights, size_t size)\r\n//    {\r\n//        deviceWeights = copyToDevice(hostBuffer, size);\r\n//        hostBuffer += size;\r\n//    }\r\n//    size_t type2size(nvinfer1::DataType type) { return sizeof(float); }\r\n//    void convertAndCopyToBuffer(char*& buffer, const nvinfer1::Weights& weights)\r\n//    {\r\n//        memcpy(buffer, weights.values, weights.count * type2size(weights.type));\r\n//        buffer += weights.count * type2size(weights.type);\r\n//    }\r\n\r\n\r\n}\r\n\r\n#endif // UTILSN_H\r\n"
  },
  {
    "path": "tsm/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(TSM)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt; adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n\n# tensorrt\ninclude_directories(/home/ubuntu/TensorRT/include/)\nlink_directories(/home/ubuntu/TensorRT/lib/)\n\nadd_executable(tsm_r50 ${PROJECT_SOURCE_DIR}/tsm_r50.cpp)\ntarget_link_libraries(tsm_r50 nvinfer)\ntarget_link_libraries(tsm_r50 cudart)\n\nadd_definitions(-O2 -pthread)\n"
  },
  {
    "path": "tsm/README.md",
    "content": "# Temporal Shift Module\n\nTSM-R50 from \"TSM: Temporal Shift Module for Efficient Video Understanding\" <https://arxiv.org/abs/1811.08383>\n\nTSM is a widely used Action Recognition model. This TensorRT implementation is tested with TensorRT 5.1 and TensorRT 7.2.\n\nFor the PyTorch implementation, you can refer to [open-mmlab/mmaction2](https://github.com/open-mmlab/mmaction2) or [mit-han-lab/temporal-shift-module](https://github.com/mit-han-lab/temporal-shift-module).\n\nMore details about the shift module (which is the core of TSM) can be found in [test_shift.py](./test_shift.py).\n\n## Tutorial\n\n+ For an example, refer to [demo.sh](./demo.sh)\n  + Requirements: Successfully installed `torch>=1.3.0, torchvision`\n\n+ Step 1: Train/Download TSM-R50 checkpoints from the [official GitHub repo](https://github.com/mit-han-lab/temporal-shift-module) or [MMAction2](https://github.com/open-mmlab/mmaction2)\n  + Supported settings: `num_segments`, `shift_div`, `num_classes`.\n  + Fixed settings: `backbone`(ResNet50), `shift_place`(blockres), `temporal_pool`(False).\n\n+ Step 2: Convert PyTorch checkpoints to TensorRT weights.\n\n```shell\npython gen_wts.py /path/to/pytorch.pth --out-filename /path/to/tensorrt.wts\n```\n\n+ Step 3: Test Python API.\n  + Modify configs in `tsm_r50.py`.\n  + Inference with `tsm_r50.py`.\n\n```python\n# Supported settings\nBATCH_SIZE = 1\nNUM_SEGMENTS = 8\nINPUT_H = 224\nINPUT_W = 224\nOUTPUT_SIZE = 400\nSHIFT_DIV = 8\n```\n\n```shell\nusage: tsm_r50.py [-h] [--tensorrt-weights TENSORRT_WEIGHTS] [--input-video INPUT_VIDEO] [--save-engine-path SAVE_ENGINE_PATH] [--load-engine-path LOAD_ENGINE_PATH] [--test-mmaction2] [--mmaction2-config MMACTION2_CONFIG] [--mmaction2-checkpoint MMACTION2_CHECKPOINT] [--test-cpp] [--cpp-result-path CPP_RESULT_PATH]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --tensorrt-weights TENSORRT_WEIGHTS\n                        Path to TensorRT weights, which is generated by 
gen_weights.py\n  --input-video INPUT_VIDEO\n                        Path to local video file\n  --save-engine-path SAVE_ENGINE_PATH\n                        Save engine to local file\n  --load-engine-path LOAD_ENGINE_PATH\n                        Saved engine file path\n  --test-mmaction2      Compare TensorRT results with MMAction2 Results\n  --mmaction2-config MMACTION2_CONFIG\n                        Path to MMAction2 config file\n  --mmaction2-checkpoint MMACTION2_CHECKPOINT\n                        Path to MMAction2 checkpoint url or file path\n  --test-cpp            Compare Python API results with C++ API results\n  --cpp-result-path CPP_RESULT_PATH\n                        Path to C++ API results\n```\n\n+ Step 4: Test C++ API.\n  + Modify configs in `tsm_r50.cpp`.\n  + Build from source code: `mkdir build && cd build && cmake .. && make`\n  + Generate the engine file: `./tsm_r50 -s`\n  + Run inference with the generated engine file and write predictions to a local file: `./tsm_r50 -d`\n  + Compare results with Python API: `python tsm_r50.py --tensorrt-weights /path/to/tensorrt.wts --test-cpp --cpp-result-path /path/to/cpp-result.txt`\n\n## TODO\n\n+ [x] Python Shift module.\n+ [x] Generate wts of official tsm and mmaction2 tsm.\n+ [x] Python API Definition\n+ [x] Test with mmaction2 demo\n+ [x] Tutorial\n+ [x] C++ API Definition\n"
  },
  {
    "path": "tsm/demo.sh",
    "content": "# Step 1: Get checkpoints from mmaction2\n# https://github.com/open-mmlab/mmaction2/tree/master/configs/recognition/tsm\nwget https://download.openmmlab.com/mmaction/recognition/tsm/tsm_r50_1x1x8_50e_kinetics400_rgb/tsm_r50_1x1x8_50e_kinetics400_rgb_20200607-af7fb746.pth\n\n# Step 2: Convert pytorch checkpoints to TensorRT weights\npython gen_wts.py tsm_r50_1x1x8_50e_kinetics400_rgb_20200607-af7fb746.pth --out-filename ./tsm_r50_kinetics400_mmaction2.wts\n\n# Step 3: Test Python API.\n# 3.1 Skip this step since we use default settings.\n# 3.2 Inference\n# 3.2.1 Save local engine file to `./tsm_r50_kinetics400_mmaction2.trt`.\npython tsm_r50.py \\\n    --tensorrt-weights ./tsm_r50_kinetics400_mmaction2.wts \\\n    --save-engine-path ./tsm_r50_kinetics400_mmaction2.trt\n\n# 3.2.2 Predict the recognition result using a single video `demo.mp4`.\n#       Should print `Result class id 6`, aka `arm wrestling`\n# Download demo video\nwget https://raw.githubusercontent.com/open-mmlab/mmaction2/master/demo/demo.mp4\n# # use *.wts as input\n# python tsm_r50.py --tensorrt-weights ./tsm_r50_kinetics400_mmaction2.wts \\\n#     --input-video ./demo.mp4\n# use engine file as input\npython tsm_r50.py --load-engine-path ./tsm_r50_kinetics400_mmaction2.trt \\\n    --input-video ./demo.mp4\n\n# 3.2.3 Optional: Compare inference result with MMAction2 TSM-R50 model\n#       Have to install MMAction2 First, please refer to https://github.com/open-mmlab/mmaction2/blob/master/docs/install.md\n# pip3 install pytest-runner\n# pip3 install mmcv\n# pip3 install mmaction2\n# # use *.wts as input\n# python tsm_r50.py \\\n#     --tensorrt-weights ./tsm_r50_kinetics400_mmaction2.wts \\\n#     --test-mmaction2 \\\n#     --mmaction2-config mmaction2_tsm_r50_config.py \\\n#     --mmaction2-checkpoint tsm_r50_1x1x8_50e_kinetics400_rgb_20200607-af7fb746.pth\n# # use TensorRT engine as input\n# python tsm_r50.py \\\n#     --load-engine-path ./tsm_r50_kinetics400_mmaction2.trt \\\n#     
--test-mmaction2 \\\n#     --mmaction2-config mmaction2_tsm_r50_config.py \\\n#     --mmaction2-checkpoint tsm_r50_1x1x8_50e_kinetics400_rgb_20200607-af7fb746.pth\n\n# Step 4: Test C++ API.\n# 4.1 Skip this step since we use default settings.\n# 4.2 Build CPP\nmkdir build && cd build && cmake .. && make\n# 4.3 Generate Engine file\n./tsm_r50 -s\n# 4.4 Get Predictions\n./tsm_r50 -d\n# 4.5 Compare C++ Results with Python Results\ncd ..\npython tsm_r50.py --test-cpp --tensorrt-weights ./tsm_r50_kinetics400_mmaction2.wts\n"
  },
  {
    "path": "tsm/gen_wts.py",
    "content": "import argparse\nimport struct\n\nimport torch\nimport numpy as np\n\n\ndef write_one_weight(writer, name, weight):\n    assert isinstance(weight, np.ndarray)\n    values = weight.reshape(-1)\n    writer.write('{} {}'.format(name, len(values)))\n    for value in values:\n        writer.write(' ')\n        # float to bytes to hex_string\n        writer.write(struct.pack('>f', float(value)).hex())\n    writer.write('\\n')\n\n\ndef convert_name(name):\n    return name.replace(\"module.\", \"\").replace(\"base_model.\", \"\").\\\n        replace(\"net.\", \"\").replace(\"new_fc\", \"fc\").replace(\"backbone.\", \"\").\\\n        replace(\"cls_head.fc_cls\", \"fc\").replace(\".conv.\", \".\").\\\n        replace(\"conv1.bn\", \"bn1\").replace(\"conv2.bn\", \"bn2\").\\\n        replace(\"conv3.bn\", \"bn3\").replace(\"downsample.bn\", \"downsample.1\").\\\n        replace(\"downsample.weight\", \"downsample.0.weight\")\n\n\ndef main(args):\n    ckpt = torch.load(args.checkpoint)['state_dict']\n    ckpt = {k: v for k, v in ckpt.items() if 'num_batches_tracked' not in k}\n    with open(args.out_filename, \"w\") as f:\n        f.write(f\"{len(ckpt)}\\n\")\n        for k, v in ckpt.items():\n            key = convert_name(k)\n            write_one_weight(f, key, v.cpu().numpy())\n\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"checkpoint\", type=str, help=\"Path to checkpoint file\")\n    parser.add_argument(\"--out-filename\",\n                        type=str,\n                        default=\"tsm_r50.wts\",\n                        help=\"Path to converted weights file\")\n    args = parser.parse_args()\n    main(args)\n"
  },
  {
    "path": "tsm/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "tsm/mmaction2_tsm_r50_config.py",
    "content": "# model settings\nmodel = dict(\n    type='Recognizer2D',\n    backbone=dict(\n        type='ResNetTSM',\n        pretrained='torchvision://resnet50',\n        depth=50,\n        norm_eval=False,\n        shift_div=8),\n    cls_head=dict(\n        type='TSMHead',\n        num_classes=400,\n        in_channels=2048,\n        spatial_type='avg',\n        consensus=dict(type='AvgConsensus', dim=1),\n        dropout_ratio=0.5,\n        init_std=0.001,\n        is_shift=True),\n    # model training and testing settings\n    train_cfg=None,\n    test_cfg=dict(average_clips='prob'))\n"
  },
  {
    "path": "tsm/test_shift.py",
    "content": "import numpy as np\nimport pycuda.autoinit  # noqa\nimport pycuda.driver as cuda\nimport tensorrt as trt\nimport torch\nfrom numpy.testing import assert_array_almost_equal\n\nINPUT_BLOB_NAME = 'input'\nOUTPUT_BLOB_NAME = 'output'\n\n\ndef shift_mit(x, num_segments, shift_div=8):\n    \"\"\"Official temporal shift module.\n    \n    Code Reference: https://github.com/mit-han-lab/temporal-shift-module/blob/master/ops/temporal_shift.py # noqa\n    Cannot convert to ONNX Model.\n    \"\"\"\n    nt, c, h, w = x.size()\n    n_batch = nt // num_segments\n    x = x.view(n_batch, num_segments, c, h, w)\n\n    fold = c // shift_div\n\n    out = torch.zeros_like(x)\n    out[:, :-1, :fold] = x[:, 1:, :fold]  # shift left\n    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift right\n    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]  # not shift\n\n    return out.view(nt, c, h, w)\n\n\ndef shift_mmaction2(x, num_segments, shift_div=8):\n    \"\"\"MMAction2 temporal shift module.\n    \n    Code Reference: https://github.com/open-mmlab/mmaction2/blob/master/mmaction/models/backbones/resnet_tsm.py # noqa\n    Could convert to ONNX Model.\n    \"\"\"\n    # [N, C, H, W]\n    n, c, h, w = x.size()\n\n    # [N // num_segments, num_segments, C, H*W]\n    # can't use 5 dimensional array on PPL2D backend for caffe\n    x = x.view(-1, num_segments, c, h * w)\n\n    # get shift fold\n    fold = c // shift_div\n\n    # split c channel into three parts:\n    # left_split, mid_split, right_split\n    left_split = x[:, :, :fold, :]\n    mid_split = x[:, :, fold:2 * fold, :]\n    right_split = x[:, :, 2 * fold:, :]\n\n    # can't use torch.zeros(*A.shape) or torch.zeros_like(A)\n    # because array on caffe inference must be got by computing\n\n    # shift left on num_segments channel in `left_split`\n    zeros = left_split - left_split\n    blank = zeros[:, :1, :, :]\n    left_split = left_split[:, 1:, :, :]\n    left_split = torch.cat((left_split, blank), 1)\n\n    # 
shift right on num_segments channel in `mid_split`\n    zeros = mid_split - mid_split\n    blank = zeros[:, :1, :, :]\n    mid_split = mid_split[:, :-1, :, :]\n    mid_split = torch.cat((blank, mid_split), 1)\n\n    # right_split: no shift\n\n    # concatenate\n    out = torch.cat((left_split, mid_split, right_split), 2)\n\n    # [N, C, H, W]\n    # restore the original dimension\n    return out.view(n, c, h, w)\n\n\ndef _tensorrt_shift_module(network,\n                           input,\n                           num_segments=8,\n                           shift_div=8,\n                           input_shape=(16, 64, 32, 32)):\n    \"\"\"Temporal shift module implemented by TensorRT Network Definition API.\"\"\"\n    fold = input_shape[1] // shift_div\n    batch_size = input_shape[0] // num_segments\n\n    # reshape\n    reshape = network.add_shuffle(input)\n    assert reshape\n    reshape.reshape_dims = (batch_size, num_segments) + tuple(input_shape[-3:])\n\n    # left\n    left_split = network.add_slice(reshape.get_output(0),\n                                   start=(0, 1, 0, 0, 0),\n                                   shape=(batch_size, num_segments - 1, fold,\n                                          input_shape[2], input_shape[3]),\n                                   stride=(1, 1, 1, 1, 1))\n    assert left_split\n    left_split_shape = (batch_size, 1, fold, input_shape[2], input_shape[3])\n    left_blank = network.add_constant(shape=left_split_shape,\n                                      weights=np.zeros(left_split_shape,\n                                                       np.float32))\n    assert left_blank\n    left = network.add_concatenation(\n        [left_split.get_output(0),\n         left_blank.get_output(0)])\n    assert left\n    left.axis = 1\n\n    # mid\n    mid_split_shape = (batch_size, 1, fold, input_shape[2], input_shape[3])\n    mid_blank = network.add_constant(shape=mid_split_shape,\n                                     
weights=np.zeros(mid_split_shape,\n                                                      np.float32))\n    assert mid_blank\n    mid_split = network.add_slice(reshape.get_output(0),\n                                  start=(0, 0, fold, 0, 0),\n                                  shape=(batch_size, num_segments - 1, fold,\n                                         input_shape[2], input_shape[3]),\n                                  stride=(1, 1, 1, 1, 1))\n    assert mid_split\n    mid = network.add_concatenation(\n        [mid_blank.get_output(0),\n         mid_split.get_output(0)])\n    assert mid\n    mid.axis = 1\n\n    # right\n    right = network.add_slice(reshape.get_output(0),\n                              start=(0, 0, 2 * fold, 0, 0),\n                              shape=(batch_size, num_segments,\n                                     input_shape[1] - 2 * fold, input_shape[2],\n                                     input_shape[3]),\n                              stride=(1, 1, 1, 1, 1))\n\n    # concat\n    concat = network.add_concatenation(\n        [left.get_output(0),\n         mid.get_output(0),\n         right.get_output(0)])\n    assert concat\n    concat.axis = 2\n\n    # reshape\n    reshape2 = network.add_shuffle(concat.get_output(0))\n    assert reshape2\n    reshape2.reshape_dims = input_shape\n    return reshape2\n\n\ndef shift_tensorrt(x, num_segments, shift_div, input_shape):\n    \"\"\"Test TensorRT temporal shift module.\"\"\"\n    assert isinstance(x, np.ndarray)\n\n    gLogger = trt.Logger(trt.Logger.INFO)\n    builder = trt.Builder(gLogger)\n    config = builder.create_builder_config()\n\n    # create engine\n    explicit_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)\n    network = builder.create_network(explicit_flag)\n    input = network.add_input(INPUT_BLOB_NAME, trt.float32, input_shape)\n    assert input\n    output = _tensorrt_shift_module(network,\n                                    input,\n                         
           num_segments=num_segments,\n                                    shift_div=shift_div,\n                                    input_shape=input_shape)\n    assert output\n\n    # generate engine by builder/network/config\n    output.get_output(0).name = OUTPUT_BLOB_NAME\n    network.mark_output(output.get_output(0))\n    builder.max_batch_size = 1\n    builder.max_workspace_size = 1 << 20\n    engine = builder.build_engine(network, config)\n    del network\n    assert engine.num_bindings == 2, f'{engine.num_bindings}'\n    context = engine.create_execution_context()\n\n    # buffer\n    host_in = cuda.pagelocked_empty(trt.volume(input_shape), dtype=np.float32)\n    np.copyto(host_in, x.ravel())\n    host_out = cuda.pagelocked_empty(trt.volume(input_shape), dtype=np.float32)\n    device_in = cuda.mem_alloc(host_in.nbytes)\n    device_out = cuda.mem_alloc(host_out.nbytes)\n    bindings = [int(device_in), int(device_out)]\n    stream = cuda.Stream()\n\n    # do inference\n    cuda.memcpy_htod_async(device_in, host_in, stream)\n    context.execute_async(bindings=bindings, stream_handle=stream.handle)\n    cuda.memcpy_dtoh_async(host_out, device_out, stream)\n    stream.synchronize()\n\n    return np.array(host_out.reshape(*input_shape))\n\n\nif __name__ == '__main__':\n    INPUT_SHAPE = (16, 64, 32, 32)\n    assert len(INPUT_SHAPE) == 4\n    NUM_SEGMENTS = 8\n    SHIFT_DIV = 8\n\n    # inference\n    inputs = np.random.rand(*INPUT_SHAPE).astype(np.float32)\n    inputs_pytorch = torch.tensor(inputs)\n    with torch.no_grad():\n        rmit = shift_mit(inputs_pytorch, NUM_SEGMENTS, SHIFT_DIV).numpy()\n        rmmaction2 = shift_mmaction2(inputs_pytorch, NUM_SEGMENTS,\n                                     SHIFT_DIV).numpy()\n    rtensorrt = shift_tensorrt(inputs, NUM_SEGMENTS, SHIFT_DIV, INPUT_SHAPE)\n\n    # test results\n    assert_array_almost_equal(rmit, rtensorrt)\n    assert_array_almost_equal(rmmaction2, rtensorrt)\n    print(\"Tests PASSED\")\n"
  },
  {
    "path": "tsm/tsm_r50.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <cmath>\n#include <cstring>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 400;\nstatic const int NUM_SEGMENTS = 8;\nstatic const int SHIFT_DIV = 8;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nconst char* WEIGHTS_PATH = \"../tsm_r50_kinetics400_mmaction2.wts\";\nconst char* ENGINE_PATH = \"./tsm_r50_kinetics400_mmaction2_cpp.trt\";\nconst char* RESULT_PATH = \"./result.txt\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; 
++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nvoid print(const char* name, ITensor* tensor) {\n    Dims dim = tensor->getDimensions();\n    std::cout << name << \" \" << dim.d[0] << \" \" << dim.d[1] << \" \" << dim.d[2] << \" \" << dim.d[3] <<std::endl;\n}\n\nIConcatenationLayer* addShift(INetworkDefinition *network, ITensor& input, Dims4 inputShape, int numSegments, int shiftDiv) {\n    int fold = int(inputShape.d[1] / shiftDiv);\n    float* zeros = reinterpret_cast<float*>(malloc(sizeof(float) * fold*inputShape.d[2]*inputShape.d[3]));\n    memset(zeros, 0, sizeof(float) * fold*inputShape.d[2]*inputShape.d[3]);\n    Weights zeros_weights{DataType::kFLOAT, zeros, fold*inputShape.d[2]*inputShape.d[3]};\n\n    // left\n    ISliceLayer* left1 = network->addSlice(input, Dims4{1, 0, 0, 0}, Dims4{numSegments - 1, fold, inputShape.d[2], inputShape.d[3]}, Dims4{1, 1, 1, 1});\n    IConstantLayer* left2 = network->addConstant(Dims4{1, fold, inputShape.d[2], inputShape.d[3]}, zeros_weights);\n    ITensor* tensorsLeft[] = {left1->getOutput(0), left2->getOutput(0)};\n    IConcatenationLayer* left = network->addConcatenation(tensorsLeft, 2);\n    left->setAxis(0);\n\n    // mid\n    IConstantLayer* mid1 = network->addConstant(Dims4{1, fold, inputShape.d[2], inputShape.d[3]}, zeros_weights);\n    ISliceLayer* mid2 = network->addSlice(input, Dims4{0, fold, 0, 0}, Dims4{numSegments - 1, fold, inputShape.d[2], inputShape.d[3]}, Dims4{1, 1, 1, 1});\n    ITensor* tensorsMid[] = {mid1->getOutput(0), mid2->getOutput(0)};\n    IConcatenationLayer* mid = network->addConcatenation(tensorsMid, 2);\n    mid->setAxis(0);\n\n    // right\n    ISliceLayer* right = network->addSlice(input, Dims4{0, 2 * fold, 0, 0}, Dims4{numSegments, inputShape.d[1] - 2 * fold, inputShape.d[2], inputShape.d[3]}, Dims4{1, 1, 1, 1});\n\n    // concatenate 
left/mid/right\n    ITensor* tensors[] = {left->getOutput(0), mid->getOutput(0), right->getOutput(0)};\n    IConcatenationLayer* concat = network->addConcatenation(tensors, 3);\n    concat->setAxis(1);\n    return concat;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIActivationLayer* bottleneck(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname, Dims4 inputShape) {\n    IConcatenationLayer* shift = addShift(network, input, inputShape, NUM_SEGMENTS, SHIFT_DIV);\n    assert(shift);\n\n    Weights emptywts{DataType::kFLOAT, 
nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolution(*shift->getOutput(0), outch, DimsHW{1, 1}, weightMap[lname + \"conv1.weight\"], emptywts);\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"bn1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer* conv2 = network->addConvolution(*relu1->getOutput(0), outch, DimsHW{3, 3}, weightMap[lname + \"conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setStride(DimsHW{stride, stride});\n    conv2->setPadding(DimsHW{1, 1});\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n    IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n\n    IConvolutionLayer* conv3 = network->addConvolution(*relu2->getOutput(0), outch * 4, DimsHW{1, 1}, weightMap[lname + \"conv3.weight\"], emptywts);\n    assert(conv3);\n\n    IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"bn3\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    if (stride != 1 || inch != outch * 4) {\n        IConvolutionLayer* conv4 = network->addConvolution(input, outch * 4, DimsHW{1, 1}, weightMap[lname + \"downsample.0.weight\"], emptywts);\n        assert(conv4);\n        conv4->setStride(DimsHW{stride, stride});\n\n        IScaleLayer* bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \"downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn4->getOutput(0), *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    } else {\n        ew1 = network->addElementWise(input, *bn3->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer* relu3 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n    return relu3;\n}\n\n// Create the engine using only the API and not any 
parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, DataType dt)\n{\n    INetworkDefinition* network = builder->createNetwork();\n\n    // Create input tensor of shape {NUM_SEGMENTS, 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims4{NUM_SEGMENTS, 3, INPUT_H, INPUT_W});\n    assert(data);\n    print(\"input\", data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(WEIGHTS_PATH);\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolution(*data, 64, DimsHW{7, 7}, weightMap[\"conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStride(DimsHW{2, 2});\n    conv1->setPadding(DimsHW{3, 3});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"bn1\", 1e-5);\n\n    // Add activation layer using the ReLU algorithm.\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    // Add max pooling layer with kernel size of 3x3, stride of 2x2, and padding of 1x1.\n    IPoolingLayer* pool1 = network->addPooling(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    assert(pool1);\n    pool1->setStride(DimsHW{2, 2});\n    pool1->setPadding(DimsHW{1, 1});\n    \n    int curHeight = int(INPUT_H / 4);\n    int curWidth = int(INPUT_W / 4);\n    IActivationLayer* x = bottleneck(network, weightMap, *pool1->getOutput(0), 64, 64, 1, \"layer1.0.\", Dims4{NUM_SEGMENTS, 64, curHeight, curWidth});\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 64, 1, \"layer1.1.\", Dims4{NUM_SEGMENTS, 256, curHeight, curWidth});\n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 64, 1, \"layer1.2.\", Dims4{NUM_SEGMENTS, 256, curHeight, curWidth});\n    \n    x = bottleneck(network, weightMap, *x->getOutput(0), 256, 128, 2, \"layer2.0.\", Dims4{NUM_SEGMENTS, 256, curHeight, curWidth});\n    curHeight = int(INPUT_H / 8);\n    curWidth = 
int(INPUT_W / 8);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.1.\", Dims4{NUM_SEGMENTS, 512, curHeight, curWidth});\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.2.\", Dims4{NUM_SEGMENTS, 512, curHeight, curWidth});\n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 128, 1, \"layer2.3.\", Dims4{NUM_SEGMENTS, 512, curHeight, curWidth});\n    \n    x = bottleneck(network, weightMap, *x->getOutput(0), 512, 256, 2, \"layer3.0.\", Dims4{NUM_SEGMENTS, 512, curHeight, curWidth});\n    curHeight = int(INPUT_H / 16);\n    curWidth = int(INPUT_W / 16);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.1.\", Dims4{NUM_SEGMENTS, 1024, curHeight, curWidth});\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.2.\", Dims4{NUM_SEGMENTS, 1024, curHeight, curWidth});\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.3.\", Dims4{NUM_SEGMENTS, 1024, curHeight, curWidth});\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.4.\", Dims4{NUM_SEGMENTS, 1024, curHeight, curWidth});\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 256, 1, \"layer3.5.\", Dims4{NUM_SEGMENTS, 1024, curHeight, curWidth});\n\n    x = bottleneck(network, weightMap, *x->getOutput(0), 1024, 512, 2, \"layer4.0.\", Dims4{NUM_SEGMENTS, 1024, curHeight, curWidth});\n    curHeight = int(INPUT_H / 32);\n    curWidth = int(INPUT_W / 32);\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"layer4.1.\", Dims4{NUM_SEGMENTS, 2048, curHeight, curWidth});\n    x = bottleneck(network, weightMap, *x->getOutput(0), 2048, 512, 1, \"layer4.2.\", Dims4{NUM_SEGMENTS, 2048, curHeight, curWidth});\n\n    IPoolingLayer* pool2 = network->addPooling(*x->getOutput(0), PoolingType::kAVERAGE, DimsHW{curHeight, curWidth});\n    assert(pool2);\n    pool2->setStride(DimsHW{1, 1});\n    \n    
IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), OUTPUT_SIZE, weightMap[\"fc.weight\"], weightMap[\"fc.bias\"]);\n    assert(fc1);\n\n    IReduceLayer* reduce = network->addReduce(*fc1->getOutput(0), ReduceOperation::kAVG, 1, false);\n    assert(reduce);\n\n    ISoftMaxLayer* softmax = network->addSoftMax(*reduce->getOutput(0));\n    assert(softmax);\n    softmax->setAxes(1);\n\n    softmax->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*softmax->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    ICudaEngine* engine = builder->buildCudaEngine(*network);\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)\n{\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize)\n{\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const 
int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * NUM_SEGMENTS * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * NUM_SEGMENTS * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./tsm_r50 -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./tsm_r50 -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(ENGINE_PATH, std::ios::binary);\n        if (!p)\n        {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        
return 1;\n    } else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(ENGINE_PATH, std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        return -1;\n    }\n\n\n    // Fill the input with dummy data (all ones)\n    static float data[NUM_SEGMENTS * 3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < NUM_SEGMENTS * 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1.0;\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference\n    static float prob[OUTPUT_SIZE];\n    doInference(*context, data, prob, 1);\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print the first and last 10 values of the output\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[i] << \", \";\n    }\n    std::cout << std::endl;\n    for (unsigned int i = 0; i < 10; i++)\n    {\n        std::cout << prob[OUTPUT_SIZE - 10 + i] << \", \";\n    }\n    std::cout << std::endl;\n    std::fstream writer(RESULT_PATH, std::ios::out);\n\n    writer << prob[0];\n    for(int i = 1; i < OUTPUT_SIZE ; i++) {\n        writer << \" \" << prob[i];\n    }\n    writer.close();\n\n    return 0;\n}\n"
  },
  {
    "path": "tsm/tsm_r50.py",
    "content": "import argparse\nimport os\nimport struct\n\nimport numpy as np\nimport pycuda.autoinit  # noqa\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nBATCH_SIZE = 1\nNUM_SEGMENTS = 8\nINPUT_H = 224\nINPUT_W = 224\nOUTPUT_SIZE = 400\nSHIFT_DIV = 8\n\nassert INPUT_H % 32 == 0 and INPUT_W % 32 == 0, \\\n    \"Input height and width should be a multiple of 32.\"\n\nEPS = 1e-5\nINPUT_BLOB_NAME = \"data\"\nOUTPUT_BLOB_NAME = \"prob\"\n\nTRT_LOGGER = trt.Logger(trt.Logger.INFO)\n\n\ndef load_weights(file):\n    print(f\"Loading weights: {file}\")\n\n    assert os.path.exists(file), f'Unable to load weight file {file}'\n\n    weight_map = {}\n    with open(file, \"r\") as f:\n        lines = [line.strip() for line in f]\n    count = int(lines[0])\n    assert count == len(lines) - 1\n    for i in range(1, count + 1):\n        splits = lines[i].split(\" \")\n        name = splits[0]\n        cur_count = int(splits[1])\n        assert cur_count + 2 == len(splits)\n        values = []\n        for j in range(2, len(splits)):\n            # hex string to bytes to float\n            values.append(struct.unpack(\">f\", bytes.fromhex(splits[j])))\n        weight_map[name] = np.array(values, dtype=np.float32)\n\n    return weight_map\n\n\ndef add_shift_module(network, input, input_shape, num_segments=8, shift_div=8):\n    fold = input_shape[1] // shift_div\n\n    # left\n    left_split = network.add_slice(input,\n                                   start=(1, 0, 0, 0),\n                                   shape=(num_segments - 1, fold,\n                                          input_shape[2], input_shape[3]),\n                                   stride=(1, 1, 1, 1))\n    assert left_split\n    left_split_shape = (1, fold, input_shape[2], input_shape[3])\n    left_blank = network.add_constant(shape=left_split_shape,\n                                      weights=np.zeros(left_split_shape,\n                                                       np.float32))\n    
assert left_blank\n    left = network.add_concatenation(\n        [left_split.get_output(0),\n         left_blank.get_output(0)])\n    assert left\n    left.axis = 0\n\n    # mid\n    mid_split_shape = (1, fold, input_shape[2], input_shape[3])\n    mid_blank = network.add_constant(shape=mid_split_shape,\n                                     weights=np.zeros(mid_split_shape,\n                                                      np.float32))\n    assert mid_blank\n    mid_split = network.add_slice(input,\n                                  start=(0, fold, 0, 0),\n                                  shape=(num_segments - 1, fold,\n                                         input_shape[2], input_shape[3]),\n                                  stride=(1, 1, 1, 1))\n    assert mid_split\n    mid = network.add_concatenation(\n        [mid_blank.get_output(0),\n         mid_split.get_output(0)])\n    assert mid\n    mid.axis = 0\n\n    # right\n    right = network.add_slice(input,\n                              start=(0, 2 * fold, 0, 0),\n                              shape=(num_segments, input_shape[1] - 2 * fold,\n                                     input_shape[2], input_shape[3]),\n                              stride=(1, 1, 1, 1))\n\n    # concat left mid right\n    output = network.add_concatenation(\n        [left.get_output(0),\n         mid.get_output(0),\n         right.get_output(0)])\n    assert output\n    output.axis = 1\n    return output\n\n\ndef add_batch_norm_2d(network, weight_map, input, layer_name, eps):\n    gamma = weight_map[layer_name + \".weight\"]\n    beta = weight_map[layer_name + \".bias\"]\n    mean = weight_map[layer_name + \".running_mean\"]\n    var = weight_map[layer_name + \".running_var\"]\n    var = np.sqrt(var + eps)\n\n    scale = gamma / var\n    shift = -mean / var * gamma + beta\n    return network.add_scale(input=input,\n                             mode=trt.ScaleMode.CHANNEL,\n                             shift=shift,\n                
             scale=scale)\n\n\ndef bottleneck(network, weight_map, input, in_channels, out_channels, stride,\n               layer_name, input_shape):\n    shift = add_shift_module(network, input, input_shape, NUM_SEGMENTS,\n                             SHIFT_DIV)\n    assert shift\n\n    conv1 = network.add_convolution(input=shift.get_output(0),\n                                    num_output_maps=out_channels,\n                                    kernel_shape=(1, 1),\n                                    kernel=weight_map[layer_name +\n                                                      \"conv1.weight\"],\n                                    bias=trt.Weights())\n    assert conv1\n\n    bn1 = add_batch_norm_2d(network, weight_map, conv1.get_output(0),\n                            layer_name + \"bn1\", EPS)\n    assert bn1\n\n    relu1 = network.add_activation(bn1.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu1\n\n    conv2 = network.add_convolution(input=relu1.get_output(0),\n                                    num_output_maps=out_channels,\n                                    kernel_shape=(3, 3),\n                                    kernel=weight_map[layer_name +\n                                                      \"conv2.weight\"],\n                                    bias=trt.Weights())\n    assert conv2\n    conv2.stride = (stride, stride)\n    conv2.padding = (1, 1)\n\n    bn2 = add_batch_norm_2d(network, weight_map, conv2.get_output(0),\n                            layer_name + \"bn2\", EPS)\n    assert bn2\n\n    relu2 = network.add_activation(bn2.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu2\n\n    conv3 = network.add_convolution(input=relu2.get_output(0),\n                                    num_output_maps=out_channels * 4,\n                                    kernel_shape=(1, 1),\n                                    kernel=weight_map[layer_name 
+\n                                                      \"conv3.weight\"],\n                                    bias=trt.Weights())\n    assert conv3\n\n    bn3 = add_batch_norm_2d(network, weight_map, conv3.get_output(0),\n                            layer_name + \"bn3\", EPS)\n    assert bn3\n\n    if stride != 1 or in_channels != 4 * out_channels:\n        conv4 = network.add_convolution(\n            input=input,\n            num_output_maps=out_channels * 4,\n            kernel_shape=(1, 1),\n            kernel=weight_map[layer_name + \"downsample.0.weight\"],\n            bias=trt.Weights())\n        assert conv4\n        conv4.stride = (stride, stride)\n\n        bn4 = add_batch_norm_2d(network, weight_map, conv4.get_output(0),\n                                layer_name + \"downsample.1\", EPS)\n        assert bn4\n\n        ew1 = network.add_elementwise(bn4.get_output(0), bn3.get_output(0),\n                                      trt.ElementWiseOperation.SUM)\n    else:\n        ew1 = network.add_elementwise(input, bn3.get_output(0),\n                                      trt.ElementWiseOperation.SUM)\n    assert ew1\n\n    relu3 = network.add_activation(ew1.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu3\n\n    return relu3\n\n\ndef create_engine(maxBatchSize, builder, dt, weights):\n    weight_map = load_weights(weights)\n    network = builder.create_network()\n\n    data = network.add_input(INPUT_BLOB_NAME, dt,\n                             (NUM_SEGMENTS, 3, INPUT_H, INPUT_W))\n    assert data\n\n    conv1 = network.add_convolution(input=data,\n                                    num_output_maps=64,\n                                    kernel_shape=(7, 7),\n                                    kernel=weight_map[\"conv1.weight\"],\n                                    bias=trt.Weights())\n    assert conv1\n    conv1.stride = (2, 2)\n    conv1.padding = (3, 3)\n\n    bn1 = add_batch_norm_2d(network, 
weight_map, conv1.get_output(0), \"bn1\",\n                            EPS)\n    assert bn1\n\n    relu1 = network.add_activation(bn1.get_output(0),\n                                   type=trt.ActivationType.RELU)\n    assert relu1\n\n    pool1 = network.add_pooling(input=relu1.get_output(0),\n                                window_size=trt.DimsHW(3, 3),\n                                type=trt.PoolingType.MAX)\n    assert pool1\n    pool1.stride = (2, 2)\n    pool1.padding = (1, 1)\n\n    cur_height = INPUT_H // 4\n    cur_width = INPUT_W // 4\n    x = bottleneck(network, weight_map, pool1.get_output(0), 64, 64, 1,\n                   \"layer1.0.\", (NUM_SEGMENTS, 64, cur_height, cur_width))\n    x = bottleneck(network, weight_map, x.get_output(0), 256, 64, 1,\n                   \"layer1.1.\", (NUM_SEGMENTS, 256, cur_height, cur_width))\n    x = bottleneck(network, weight_map, x.get_output(0), 256, 64, 1,\n                   \"layer1.2.\", (NUM_SEGMENTS, 256, cur_height, cur_width))\n\n    x = bottleneck(network, weight_map, x.get_output(0), 256, 128, 2,\n                   \"layer2.0.\", (NUM_SEGMENTS, 256, cur_height, cur_width))\n    cur_height = INPUT_H // 8\n    cur_width = INPUT_W // 8\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 128, 1,\n                   \"layer2.1.\", (NUM_SEGMENTS, 512, cur_height, cur_width))\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 128, 1,\n                   \"layer2.2.\", (NUM_SEGMENTS, 512, cur_height, cur_width))\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 128, 1,\n                   \"layer2.3.\", (NUM_SEGMENTS, 512, cur_height, cur_width))\n    x = bottleneck(network, weight_map, x.get_output(0), 512, 256, 2,\n                   \"layer3.0.\", (NUM_SEGMENTS, 512, cur_height, cur_width))\n    cur_height = INPUT_H // 16\n    cur_width = INPUT_W // 16\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 256, 1,\n                   \"layer3.1.\", (NUM_SEGMENTS, 
1024, cur_height, cur_width))\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 256, 1,\n                   \"layer3.2.\", (NUM_SEGMENTS, 1024, cur_height, cur_width))\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 256, 1,\n                   \"layer3.3.\", (NUM_SEGMENTS, 1024, cur_height, cur_width))\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 256, 1,\n                   \"layer3.4.\", (NUM_SEGMENTS, 1024, cur_height, cur_width))\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 256, 1,\n                   \"layer3.5.\", (NUM_SEGMENTS, 1024, cur_height, cur_width))\n\n    x = bottleneck(network, weight_map, x.get_output(0), 1024, 512, 2,\n                   \"layer4.0.\", (NUM_SEGMENTS, 1024, cur_height, cur_width))\n    cur_height = INPUT_H // 32\n    cur_width = INPUT_W // 32\n    x = bottleneck(network, weight_map, x.get_output(0), 2048, 512, 1,\n                   \"layer4.1.\", (NUM_SEGMENTS, 2048, cur_height, cur_width))\n    x = bottleneck(network, weight_map, x.get_output(0), 2048, 512, 1,\n                   \"layer4.2.\", (NUM_SEGMENTS, 2048, cur_height, cur_width))\n\n    pool2 = network.add_pooling(x.get_output(0),\n                                window_size=trt.DimsHW(cur_height, cur_width),\n                                type=trt.PoolingType.AVERAGE)\n    assert pool2\n    pool2.stride = (1, 1)\n\n    fc1 = network.add_fully_connected(input=pool2.get_output(0),\n                                      num_outputs=OUTPUT_SIZE,\n                                      kernel=weight_map['fc.weight'],\n                                      bias=weight_map['fc.bias'])\n    assert fc1\n\n    reshape = network.add_shuffle(fc1.get_output(0))\n    assert reshape\n    reshape.reshape_dims = (NUM_SEGMENTS, OUTPUT_SIZE)\n\n    reduce = network.add_reduce(reshape.get_output(0),\n                                op=trt.ReduceOperation.AVG,\n                                axes=1,\n                      
          keep_dims=False)\n    assert reduce\n\n    softmax = network.add_softmax(reduce.get_output(0))\n    assert softmax\n    softmax.axes = 1\n\n    softmax.get_output(0).name = OUTPUT_BLOB_NAME\n    network.mark_output(softmax.get_output(0))\n\n    # Build engine\n    builder.max_batch_size = maxBatchSize\n    builder.max_workspace_size = 1 << 20\n    engine = builder.build_cuda_engine(network)\n\n    del network\n    del weight_map\n\n    return engine\n\n\ndef do_inference(context, host_in, host_out, batchSize):\n    devide_in = cuda.mem_alloc(host_in.nbytes)\n    devide_out = cuda.mem_alloc(host_out.nbytes)\n    bindings = [int(devide_in), int(devide_out)]\n    stream = cuda.Stream()\n\n    cuda.memcpy_htod_async(devide_in, host_in, stream)\n    context.execute_async(batch_size=batchSize,\n                          bindings=bindings,\n                          stream_handle=stream.handle)\n    cuda.memcpy_dtoh_async(host_out, devide_out, stream)\n    stream.synchronize()\n\n\ndef inference_mmaction2(inputs, config, checkpoint):\n    import torch\n    from mmaction.models import build_model\n    from mmcv import Config\n    from mmcv.runner import load_checkpoint\n\n    cfg = Config.fromfile(config)\n    cfg.model.backbone.pretrained = None\n    model = build_model(cfg.model,\n                        train_cfg=None,\n                        test_cfg=cfg.get('test_cfg'))\n    load_checkpoint(model, checkpoint, map_location='cpu')\n    model.eval()\n    inputs = torch.tensor(inputs)\n    with torch.no_grad():\n        return model(return_loss=False, imgs=inputs)\n\n\ndef main(args):\n    assert not (args.save_engine_path and args.load_engine_path)\n\n    if args.load_engine_path:\n        # load from local file\n        runtime = trt.Runtime(TRT_LOGGER)\n        assert runtime\n        with open(args.load_engine_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n    else:\n        # Create network and engine\n        assert 
args.tensorrt_weights\n        builder = trt.Builder(TRT_LOGGER)\n        engine = create_engine(BATCH_SIZE, builder, trt.float32,\n                               args.tensorrt_weights)\n    assert engine\n    assert engine.num_bindings == 2\n\n    if args.save_engine_path is not None:\n        # save engine to local file\n        with open(args.save_engine_path, \"wb\") as f:\n            f.write(engine.serialize())\n        print(f\"{args.save_engine_path} Generated successfully.\")\n\n    context = engine.create_execution_context()\n    assert context\n\n    host_in = cuda.pagelocked_empty(BATCH_SIZE * NUM_SEGMENTS * 3 * INPUT_H *\n                                    INPUT_W,\n                                    dtype=np.float32)\n    host_out = cuda.pagelocked_empty(BATCH_SIZE * OUTPUT_SIZE,\n                                     dtype=np.float32)\n\n    if args.test_mmaction2:\n        assert args.mmaction2_config and args.mmaction2_checkpoint, \\\n            \"MMAction2 config and checkpoint must not be None\"\n\n        data = np.random.randn(BATCH_SIZE, NUM_SEGMENTS, 3, INPUT_H,\n                               INPUT_W).astype(np.float32)\n\n        # TensorRT inference\n        np.copyto(host_in, data.ravel())\n        do_inference(context, host_in, host_out, BATCH_SIZE)\n\n        # pytorch inference\n        pytorch_results = inference_mmaction2(data, args.mmaction2_config,\n                                              args.mmaction2_checkpoint)\n\n        # test\n        from numpy.testing import assert_array_almost_equal\n        assert_array_almost_equal(host_out.reshape(-1),\n                                  pytorch_results.reshape(-1),\n                                  decimal=4)\n        print(\"MMAction2 TEST PASSED\")\n\n    if args.test_cpp:\n        assert args.cpp_result_path, \"Should set --cpp-result-path\"\n        assert os.path.exists(args.cpp_result_path),\\\n            f\"{args.cpp_result_path} doesn't exist\"\n\n        # C++ API fixed 
inputs\n        inputs = np.ones((BATCH_SIZE, NUM_SEGMENTS, 3, INPUT_H, INPUT_W),\n                         dtype=np.float32)\n\n        # TensorRT inference\n        np.copyto(host_in, inputs.ravel())\n        do_inference(context, host_in, host_out, BATCH_SIZE)\n\n        # Read cpp inference results\n        with open(args.cpp_result_path, \"r\") as f:\n            data = f.read().strip()\n        cpp_results = np.array([float(d)\n                                for d in data.split(\" \")]).astype(np.float32)\n\n        # test\n        from numpy.testing import assert_array_almost_equal\n        assert_array_almost_equal(host_out.reshape(-1),\n                                  cpp_results.reshape(-1),\n                                  decimal=4)\n        print(\"CPP TEST PASSED\")\n\n    if args.input_video:\n        # Get ONE prediction result from ONE video\n        # Use demo.mp4 from MMAction2\n        import cv2\n\n        # get selected frame id of uniform sampling\n        cap = cv2.VideoCapture(args.input_video)\n        sample_length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))\n        avg_interval = sample_length / float(NUM_SEGMENTS)\n        base_offsets = np.arange(NUM_SEGMENTS) * avg_interval\n        clip_offsets = (base_offsets + avg_interval / 2.0).astype(np.int32)\n\n        # read frames\n        frames = []\n        for i in range(max(clip_offsets) + 1):\n            flag, frame = cap.read()\n            if i in clip_offsets:\n                # cv2.resize expects the target size as (width, height)\n                frames.append(cv2.resize(frame, (INPUT_W, INPUT_H)))\n        frames = np.array(frames)\n\n        # preprocessing frames\n        mean = np.array([123.675, 116.28, 103.53])\n        std = np.array([58.395, 57.12, 57.375])\n        frames = (frames - mean) / std\n        frames = frames.transpose([0, 3, 1, 2])\n\n        # TensorRT inference\n        np.copyto(host_in, frames.ravel())\n        do_inference(context, host_in, host_out, BATCH_SIZE)\n        # For demo.mp4, should be 6, aka arm wrestling\n        class_id = np.argmax(host_out.reshape(-1))\n        print(\n            f'Result class id {class_id}, score {host_out[class_id]:.2f}'\n        )\n\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\n        \"--tensorrt-weights\",\n        type=str,\n        default=None,\n        help=\"Path to TensorRT weights, which is generated by gen_weights.py\")\n    parser.add_argument(\"--input-video\",\n                        type=str,\n                        default=None,\n                        help=\"Path to local video file\")\n    parser.add_argument(\"--save-engine-path\",\n                        type=str,\n                        default=None,\n                        help=\"Save engine to local file\")\n    parser.add_argument(\"--load-engine-path\",\n                        type=str,\n                        default=None,\n                        help=\"Saved engine file path\")\n    parser.add_argument(\"--test-mmaction2\",\n                        action='store_true',\n                        help=\"Compare TensorRT results with MMAction2 Results\")\n    parser.add_argument(\"--mmaction2-config\",\n                        type=str,\n                        default=None,\n                        help=\"Path to MMAction2 config file\")\n    parser.add_argument(\"--mmaction2-checkpoint\",\n                        type=str,\n                        default=None,\n                        help=\"Path to MMAction2 checkpoint url or file path\")\n    parser.add_argument(\"--test-cpp\",\n                        action='store_true',\n                        help=\"Compare Python API results with C++ API results\")\n    parser.add_argument(\"--cpp-result-path\",\n                        type=str,\n                        default='./build/result.txt',\n                        help=\"Path to C++ API results\")\n\n    main(parser.parse_args())\n"
  },
  {
    "path": "tutorials/check_fp16_int8_support.md",
    "content": "# Check if Your GPU Supports FP16/INT8\n\n## 1. check your GPU Compute Capability\n\nvisit https://developer.nvidia.com/cuda-gpus#compute and check your GPU compute capability.\n\nFor example, GTX1080 is 6.1, Tesla T4 is 7.5.\n\n## 2. check the hardware-precision-matrix\n\nvisit https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix and check the matrix.\n\nFor example, compute capability 6.1 supports FP32 and INT8. 7.5 supports FP32, FP16, INT8, FP16 tensor core, etc.\n"
  },
  {
    "path": "tutorials/faq.md",
    "content": "# Frequently Asked Questions (FAQ)\n\n## 1. fatal error: NvInfer.h: No such file or directory\n\n`NvInfer.h` is one of the headers of TensorRT. If you install the tensorrt DEB package, the headers should in `/usr/include/x86_64-linux-gnu/`. If you install tensorrt TAR or ZIP file, it is recommended to manage TensorRT with modern CMake syntax, e.g. [FindTensorRT.cmake](../lenet/FindTensorRT.cmake).\n\n`dpkg -L` can print out the contents of a DEB package.\n\n```\n$ dpkg -L libnvinfer-dev\n/.\n/usr\n/usr/lib\n/usr/lib/x86_64-linux-gnu\n/usr/lib/x86_64-linux-gnu/libnvinfer_static.a\n/usr/lib/x86_64-linux-gnu/libmyelin_compiler_static.a\n/usr/lib/x86_64-linux-gnu/libmyelin_executor_static.a\n/usr/lib/x86_64-linux-gnu/libmyelin_pattern_library_static.a\n/usr/lib/x86_64-linux-gnu/libmyelin_pattern_runtime_static.a\n/usr/include\n/usr/include/x86_64-linux-gnu\n/usr/include/x86_64-linux-gnu/NvInfer.h\n/usr/include/x86_64-linux-gnu/NvInferRuntime.h\n/usr/include/x86_64-linux-gnu/NvInferRuntimeCommon.h\n/usr/include/x86_64-linux-gnu/NvInferVersion.h\n/usr/include/x86_64-linux-gnu/NvUtils.h\n/usr/share\n/usr/share/doc\n/usr/share/doc/libnvinfer-dev\n/usr/share/doc/libnvinfer-dev/copyright\n/usr/share/doc/libnvinfer-dev/changelog.Debian\n/usr/lib/x86_64-linux-gnu/libmyelin.so\n/usr/lib/x86_64-linux-gnu/libnvinfer.so\n```\n\n## 2. fatal error: cuda_runtime_api.h: No such file or directory\n\n`cuda_runtime_api.h` is from cuda-cudart. 
If you met this error, you need find where it is and adapt the `include_directories` and `link_directories` of cuda in `CMakeLists.txt`.\n\n```\n$ dpkg -L cuda-cudart-dev-10-0\n/.\n/usr\n/usr/local\n/usr/local/cuda-10.0\n/usr/local/cuda-10.0/targets\n/usr/local/cuda-10.0/targets/x86_64-linux\n/usr/local/cuda-10.0/targets/x86_64-linux/lib\n/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudadevrt.a\n/usr/local/cuda-10.0/targets/x86_64-linux/lib/libOpenCL.so.1.1\n/usr/local/cuda-10.0/targets/x86_64-linux/lib/libculibos.a\n/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart_static.a\n/usr/local/cuda-10.0/targets/x86_64-linux/include\n/usr/local/cuda-10.0/targets/x86_64-linux/include/cuda_runtime_api.h\n/usr/local/cuda-10.0/targets/x86_64-linux/include/cudart_platform.h\n/usr/local/cuda-10.0/targets/x86_64-linux/include/cuda_device_runtime_api.h\n/usr/local/cuda-10.0/targets/x86_64-linux/include/cuda_runtime.h\n/usr/lib\n/usr/lib/pkgconfig\n/usr/lib/pkgconfig/cudart-10.0.pc\n/usr/share\n/usr/share/doc\n/usr/share/doc/cuda-cudart-dev-10-0\n/usr/share/doc/cuda-cudart-dev-10-0/changelog.Debian.gz\n/usr/share/doc/cuda-cudart-dev-10-0/copyright\n/usr/local/cuda-10.0/targets/x86_64-linux/lib/libOpenCL.so\n/usr/local/cuda-10.0/targets/x86_64-linux/lib/libOpenCL.so.1\n/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so\n```\n\n## 3. .wts not prepared or not in the right directory\n\nIf .wts file not in the right directory. The loadWeights() function will report error. Error logs like following.\n\nBy default, the .wts file usually should be put in the same dir as `build`. For example, `tensorrtx/yolov5/yolov5s.wts`. And the .wts path defined in `yolov5.cpp`.\n\n```\nstd::map<std::__cxx11::basic_string, nvinfer1::Weights> loadWeights(std::__cxx11::string): Assertion `input.is_open() && \"Unable to load weight file.\"' failed.\nAborted (core dumped)\n```\n\n## 4. 
yolo -s failed, class_num not adapted\n\nIf you train your own yolo model, you need set the `CLASS_NUM` in `yololayer.h`. Which is `80` by default. Otherwise, you will get errors like following.\n\n```\n[Convolution]: kernel weights has count xxx but xxx was expected\nvoid APIToModel(unsigned int, nvinfer1::IHostMemory**): Assertion `engine != nullptr' failed.\nAborted (core dumped)\n```\n"
  },
  {
    "path": "tutorials/from_pytorch_to_trt_stepbystep_hrnet.md",
    "content": "# 使用 TRT 加速网络-零\n\n本次教程以 HRNet 分类器（HRNet-W18-C-Small-v2）为例子\n\ncode：https://github.com/HRNet/HRNet-Image-Classification\n\npaper：https://arxiv.org/abs/1908.07919\n\n## 1 论文网络的基本了解\n\n无论是仅仅使用网络还是要对网络改进，首先都要对网络有一定了解。对于这种比较火的网络，网上大批详解博客，可以多去阅读，加上论文，来对网络理解。\n\nHRNet 分类器网络看起来很简单，如下图\n\n![682463-20200104221712824-157549407](https://user-images.githubusercontent.com/20653176/93749152-ff957680-fc2b-11ea-883c-79046e41ace8.png)\n\n从网络中可看到基本组件很简单：卷积和 upsmple。【这里就表明网络 TRT 加速时不会有 plugin 的需求。】\n\n参考博客：\n\n1. https://www.cnblogs.com/darkknightzh/p/12150637.html\n2. https://zhuanlan.zhihu.com/p/143385915\n3. https://blog.csdn.net/weixin_37993251/article/details/88043650\n4. https://blog.csdn.net/weixin_38715903/article/details/101629781?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param&depth_1-utm_source=dis\n\n## 2 pytorch 代码跑通\n\n跑通 demo 是很重要的一步。跑通后就可以一步一步跟进，看到底走了哪些层，这样心里就会有一个基本框架；然后可以生成 wts 文件；同时也可以生成 onnx 文件。\n\n上述的**参考博客 4**中对代码有详细介绍，可以详细分析下。\n\n建议：**对于运行环境，建议使用 anaconda 的 conda create 创建虚拟环境，这样没有一系列环境问题。**\n\n```python\nconda create -n xx python=3.7   # 创建环境\nactivate xx    # 激活\npip install xxxx  # 安装包\ndeactivate xx  # 推出环境\n```\n\n在生成 wts 文件时，没有必须每次都是去配置`gen_wts.py`，主要是读取模型，保存模型参数。只要 demo 文件跑通就可以随时保存为 wts。\n\n## 3 pytorch 代码 debug\n\n这一步骤单独拉出来是因为在 debug 的过程中，要关注经过哪些层，预处理有哪些，后处理有哪些。另外在后面搭建 TRT 网络时，还要根据 debug 过程在中的一些信息来调试 trt 网络。\n\n## 4 网络的可视化\n\n将 pytorch 模型保存为 onnx，可有可无。但是建议如果可以保存，就使用 onnx 来可视化网络。这样对网络架构一级每层的输入输出就会非常明了。\n\n如果无法保存 onnx，搭建网络时，要根据 wts 来分析，比较麻烦。\n\n另外强烈建议：**无论是否保存了 onnx，都要手动在纸上将网络在画一遍，，并且将每层的输出维度标注下来，这样搭建层比较多的网络时，不会晕，并且在 debugTRT 网络时可以有效定位错误。**\n\n在手动画网络图时，可以给每个节点“标号”，利用该“标号”在搭建 TRT 网络时，可以很清楚知道 **“哪个节点输入，经过某种操作，输出哪个节点。”**\n\n在 onnx 图中看到几个层一定要心里有数：\n\n比如下面红线框出的一大块实际上就是 upsample 层\n\n![](imgs/93747936-0ae7a280-fc2a-11ea-86c1-9f72622402b9.png))\n\n下面的为 FC 
层：\n\n![image-20200918141448071](https://user-images.githubusercontent.com/20653176/93749177-0de39280-fc2c-11ea-8a20-b8ab0b3b940f.png)\n\nConv+BN+Relu 层\n\n![image-20200918141632723](https://user-images.githubusercontent.com/20653176/93749201-189e2780-fc2c-11ea-9aad-0ac7723575c4.png)\n\nResBlock 层\n\n![image-20200918141709487](https://user-images.githubusercontent.com/20653176/93749220-2358bc80-fc2c-11ea-998a-0892755dfbc0.png)\n\n单击节点。会有详细信息，这些信息使搭建网络变得方便。\n\n![image-20200918141931327](https://user-images.githubusercontent.com/20653176/93749222-2489e980-fc2c-11ea-9025-c5d367efd7f9.png)\n\n如果无法导出 onnx：\n\n搭建网络时需要从 wts 中查看层名，各个卷积层信息需要从代码中分析。\n\n![image_f](https://user-images.githubusercontent.com/20653176/93750398-fd341c00-fc2d-11ea-9077-ee749b6aef41.png)\n\n![image-20200918142959711](https://user-images.githubusercontent.com/20653176/93749484-8fd3bb80-fc2c-11ea-951d-3c1f403e521a.png)\n\n## 5 TRT 搭建网络\n\n搭建网络时就按照 onnx 图一层一层搭建。\n\n几点建议：\n\n1 要不断去查 API 的使用 https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/index.html\n\n2 利用已有的模块，不要重复造轮子\n\n3 各个层名使用 onnx 的 id，这样在搭建网络时不会晕。，根据 onnx 的结点信息，各层之间的连接也不会出错。\n\n## 6 TRT 网络 debug\n\n搭建网络过程肯定会出错，debug 是必要的手段：\n\n1 打印每层的维度\n\n```c++\nDims dim = id_1083->getOutput(0)->getDimensions();\nstd::cout << dim[0] << \" \" << dim[1] << \" \" << dim[2] << \" \" << dim[3] << std::endl;\n```\n\n**一般如果出现生成 engine 就失败的情况，就从 createEngine 的第一句开始调试，并且随时关注窗口输出，如果在某一层出现大量提示信息，那么该层就会有问题，就将该层的输入 tensor 维度和输出 tensor 维度信息都打印出来，看输出的维度是否正常。**\n\n2 打印输出\n\nTRT 是先构建网络，然后再 enqueue 时才能得到各层的输出信息，因此若想对比每一层的输出，需要将该层设置为 output 层\n\n```c++\nout->getOutput(0)->setName(OUTPUT_BLOB_NAME);  // out可替换为任意一层\nnetwork->markOutput(*out->getOutput(0));\n```\n\n3 关注输入层 data\n\n数据层的 debug 无需第 2 步的做法，直接可以查看预处理后的结果。在 debug\n\n## 7 TRT 代码整理\n\n这里就是将 TRT 搭建的网络，能封装函数，就封装为函数模块，增加代码可读性。\n"
  },
  {
    "path": "tutorials/getting_started.md",
    "content": "# Getting Started with TensorRTx\n\n## 1. Setup the development environment\n\n(**RECOMMENDED**) If you prefer to run everything in a docker container, check [HERE](../docker/README.md)\n\nIf you prefer to install every dependencies locally, check [HERE](./install.md)\n\n## 2. Run TensorRTx demo\n\nIt is recommended to go through the [lenet5](https://github.com/wang-xinyu/tensorrtx/tree/master/lenet) or [mlp](https://github.com/wang-xinyu/tensorrtx/tree/master/mlp) first. But if you are proficient in TensorRT, please check the readme file of the model you want directly.\n\nWe use \"lenet5\" to explain how we build DL network with TensorRT API.\n\n### 2.1. Export lenet5 weights in pytorch\n\n1. Clone the [wang-xinyu/pytorchx](https://github.com/wang-xinyu/pytorchx) in your machine, then enter lenet folder:\n\n   ```bash\n   pip install torch\n   git clone https://github.com/wang-xinyu/pytorchx\n   cd pytorchx/lenet\n   ```\n\n2. Run lenet5.py to generate lenet5.pth which is the pytorch serialized model. The lenet5 arch is defined in lenet5.py.\n\n   ```bash\n   python lenet5.py\n   ```\n\n3. Run inference.py to generate lenet5.wts, which is weights file for tensorrt.\n\n   ```bash\n   python inference.py\n   ```\n\nThe terminal output would be like:\n\n```txt\nthe output of lenet5 is [[0.0950, 0.0998, 0.1101, 0.0975, 0.0966, 0.1097, 0.0948, 0.1056, 0.0992, 0.0917]], shape is [1, 10].\n\ncuda device count:  2\ninput:  torch.Size([1, 1, 32, 32])\nconv1 torch.Size([1, 6, 28, 28])\npool1:  torch.Size([1, 6, 14, 14])\nconv2 torch.Size([1, 16, 10, 10])\npool2 torch.Size([1, 16, 5, 5])\nview:  torch.Size([1, 400])\nfc1:  torch.Size([1, 120])\nlenet out: tensor([[0.0950, 0.0998, 0.1101, 0.0975, 0.0966, 0.1097, 0.0948, 0.1056, 0.0992,\n         0.0917]], device='cuda:0', grad_fn=<SoftmaxBackward>)\n```\n\n### 2.2. Run lenet5 in TensorRT\n\nClone the wang-xinyu/tensorrtx in your machine. 
Enter lenet folder, copy lenet5.wts generated above, and cmake&make c++ code.\n\nAnd of course you should install cuda/cudnn/tensorrt first. You might need to adapt the tensorrt path in CMakeLists.txt if you install tensorrt from tar package.\n\n```bash\ngit clone https://github.com/wang-xinyu/tensorrtx\ncd tensorrtx/lenet\ncp [PATH-OF-pytorchx]/pytorchx/lenet/lenet5.wts .\ncmake -S . -B build\ncd build\nmake\n```\n\nIf the `make` succeed, the executable `lenet` will generated.\n\nRun lenet to build tensorrt engine and serialize it to file `lenet5.engine`.\n\n```bash\n./lenet -s\n```\n\nDeserialize the engine and run inference.\n\n```bash\n./lenet -d\n```\n\nYou should see the output like this,\n\n```txt\nOutput:\n\n0.0949623, 0.0998472, 0.110072, 0.0975036, 0.0965564, 0.109736, 0.0947979, 0.105618, 0.099228, 0.0916792,\n```\n\n## 3. Compare the two output\n\nAs the input to pytorch and tensorrt are same, i.e. a [1,1,32,32] all ones tensor.\n\nSo the output should be same, otherwise there must be something wrong.\n\n```txt\nThe pytorch output is\n0.0950, 0.0998, 0.1101, 0.0975, 0.0966, 0.1097, 0.0948, 0.1056, 0.0992, 0.0917\n\nThe tensorrt output is\n0.0949623, 0.0998472, 0.110072, 0.0975036, 0.0965564, 0.109736, 0.0947979, 0.105618, 0.099228, 0.0916792\n```\n\nSame! exciting, isn't it?\n\n## 4. The `.wts` content format\n\nThe `.wts` is plain text file, e.g. 
`lenet5.wts`, part of the contents are:\n\n```txt\n10\nconv1.weight 150 be40ee1b bd20bab8 bdc4bc53 ...\nconv1.bias 6 bd327058 ...\nconv2.weight 2400 3c6f2220 3c693090 ...\nconv2.bias 16 bd183967 bcb1ac8a ...\nfc1.weight 48000 3c162c20 bd25196a ...\nfc1.bias 120 3d3c3d49 bc64b948 ...\nfc2.weight 10080 bce095a4 3d33b9dc ...\nfc2.bias 84 bc71eaa0 3d9b276c ...\nfc3.weight 840 3c252870 3d855351 ...\nfc3.bias 10 bdbe4bb8 3b119ee0 ...\n...\n```\n\nThe first line is a number, indicate how many lines it has, excluding itself.\n\nAnd then each line is\n\n`[weight name] [value count = N] [value1] [value2], ..., [valueN]`\n\nThe value is in HEX format.\n\n## 5. Frequently Asked Questions (FAQ)\n\ncheck [HERE](./faq.md) for the answers of questions you may encounter.\n"
  },
  {
    "path": "tutorials/install.md",
    "content": "# Install the dependencies of tensorrtx\n\nUsing docker as development environment is strongly recommended, you may check [HERE](../docker/README) for the deployment instructions of docker container and _ignore_ the rest of this document.\n\nWhile if this is not your case, we always recommend using major LTS version of your OS, Nvidia driver, CUDA, and so on.\n\n## OS\n\nUbuntu-22.04 is recommended. It is strongly recommended to use `apt` to manage packages in Ubuntu.\n\n## Nvidia Related\n\n### Driver\n\nYou should install the nvidia driver first before anything else, go to [Ubuntu Driver Installation Guide](https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#ubuntu) for more details.\n\n**NOTE**: Since version 560, the installation step is a little different than before, check [HERE](https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#recent-updates) for more details.\n\n### CUDA\n\nGo to [NVIDIA CUDA Installation Guide for Linux](https://developer.nvidia.com/cuda-10.0-download-archive) for the detailed steps.\n\n**NOTE**:\n\n- Do not forget to check [Post-installation Actions](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions) to setup the environment correctly.\n- Make your CUDA version comply with your driver version\n- If you want multi-version CUDA, docker is strongly recommended.\n\n### TensorRT\n\ncheck [HERE](https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#downloading) to install TensorRT.\n\n### (Optional) OpenCV\n\n```\nsudo apt-get update && sudo apt install libgtk-3-dev libopencv-dev\n```\n\n## Verify installation\n\n```\ndpkg -l | grep cuda\ndpkg -l | grep nvinfer\ndpkg -l | grep opencv\n```\n"
  },
  {
    "path": "tutorials/measure_performance.md",
    "content": "# Measure performance of TensorRT\n\n## 1. add some variables and structures\n\nsee https://github.com/NVIDIA/TensorRT/tree/master/samples/sampleNMT for more detail.\n\n```c++\n// for rcnn, you can put these code into common.hpp\n#include \"logging.h\" // rcnn/logging.h\nstatic Logger gLogger{ Logger::Severity::kINFO };\nstatic LogStreamConsumer gLogInfo{ LOG_INFO(gLogger) };\n\nstruct SimpleProfiler : public nvinfer1::IProfiler\n{\n    struct Record\n    {\n        float time{ 0 };\n        int count{ 0 };\n    };\n\n    virtual void reportLayerTime(const char* layerName, float ms)\n    {\n        mProfile[layerName].count++;\n        mProfile[layerName].time += ms;\n        if (std::find(mLayerNames.begin(), mLayerNames.end(), layerName) == mLayerNames.end())\n        {\n            mLayerNames.push_back(layerName);\n        }\n    }\n\n    SimpleProfiler(const char* name, const std::vector<SimpleProfiler>& srcProfilers = std::vector<SimpleProfiler>())\n        : mName(name)\n    {\n        for (const auto& srcProfiler : srcProfilers)\n        {\n            for (const auto& rec : srcProfiler.mProfile)\n            {\n                auto it = mProfile.find(rec.first);\n                if (it == mProfile.end())\n                {\n                    mProfile.insert(rec);\n                }\n                else\n                {\n                    it->second.time += rec.second.time;\n                    it->second.count += rec.second.count;\n                }\n            }\n        }\n    }\n\n    friend std::ostream& operator<<(std::ostream& out, const SimpleProfiler& value)\n    {\n        out << \"========== \" << value.mName << \" profile ==========\" << std::endl;\n        float totalTime = 0;\n        std::string layerNameStr = \"TensorRT layer name\";\n        int maxLayerNameLength = std::max(static_cast<int>(layerNameStr.size()), 70);\n        for (const auto& elem : value.mProfile)\n        {\n            totalTime += 
elem.second.time;\n            maxLayerNameLength = std::max(maxLayerNameLength, static_cast<int>(elem.first.size()));\n        }\n\n        auto old_settings = out.flags();\n        auto old_precision = out.precision();\n        // Output header\n        {\n            out << std::setw(maxLayerNameLength) << layerNameStr << \" \";\n            out << std::setw(12) << \"Runtime, \"\n                << \"%\"\n                << \" \";\n            out << std::setw(12) << \"Invocations\"\n                << \" \";\n            out << std::setw(12) << \"Runtime, ms\" << std::endl;\n        }\n        for (size_t i = 0; i < value.mLayerNames.size(); i++)\n        {\n            const std::string layerName = value.mLayerNames[i];\n            auto elem = value.mProfile.at(layerName);\n            out << std::setw(maxLayerNameLength) << layerName << \" \";\n            out << std::setw(12) << std::fixed << std::setprecision(1) << (elem.time * 100.0F / totalTime) << \"%\"\n                << \" \";\n            out << std::setw(12) << elem.count << \" \";\n            out << std::setw(12) << std::fixed << std::setprecision(2) << elem.time << std::endl;\n        }\n        out.flags(old_settings);\n        out.precision(old_precision);\n        out << \"========== \" << value.mName << \" total runtime = \" << totalTime << \" ms ==========\" << std::endl;\n\n        return out;\n    }\n\nprivate:\n    std::string mName;\n    std::vector<std::string> mLayerNames;\n    std::map<std::string, Record> mProfile;\n};\n```\n\n## 2. set profiler for context and print the log\n\n```c++\n// you'd better set name for every layers\n// build engine\n// build context\nauto sp = SimpleProfiler(\"test\");\ncontext->setProfiler(&sp);\ncontext->enqueue(...);\ngLogInfo << sp << std::endl;\n```\n"
  },
  {
    "path": "tutorials/migration_guide.md",
    "content": "# Migration Guide\n\n## <u>Newest</u> Migration Guide\n\nPlease check [Page](https://docs.nvidia.com/deeplearning/tensorrt/migration-guide/index.html)\n\nFor any archives version, please check this [Page](https://docs.nvidia.com/deeplearning/tensorrt/archives/index.html)\n\n## (DEPRECATED) Migrating from TensorRT 4.x to 7.x\n\n**NOTE**: Both TensorRT 4.x and 7.x are **DEPRECATED** by NVIDIA officially, so this part is **outdated**.\n\nThe following APIs are deprecated and replaced in TensorRT 7.\n- `DimsCHW`, replaced by `Dims3`\n- `addConvolution()`, replaced by `addConvolutionNd()`\n- `addPooling()`, replaced by `addPoolingNd()`\n- `addDeconvolution()`, replaced by `addDeconvolutionNd()`\n- `createNetwork()`, replaced by `createNetworkV2()`\n- `buildCudaEngine()`, replaced by `buildEngineWithConfig()`\n- `createPReLUPlugin()`, replaced by `addActivation()` with `ActivationType::kLEAKY_RELU`\n- `IPlugin` and `IPluginExt` class, replaced by `IPluginV2IOExt` or `IPluginV2DynamicExt`\n- Use the new `Logger` class defined in `logging.h`\n"
  },
  {
    "path": "tutorials/multi_GPU_processing.md",
    "content": "# How to Implement Multi-GPU Processing\n\nMaybe you hope to take advantage of multiple GPU to make inference even faster. Here are few tips to help you deal with it! Take **YOLO V4** as an example.\n\n## 1. Make custom plugin (i.e. YOLO layer and Mish layer for YOLO V4) running asynchronically.\n\nTo do this, we need to use CudaStream parameter in the kernels of all custom layers and use asynchronous functions.\nFor example, in function ` forwardGpu()` of **yololayer.cu**, you need to do the following changes to make sure that the engine will be running on a specific CudaStream.\n\n  1) Change `cudaMemset(output + idx*outputElem, 0, sizeof(float))` to `cudaMemsetAsync(output + idx*outputElem, 0, sizeof(float), stream)`\n  2) Change `CalDetection<<< (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount>>>(inputs[i],output, numElem, yolo.width, yolo.height, (float *)mAnchor[i], mClassCount ,outputElem)` to `CalDetection<<< (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>(inputs[i],output, numElem, yolo.width, yolo.height, (float *)mAnchor[i], mClassCount ,outputElem)`\n\n  ## 2. Create an engine for each device you want to use.\n\n  Maybe it is a good idea to create a struct to store the engine, context and buffer for each device individually. For example,\n  ```\n  struct Plan{\n    IRuntime* runtime;\n    ICudaEngine* engine;\n    IExecutionContext* context;\n    void buffers[2];\n    cudaStream_t stream;\n  };\n  ```\n  And then use `cudaSetDevice()` to make each engine you create running on specific device. Moreover, to maximize performance, make sure that the engine file you are using to deserialize is the one tensor RT optimized for this device.\n\n  ## 3. 
Use functions wisely\n  Here are some things I learned while trying to parallelize inference.\n  1) Do not call synchronous functions, such as `cudaFree()`, during inference.\n  2) Use `cudaMallocHost()` instead of `malloc()` when allocating memory on the host side.\n"
  },
  {
    "path": "tutorials/run_on_windows.md",
    "content": "# How to Compile and Run on Windows\n\nThis tutorial can be applied to any models in this repo. Only need to adapt couple of lines.\n\n## Environments\n\n* vs (only vs2015, vs2017 tested)\n* cuda\n* TensorRT\n* Cmake\n* opencv\n* dirent.h for windows, put into tensorrtx/include, download from https://github.com/tronkko/dirent\n\n  ![image-20200828131208257](https://user-images.githubusercontent.com/20653176/91524367-99217f00-e931-11ea-9a13-fb420403b73b.png)\n\n## Compile and Run\n\n### 1. Modify CmakeLists.txt\n\n```cmake\ncmake_minimum_required(VERSION 2.6)\n\nproject(yolov5) # 1\nset(OpenCV_DIR \"D:\\\\opencv\\\\opencv346\\\\build\")  #2\nset(TRT_DIR \"D:\\\\TensorRT-7.0.0.11.Windows10.x86_64.cuda-10.2.cudnn7.6\\\\TensorRT-7.0.0.11\")  #3\n\nadd_definitions(-std=c++11)\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nset(THREADS_PREFER_PTHREAD_FLAG ON)\nfind_package(Threads)\n\n# setup CUDA\nfind_package(CUDA REQUIRED)\nmessage(STATUS \"    libraries: ${CUDA_LIBRARIES}\")\nmessage(STATUS \"    include path: ${CUDA_INCLUDE_DIRS}\")\n\ninclude_directories(${CUDA_INCLUDE_DIRS})\n\n####\nenable_language(CUDA)  # add this line, then no need to setup cuda path in vs\n####\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\ninclude_directories(${TRT_DIR}\\\\include)\n\n# -D_MWAITXINTRIN_H_INCLUDED for solving error: identifier \"__builtin_ia32_mwaitx\" is undefined\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -D_MWAITXINTRIN_H_INCLUDED\")\n\n# setup opencv\nfind_package(OpenCV QUIET\n    NO_MODULE\n    NO_DEFAULT_PATH\n    NO_CMAKE_PATH\n    NO_CMAKE_ENVIRONMENT_PATH\n    NO_SYSTEM_ENVIRONMENT_PATH\n    NO_CMAKE_PACKAGE_REGISTRY\n    NO_CMAKE_BUILDS_PATH\n    NO_CMAKE_SYSTEM_PATH\n    NO_CMAKE_SYSTEM_PACKAGE_REGISTRY\n)\n\nmessage(STATUS \"OpenCV library status:\")\nmessage(STATUS \"    version: ${OpenCV_VERSION}\")\nmessage(STATUS \"    libraries: 
${OpenCV_LIBS}\")\nmessage(STATUS \"    include path: ${OpenCV_INCLUDE_DIRS}\")\n\ninclude_directories(${OpenCV_INCLUDE_DIRS})\nlink_directories(${TRT_DIR}\\\\lib)\n\nadd_executable(yolov5 ${PROJECT_SOURCE_DIR}/yolov5.cpp ${PROJECT_SOURCE_DIR}/yololayer.cu ${PROJECT_SOURCE_DIR}/yololayer.h)   #4\n\ntarget_link_libraries(yolov5 \"nvinfer\" \"nvinfer_plugin\")   #5\ntarget_link_libraries(yolov5 ${OpenCV_LIBS})          #6\ntarget_link_libraries(yolov5 ${CUDA_LIBRARIES})   #7\ntarget_link_libraries(yolov5 Threads::Threads)       #8\n```\n\nNotice: 8 lines to adapt in CMakeLists.txt, marked with #1-#8\n\n- #1 project name, set according to your project name\n- #2 your opencv path\n- #3 your tensorrt path\n- #4 source file needed, including .cpp .cu .h\n- #5-#8 libs needed\n\n### 2. run cmake-gui to config the project\n\n#### 2.1 open cmake-gui and set the path\n\n![image-20200828124434245](https://user-images.githubusercontent.com/20653176/91524158-1dbfcd80-e931-11ea-8a82-518eaf391d5a.png)\n\n#### 2.2 click **Configure** and set the envs\n\n![image-20200828124902923](https://user-images.githubusercontent.com/20653176/91524303-75f6cf80-e931-11ea-8591-64a8a1a9292b.png)\n\n#### 2.3 click **Finish**, and wait for the `Configuring done`\n\n![image-20200828124951872](https://user-images.githubusercontent.com/20653176/91524340-8b6bf980-e931-11ea-9ea4-141f5b94aa0a.png)\n\n#### 2.4 click **Generate**\n\n![image-20200828125046738](https://user-images.githubusercontent.com/20653176/91524350-8eff8080-e931-11ea-9ed1-82c5af2f558f.png)\n\n#### 2.5 click **Open Project**\n\n![image-20200828125215067](https://user-images.githubusercontent.com/20653176/91524352-9030ad80-e931-11ea-877e-dc08bfaef731.png)\n\n#### 2.6 Click **Generate -> Generate solution**\n\n![image-20200828125402056](https://user-images.githubusercontent.com/20653176/91524356-9161da80-e931-11ea-84ba-177e12200e04.png)\n\n### 3. run in command line\n\ncd to the path of exe (e.g. 
E:\\LearningCodes\\GithubRepo\\tensorrtx\\yolov5\\build\\Debug)\n\n```\nyolov5.exe -s             // serialize model to plan file i.e. 'yolov5s.engine'\nyolov5.exe -d  ../samples // deserialize plan file and run inference, the images in samples will be processed.\n```\n\n**Notice**: while serializing the model, the .wts file should be placed in the parent directory of xxx.vcxproj; otherwise, modify the .wts path in yolov5.cpp.\n\n![image-20200828125938472](https://user-images.githubusercontent.com/20653176/91524358-93c43480-e931-11ea-81b6-ae01b92e1146.png)\n\n### 4. run in vs\n\nIn vs, first choose `Set As Startup Project`, then set `Project ==> Properties ==> Configuration Properties ==> Debugging ==> Command Arguments` to `-s` or `-d ../yolov3-spp/samples`. You can then run or debug.\n\n![image-20200828130117902](https://user-images.githubusercontent.com/20653176/91524360-94f56180-e931-11ea-9873-39bed7ee19f1.png)\n\n![image-20200828130415658](https://user-images.githubusercontent.com/20653176/91524362-96bf2500-e931-11ea-8c79-8db3a25fc135.png)\n\n![image-20200828131516231](https://user-images.githubusercontent.com/20653176/91524370-9a52ac00-e931-11ea-8c1a-acf828fe81b4.png)\n\n**Notice**: The .dll files of tensorrt and opencv should be placed in the same directory as the exe file. Alternatively, set environment variables in Windows (not recommended).\n"
  },
  {
    "path": "ufld/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(lane_det)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\n# cuda directory\ninclude_directories(/usr/local/cuda/include/)\nlink_directories(/usr/local/cuda/lib64/)\n\n# tensorrt\n#include_directories(/workspace/TensorRT-7.2.3.4/include/)\n#link_directories(/workspace/TensorRT-7.2.3.4/lib/)\n\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(lane_det ${PROJECT_SOURCE_DIR}/lane_det.cpp)\ntarget_link_libraries(lane_det nvinfer)\ntarget_link_libraries(lane_det cudart)\ntarget_link_libraries(lane_det ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "ufld/README.md",
    "content": "# Ultra-Fast-Lane-Detection(UFLD)\n\nThe Pytorch implementation is [Ultra-Fast-Lane-Detection](https://github.com/cfzd/Ultra-Fast-Lane-Detection).\n\n## How to Run\n```\n1. generate lane.wts and lane.onnx from pytorch with tusimple_18.pth\n\ngit clone https://github.com/wang-xinyu/tensorrtx.git\ngit clone https://github.com/cfzd/Ultra-Fast-Lane-Detection.git\n// download its weights 'tusimple_18.pth'\n// copy tensorrtx/ufld/gen_wts.py into Ultra-Fast-Lane-Detection/\n// ensure the file name is tusimple_18.pth and lane.wts in gen_wts.py\n// go to Ultra-Fast-Lane-Detection\npython gen_wts.py\n// a file 'lane.wts' will be generated.\n// then ( not necessary )\npython pth2onnx.py\n//a file 'lane.onnx' will be generated.\n\n2. build tensorrtx/ufld and run\n\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./lane_det -s          // serialize model to plan file i.e. 'lane.engine'\nsudo ./lane_det -d  PATH_TO_YOUR_IMAGE_FOLDER // deserialize plan file and run inference, the images will be processed.\n\n```\n\n## More Information\n1. Changed the preprocess and postprocess in tensorrtx, give a different way to convert NHWC to NCHW in preprocess and just show the result using opencv rather than saving the result in postprocess.\n2. If there are some bugs where you inference with multi batch_size, just modify the code in preprocess or postprocess, it's not complicated.\n3. Some results are stored in resluts folder.\n"
  },
  {
    "path": "ufld/common.hpp",
    "content": "#ifndef LANE_DET_COMMON_H_\n#define LANE_DET_COMMON_H_\n\n#include <iostream>\n#include <fstream>\n#include <map>\n#include <string>\n#include <sstream>\n#include <vector>\n#include <opencv2/opencv.hpp>\n#include \"dirent.h\"\n#include \"NvInfer.h\"\n#include <chrono>\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\nusing namespace nvinfer1;\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = 
(float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* convBnLeaky( INetworkDefinition *network, std::map<std::string, Weights>& weightMap,\n                     ITensor& input, int outch, int ksize, int s, int p, int g,\n                     std::string lname, int i, bool use_bn = false )\n{\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolution(input, outch, DimsHW{ ksize, ksize }, weightMap[lname + \".conv\"+ std::to_string(i) + \".weight\"], weightMap[lname + \".conv\" + std::to_string(i)+\".bias\"]);\n    assert(conv1);\n    conv1->setStride(DimsHW{s, s});\n    conv1->setPadding(DimsHW{p, p});\n    conv1->setNbGroups(g);\n    if (use_bn)\n    {\n        IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".batchnorm\"+std::to_string(i), 1e-5);\n        auto relu = network->addActivation(*bn1->getOutput(0), 
ActivationType::kRELU);\n        assert(relu);\n        return relu;\n    }\n    else\n    {\n        auto relu = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n        assert(relu);\n        return relu;\n    }\n}\n\nIActivationLayer* basicBlock(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int stride, std::string lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n\n    IConvolutionLayer* conv1 = network->addConvolution(input, outch, DimsHW{ 3, 3 }, weightMap[lname + \"conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStride(DimsHW{ stride, stride });\n    conv1->setPadding(DimsHW{ 1, 1 });\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"bn1\", 1e-5);\n\n    IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n\n    IConvolutionLayer* conv2 = network->addConvolution(*relu1->getOutput(0), outch, DimsHW{ 3, 3 }, weightMap[lname + \"conv2.weight\"], emptywts);\n    assert(conv2);\n    conv2->setPadding(DimsHW{ 1, 1 });\n\n    IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"bn2\", 1e-5);\n\n    IElementWiseLayer* ew1;\n    if (inch != outch) {\n        IConvolutionLayer* conv3 = network->addConvolution(input, outch, DimsHW{ 1, 1 }, weightMap[lname + \"downsample.0.weight\"], emptywts);\n        assert(conv3);\n        conv3->setStride(DimsHW{ stride, stride });\n        IScaleLayer* bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"downsample.1\", 1e-5);\n        ew1 = network->addElementWise(*bn3->getOutput(0), *bn2->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    else {\n        ew1 = network->addElementWise(input, *bn2->getOutput(0), ElementWiseOperation::kSUM);\n    }\n    IActivationLayer* relu2 = network->addActivation(*ew1->getOutput(0), ActivationType::kRELU);\n    
assert(relu2);\n    return relu2;\n}\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n                strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n    closedir(p_dir);\n    return 0;\n}\n\n#endif\n\n"
  },
  {
    "path": "ufld/gen_wts.py",
    "content": "import torch\nimport struct\n#import models.crnn as crnn\nfrom model.model import parsingNet\n\n# Initialize\nmodel = parsingNet(pretrained = False, backbone='18', cls_dim = (101, 56, 4), use_aux=False)\ndevice = 'cpu'\n# Load model\nstate_dict = torch.load('tusimple_18.pth', map_location='cpu')['model']\nmodel.to(device).eval()\n\nf = open('lane.wts', 'w')\nf.write('{}\\n'.format(len(state_dict.keys())))\nfor k, v in state_dict.items():\n    vr = v.reshape(-1).cpu().numpy()\n    f.write('{} {} '.format(k, len(vr)))\n    for vv in vr:\n        f.write(' ')\n        f.write(struct.pack('>f',float(vv)).hex())\n    f.write('\\n')\n"
  },
  {
    "path": "ufld/lane_det.cpp",
    "content": "#include <iostream>\n#include <chrono>\n#include <string>\n#include <sstream>\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include \"common.hpp\"\n\n#define USE_FP16  // comment out this if want to use FP32\n#define DEVICE 0  // GPU id\n#define BATCH_SIZE 1\nstatic const int INPUT_C = 3;\nstatic const int INPUT_H = 288;\nstatic const int INPUT_W = 800;\nstatic const int OUTPUT_C = 101;\nstatic const int OUTPUT_H = 56;\nstatic const int OUTPUT_W = 4;\nstatic const int OUTPUT_SIZE = OUTPUT_C * OUTPUT_H * OUTPUT_W;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nstatic Logger gLogger;\n\n// Creat the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder,IBuilderConfig* builderConfig, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{INPUT_C, INPUT_H, INPUT_W });\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../lane.wts\");\n#if 0\n    /* print layer names */\n    for(std::map<std::string, Weights>::iterator iter = weightMap.begin(); iter != weightMap.end() ; iter++)\n    {\n        std::cout << iter->first << std::endl;\n    }\n#endif\n    auto conv1 = network->addConvolution(*data, 64, DimsHW{ 7, 7 }, weightMap[\"model.conv1.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStride(DimsHW{2, 2});\n    conv1->setPadding(DimsHW{3, 3});\n    conv1->setNbGroups(1);\n\n    auto bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"model.bn1\", 1e-5);\n    auto relu0 = network->addActivation(*bn1->getOutput(0), ActivationType::kRELU);\n    IPoolingLayer* pool0 = network->addPooling(*relu0->getOutput(0), PoolingType::kMAX, DimsHW{ 3, 3 });\n    pool0->setStride( DimsHW{ 2, 2 } );\n    pool0->setPadding( DimsHW{ 1, 1 } );\n    
assert(pool0);\n\n    auto basic0 = basicBlock(network, weightMap, *pool0->getOutput(0), 64, 64, 1, \"model.layer1.0.\");\n    auto basic1 = basicBlock(network, weightMap, *basic0->getOutput(0), 64, 64, 1, \"model.layer1.1.\");\n    auto basic2_0 = basicBlock(network, weightMap, *basic1->getOutput(0), 64, 128, 2, \"model.layer2.0.\");\n\n    auto basic2_1 = basicBlock(network, weightMap, *basic2_0->getOutput(0), 128, 128, 1, \"model.layer2.1.\");\n\n    auto basic3_0 = basicBlock(network, weightMap, *basic2_1->getOutput(0), 128, 256, 2, \"model.layer3.0.\");\n\n    auto basic3_1 = basicBlock(network, weightMap, *basic3_0->getOutput(0), 256, 256, 1, \"model.layer3.1.\");\n\n    auto basic4_0 = basicBlock(network, weightMap, *basic3_1->getOutput(0), 256, 512, 2, \"model.layer4.0.\");\n\n    auto basic4_1 = basicBlock(network, weightMap, *basic4_0->getOutput(0), 512, 512, 1, \"model.layer4.1.\");\n\n#if 0\n    /* just for debug */\n    Dims dims1 = basic4_1->getOutput(0)->getDimensions();\n    for (int i = 0; i < dims1.nbDims; i++)\n    {\n        std::cout << dims1.d[i] << \"-\" << (int)dims1.type[i] << \"   \";\n    }\n    std::cout << std::endl;\n#endif\n\n    auto conv2 = network->addConvolution(*basic4_1->getOutput(0), 8, DimsHW{ 1, 1 }, weightMap[\"pool.weight\"], weightMap[\"pool.bias\"]);\n    assert(conv2);\n    conv2->setStride(DimsHW{1, 1});\n    conv2->setPadding(DimsHW{0, 0});\n    conv2->setNbGroups(1);\n\n    IShuffleLayer* permute0 = network->addShuffle(*conv2->getOutput(0));\n    assert(permute0);\n    permute0->setReshapeDimensions( Dims2{1, 1800});\n\n    auto fcwts0 = network->addConstant(nvinfer1::Dims2(2048, 1800), weightMap[\"cls.0.weight\"]);\n    auto matrixMultLayer0 = network->addMatrixMultiply(*permute0->getOutput(0), MatrixOperation::kNONE, *fcwts0->getOutput(0), MatrixOperation::kTRANSPOSE);\n\n    assert(matrixMultLayer0 != nullptr);\n    // Add elementwise layer for adding bias\n    auto fcbias0 = network->addConstant(nvinfer1::Dims2(1, 
2048), weightMap[\"cls.0.bias\"]);\n\n    auto addBiasLayer0 = network->addElementWise(*matrixMultLayer0->getOutput(0), *fcbias0->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    assert(addBiasLayer0 != nullptr);\n\n    auto relu = network->addActivation(*addBiasLayer0->getOutput(0), ActivationType::kRELU);\n\n    auto fcwts1 = network->addConstant(nvinfer1::Dims2(22624, 2048), weightMap[\"cls.2.weight\"]);\n    auto matrixMultLayer1 = network->addMatrixMultiply(*relu->getOutput(0), MatrixOperation::kNONE, *fcwts1->getOutput(0), MatrixOperation::kTRANSPOSE);\n\n    assert(matrixMultLayer1 != nullptr);\n    // Add elementwise layer for adding bias\n    auto fcbias1 = network->addConstant(nvinfer1::Dims2(1, 22624), weightMap[\"cls.2.bias\"]);\n\n    auto addBiasLayer1 = network->addElementWise(*matrixMultLayer1->getOutput(0), *fcbias1->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    assert(addBiasLayer1 != nullptr);\n\n    IShuffleLayer* permute1 = network->addShuffle(*addBiasLayer1->getOutput(0));\n    assert(permute1);\n    permute1->setReshapeDimensions( Dims3{ 101, 56, 4 });\n\n    permute1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*permute1->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    builderConfig->setMaxWorkspaceSize(16 * (1 << 20));// 16MB\n\n#ifdef USE_FP16\n    if(builder->platformHasFastFp16()) {\n        std::cout << \"Platform supports fp16 mode and use it !!!\" << std::endl;\n        builderConfig->setFlag(BuilderFlag::kFP16);\n    } else {\n        std::cout << \"Platform doesn't support fp16 mode so you can't use it !!!\" << std::endl;\n    }\n#endif\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *builderConfig);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host 
memory\n    for (auto& mem : weightMap)\n    {\n        free((void*)(mem.second.values));\n    }\n\n   return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* builderConfig = builder->createBuilderConfig();\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, builderConfig, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_C * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_C * INPUT_H * INPUT_W * sizeof(float),\n          cudaMemcpyHostToDevice, stream));\n    
context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float),\n          cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nstd::vector<float> prepareImage(cv::Mat & img)\n{\n    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    cv::Mat resized;\n    cv::resize(img, resized, cv::Size(INPUT_W, INPUT_H));\n\n    cv::Mat img_float;\n\n    resized.convertTo(img_float, CV_32FC3, 1. / 255.);\n\n    // HWC TO CHW\n    std::vector<cv::Mat> input_channels(INPUT_C);\n    cv::split(img_float, input_channels);\n\n    // normalize\n    std::vector<float> result(INPUT_H * INPUT_W * INPUT_C);\n    auto data = result.data();\n    int channelLength = INPUT_H * INPUT_W;\n    static float mean[]= {0.485, 0.456, 0.406};\n    static float std[] = {0.229, 0.224, 0.225};\n    for (int i = 0; i < INPUT_C; ++i) {\n        cv::Mat normed_channel = (input_channels[i] - mean[i]) / std[i];\n        memcpy(data, normed_channel.data, channelLength * sizeof(float));\n        data += channelLength;\n    }\n\n    return result;\n}\n\n/* (101,56,4), add softmax on 101_axis and calculate Expect */\nvoid softmax_mul(float* x, float* y, int rows, int cols, int chan)\n{\n    for(int i = 0, wh = rows * cols; i < rows; i++)\n    {\n        for(int j = 0; j < cols; j++)\n        {\n            float sum = 0.0;\n            float expect = 0.0;\n            for(int k = 0; k < chan - 1; k++)\n            {\n                x[k * wh + i * cols + j] = exp(x[k * wh + i * cols + j]);\n                sum += x[k * wh + i * cols + j];\n            }\n            for(int k = 0; k < chan - 1; k++)\n            {\n                x[k * wh + i * cols + j] /= sum;\n            }\n            for(int k = 0; k < chan - 1; k++)\n            {\n                
x[k * wh + i * cols + j] = x[k * wh + i * cols + j] * (k + 1);\n                expect += x[k * wh + i * cols + j];\n            }\n            y[i * cols + j] = expect;\n        }\n    }\n}\n/* (101,56,4), calculate max index on 101_axis */\nvoid argmax(float* x, float* y, int rows, int cols, int chan)\n{\n    for(int i = 0,wh = rows * cols; i < rows; i++)\n    {\n        for(int j = 0; j < cols; j++)\n        {\n            // keep the running maximum as a float; an int would truncate the scores\n            float max = -10000000.0f;\n            int max_ind = -1;\n            for(int k = 0; k < chan; k++)\n            {\n                if(x[k * wh + i * cols + j] > max)\n                {\n                    max = x[k * wh + i * cols + j];\n                    max_ind = k;\n                }\n            }\n            y[i * cols + j] = max_ind;\n        }\n    }\n}\n\nint main(int argc, char** argv)\n{\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{ nullptr };\n    size_t size{ 0 };\n\n    if (argc == 2 && std::string(argv[1]) == \"-s\")\n    {\n            IHostMemory* modelStream{ nullptr };\n            APIToModel(BATCH_SIZE, &modelStream);\n            assert(modelStream != nullptr);\n            std::ofstream p(\"lane_det.engine\", std::ios::binary);\n            if (!p) {\n                    std::cerr << \"could not open plan output file\" << std::endl;\n                    return -1;\n            }\n            p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n            modelStream->destroy();\n            return 0;\n    }\n    else if (argc == 3 && std::string(argv[1]) == \"-d\")\n    {\n            std::ifstream file(\"lane_det.engine\", std::ios::binary);\n            if (file.good()) {\n                    file.seekg(0, file.end);\n                    size = file.tellg();\n                    file.seekg(0, file.beg);\n                    trtModelStream = new char[size];\n                    assert(trtModelStream);\n       
             file.read(trtModelStream, size);\n                    file.close();\n            } else {\n                    std::cerr << \"could not open lane_det.engine, run with -s first\" << std::endl;\n                    return -1;\n            }\n    }\n    else\n    {\n            std::cerr << \"arguments not right!\" << std::endl;\n            std::cerr << argv[0] << \" -s  // serialize model to plan file\" << std::endl;\n            std::cerr << argv[0] << \" -d ../samples  // deserialize plan file and run inference\" << std::endl;\n            return -1;\n    }\n\n    /* prepare input data */\n    static float data[BATCH_SIZE * INPUT_C * INPUT_H * INPUT_W];\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(argv[2], file_names) < 0) {\n            std::cout << \"read_files_in_dir failed.\" << std::endl;\n            return -1;\n    }\n\n    int fcount = 0;\n    int vis_h = 720;\n    int vis_w = 1280;\n    int col_sample_w = 8;\n    for (int f = 0; f < (int)file_names.size(); f++)\n    {\n        cv::Mat vis;\n        fcount++;\n        if (fcount < BATCH_SIZE && f + 1 != (int)file_names.size()) continue;\n        for (int b = 0; b < fcount; b++)\n        {\n            cv::Mat img = cv::imread(std::string(argv[2]) + \"/\" + file_names[f - fcount + 1 + b], 1);\n            if (img.empty()) continue;\n            cv::resize(img, vis, cv::Size(vis_w, vis_h));\n            std::vector<float> result(INPUT_C * INPUT_W * INPUT_H);\n            result = prepareImage(img);\n            // copy into this image's slot of the batch buffer\n            memcpy(data + b * INPUT_C * INPUT_W * INPUT_H, &result[0], INPUT_C * INPUT_W * INPUT_H * sizeof(float));\n        }\n        fcount = 0;  // the batch is assembled; start counting the next one\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, BATCH_SIZE); //prob: size (101, 56, 4)\n    
    auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time is \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \" ms\" << std::endl;\n\n        std::vector<int> tusimple_row_anchor\n            { 64,  68,  72,  76,  80,  84,  88,  92,  96,  100, 104, 108, 112,\n              116, 120, 124, 128, 132, 136, 140, 144, 148, 152, 156, 160, 164,\n              168, 172, 176, 180, 184, 188, 192, 196, 200, 204, 208, 212, 216,\n              220, 224, 228, 232, 236, 240, 244, 248, 252, 256, 260, 264, 268,\n              272, 276, 280, 284 };\n\n        float max_ind[BATCH_SIZE * OUTPUT_H * OUTPUT_W];\n        float prob_reverse[BATCH_SIZE * OUTPUT_SIZE];\n        /* do out_j = out_j[:, ::-1, :] in python list*/\n        float expect[BATCH_SIZE * OUTPUT_H * OUTPUT_W];\n        for (int k = 0, wh = OUTPUT_W * OUTPUT_H; k < OUTPUT_C; k++)\n        {\n            for(int j = 0; j < OUTPUT_H; j ++)\n            {\n                for(int l = 0; l < OUTPUT_W; l++)\n                {\n                    prob_reverse[k * wh + (OUTPUT_H - 1 - j) * OUTPUT_W + l] =\n                        prob[k * wh + j * OUTPUT_W + l];\n                }\n            }\n        }\n\n        argmax(prob_reverse, max_ind, OUTPUT_H, OUTPUT_W, OUTPUT_C);\n        /* calculate softmax and Expect */\n        softmax_mul(prob_reverse, expect, OUTPUT_H, OUTPUT_W, OUTPUT_C);\n        for(int k = 0; k < OUTPUT_H; k++) {\n            for(int j = 0; j < OUTPUT_W; j++) {\n                max_ind[k * OUTPUT_W + j] == 100 ? 
expect[k * OUTPUT_W + j] = 0 :\n                    expect[k * OUTPUT_W + j] = expect[k * OUTPUT_W + j];\n            }\n        }\n        std::vector<int> i_ind;\n        for(int k = 0; k < OUTPUT_W; k++) {\n            int ii = 0;\n            for(int g = 0; g < OUTPUT_H; g++) {\n                if(expect[g * OUTPUT_W + k] != 0)\n                    ii++;\n            }\n            if(ii > 2) {\n                i_ind.push_back(k);\n            }\n        }\n        for(int k = 0; k < OUTPUT_H; k++) {\n            for(int ll = 0; ll < i_ind.size(); ll++) {\n                if(expect[OUTPUT_W * k + i_ind[ll]] > 0) {\n                    cv::Point pp =\n                        { int(expect[OUTPUT_W * k + i_ind[ll]] * col_sample_w * vis_w / INPUT_W) - 1,\n                          int( vis_h * tusimple_row_anchor[OUTPUT_H - 1 - k] / INPUT_H) - 1 };\n                    cv::circle(vis, pp, 8, CV_RGB(0, 255 ,0), 2);\n                }\n            }\n        }\n        cv::imshow(\"lane_vis\",vis);\n        cv::waitKey(0);\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "ufld/logging.h",
    "content": "#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + 
tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! 
\\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! 
- Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! 
we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override \n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!       
           For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   
TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! \\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! \\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! 
\\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "ufld/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "ufld/pth2onnx.py",
    "content": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.optim as optim\nfrom torchvision import datasets, transforms\nimport torch.onnx as torch_onnx\nfrom model.model import parsingNet\n\nMODELPATH = \"tusimple_18.pth\"\n\nnet = parsingNet(pretrained = False, backbone='18', cls_dim = (101, 56, 4), use_aux=False).cuda()\n\nstate_dict = torch.load(MODELPATH, map_location='cpu')['model']\n\nnet.train(False)\n\nx = torch.randn(1, 3, 288, 800).cuda()\n\ntorch_onnx.export(net, x, \"lane.onnx\", verbose=True, input_names=[\"input\"], output_names=[\"output\"],opset_version=11)\n"
  },
  {
    "path": "unet/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(unet)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\n# cuda directory\ninclude_directories(/usr/local/cuda/include/)\nlink_directories(/usr/local/cuda/lib64/)\n\n# tensorrt\ninclude_directories(/workspace/TensorRT-7.2.3.4/include/)\nlink_directories(/workspace/TensorRT-7.2.3.4/lib/)\n\n# opencv library\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\n# link library and add exec file\nadd_executable(unet ${PROJECT_SOURCE_DIR}/unet.cpp)\ntarget_link_libraries(unet nvinfer)\ntarget_link_libraries(unet cudart)\ntarget_link_libraries(unet ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "unet/README.md",
    "content": "# UNet\n\nPytorch model from [Pytorch-UNet](https://github.com/milesial/Pytorch-UNet).\n\n## Contributors\n\n<a href=\"https://github.com/YuzhouPeng\"><img src=\"https://avatars.githubusercontent.com/u/13601004?v=4?s=48\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/East-Face\"><img src=\"https://avatars.githubusercontent.com/u/35283869?v=4s=48\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/irvingzhang0512\"><img src=\"https://avatars.githubusercontent.com/u/22089207?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/wang-xinyu\"><img src=\"https://avatars.githubusercontent.com/u/15235574?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/nengwp\"><img src=\"https://avatars.githubusercontent.com/u/44516353?s=96&v=4\" width=\"40px;\" alt=\"\"/></a>\n\n\n## Requirements\n\nNow TensorRT 8.x is supported and you can use it.\nThe key cause of the previous bug is the pooling layer Stride setting problem.\n\n## Build and Run\n\n1. Generate .wts\n```\ncp {path-of-tensorrtx}/unet/gen_wts.py Pytorch-UNet/\ncd Pytorch-UNet/\nwget https://github.com/milesial/Pytorch-UNet/releases/download/v3.0/unet_carvana_scale0.5_epoch2.pth\npython gen_wts.py unet_carvana_scale0.5_epoch2.pth\n```\n\n2. Generate TensorRT engine\n```\ncd tensorrtx/unet/\nmkdir build\ncd build\ncmake ..\nmake\ncp {path-of-Pytorch-UNet}/unet.wts .\n./unet -s\n```\n\n3. Run inference\n```\nwget https://raw.githubusercontent.com/wang-xinyu/tensorrtx/f60dcc7bec28846cd973fc95ac829c4e57a11395/unet/samples/0cdf5b5d0ce1_01.jpg\n./unet -d 0cdf5b5d0ce1_01.jpg\n```\n\n4. 
Check result.jpg\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/207358769-dacf908e-f65d-4b2e-bc53-4fa2a9114c2a.jpg\" height=\"360px;\">\n</p>\n\n## Benchmark\n\n| | PyTorch | TensorRT FP32 | TensorRT FP16 |\n| ---- | ---- | ---- | ---- |\n| Input size | 816x672 | 816x672 | 816x672 |\n| Inference time | 58ms | 43ms (batchsize 8) | 14ms (batchsize 8) |\n\n## More Information\n\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx).\n\n"
  },
  {
    "path": "unet/common.hpp",
    "content": "#ifndef UNET_COMMON_H_\n#define UNET_COMMON_H_\n\n#include <fstream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\nusing namespace nvinfer1;\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + 
\".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\n#endif\n\n"
  },
  {
    "path": "unet/gen_wts.py",
    "content": "import torch\nimport sys\nimport struct\n\ndef main():\n  device = torch.device('cpu')\n  state_dict = torch.load(sys.argv[1], map_location=device)\n\n  f = open(\"unet.wts\", 'w')\n  f.write(\"{}\\n\".format(len(state_dict.keys())))\n  for k, v in state_dict.items():\n    print('key: ', k)\n    print('value: ', v.shape)\n    vr = v.reshape(-1).cpu().numpy()\n    f.write(\"{} {}\".format(k, len(vr)))\n    for vv in vr:\n      f.write(\" \")\n      f.write(struct.pack(\">f\", float(vv)).hex())\n    f.write(\"\\n\")\n  f.close()\n\nif __name__ == '__main__':\n  main()\n\n"
  },
  {
    "path": "unet/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the 
stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override \n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "unet/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "unet/unet.cpp",
    "content": "#include <iostream>\n#include <chrono>\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include \"common.hpp\"\n\n#define DEVICE 0\n#define USE_FP32  // USE_FP32 or USE_FP16\n#define CONF_THRESH 0.5\n#define BATCH_SIZE 1\n#define cls 2\n#define BILINEAR false\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 640;\nstatic const int INPUT_W = 959;\nstatic const int OUTPUT_SIZE = INPUT_H * INPUT_W * cls;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nstatic Logger gLogger;\n\nusing namespace nvinfer1;\n\nILayer* doubleConv(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, std::string lname, int midch) {\n  Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n  IConvolutionLayer* conv1 = network->addConvolutionNd(input, midch, DimsHW{ ksize, ksize }, weightMap[lname + \".double_conv.0.weight\"], emptywts);\n  conv1->setStrideNd(DimsHW{ 1, 1 });\n  conv1->setPaddingNd(DimsHW{ 1, 1 });\n  conv1->setNbGroups(1);\n  IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".double_conv.1\", 0);\n  IActivationLayer* relu1 = network->addActivation(*bn1->getOutput(0), ActivationType::kLEAKY_RELU);\n  IConvolutionLayer* conv2 = network->addConvolutionNd(*relu1->getOutput(0), outch, DimsHW{ 3, 3 }, weightMap[lname + \".double_conv.3.weight\"], emptywts);\n  conv2->setStrideNd(DimsHW{ 1, 1 });\n  conv2->setPaddingNd(DimsHW{ 1, 1 });\n  conv2->setNbGroups(1);\n  IScaleLayer* bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".double_conv.4\", 0);\n  IActivationLayer* relu2 = network->addActivation(*bn2->getOutput(0), ActivationType::kLEAKY_RELU);\n  assert(relu2);\n  return relu2;\n}\n\nILayer* down(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int p, std::string lname) {\n  IPoolingLayer* pool1 = 
network->addPoolingNd(input, PoolingType::kMAX, DimsHW{ 2, 2 });\n  pool1->setStrideNd(DimsHW{ 2, 2 });\n  assert(pool1);\n  ILayer* dcov1 = doubleConv(network, weightMap, *pool1->getOutput(0), outch, 3, lname + \".maxpool_conv.1\", outch);\n  assert(dcov1);\n  return dcov1;\n}\n\nILayer* up(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input1, ITensor& input2, int resize, int outch, int midch, std::string lname) {\n  if (BILINEAR) {\n    // add upsample bilinear\n    IResizeLayer* deconv1 = network->addResize(input1);\n    auto outdims = input2.getDimensions();\n    deconv1->setOutputDimensions(outdims);\n    deconv1->setResizeMode(ResizeMode::kLINEAR);\n    deconv1->setAlignCorners(true);\n\n    int diffx = input2.getDimensions().d[1] - deconv1->getOutput(0)->getDimensions().d[1];\n    int diffy = input2.getDimensions().d[2] - deconv1->getOutput(0)->getDimensions().d[2];\n\n    ILayer* pad1 = network->addPaddingNd(*deconv1->getOutput(0), DimsHW{ diffx / 2, diffy / 2 }, DimsHW{ diffx - (diffx / 2), diffy - (diffy / 2) });\n    // dcov1->setPaddingNd(DimsHW{diffx / 2, diffx - diffx / 2},DimsHW{diffy / 2, diffy - diffy / 2});\n    ITensor* inputTensors[] = { &input2,pad1->getOutput(0) };\n    auto cat = network->addConcatenation(inputTensors, 2);\n    assert(cat);\n    if (midch == 64) {\n      ILayer* dcov1 = doubleConv(network, weightMap, *cat->getOutput(0), outch, 3, lname + \".conv\", outch);\n      assert(dcov1);\n      return dcov1;\n    } else {\n      int midch1 = outch / 2;\n      ILayer* dcov1 = doubleConv(network, weightMap, *cat->getOutput(0), midch1, 3, lname + \".conv\", outch);\n      assert(dcov1);\n      return dcov1;\n    }\n  } else {\n    IDeconvolutionLayer* deconv1 = network->addDeconvolutionNd(input1, resize, DimsHW{ 2, 2 }, weightMap[lname + \".up.weight\"], weightMap[lname + \".up.bias\"]);\n    deconv1->setStrideNd(DimsHW{ 2, 2 });\n    deconv1->setNbGroups(1);\n\n    int diffx = 
input2.getDimensions().d[1] - deconv1->getOutput(0)->getDimensions().d[1];\n    int diffy = input2.getDimensions().d[2] - deconv1->getOutput(0)->getDimensions().d[2];\n\n    ILayer* pad1 = network->addPaddingNd(*deconv1->getOutput(0), DimsHW{ diffx / 2, diffy / 2 }, DimsHW{ diffx - (diffx / 2), diffy - (diffy / 2) });\n    // dcov1->setPaddingNd(DimsHW{diffx / 2, diffx - diffx / 2},DimsHW{diffy / 2, diffy - diffy / 2});\n    ITensor* inputTensors[] = { &input2,pad1->getOutput(0) };\n    auto cat = network->addConcatenation(inputTensors, 2);\n    assert(cat);\n    ILayer* dcov1 = doubleConv(network, weightMap, *cat->getOutput(0), midch, 3, lname + \".conv\", outch);\n    assert(dcov1);\n    return dcov1;\n  }\n}\n\nILayer* outConv(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, std::string lname) {\n  // Weights emptywts{DataType::kFLOAT, nullptr, 0};\n  IConvolutionLayer* conv1 = network->addConvolutionNd(input, cls, DimsHW{ 1, 1 }, weightMap[lname + \".conv.weight\"], weightMap[lname + \".conv.bias\"]);\n  assert(conv1);\n  conv1->setStrideNd(DimsHW{ 1, 1 });\n  conv1->setPaddingNd(DimsHW{ 0, 0 });\n  conv1->setNbGroups(1);\n  return conv1;\n}\n\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, std::string wts_path) {\n  INetworkDefinition* network = builder->createNetworkV2(0U);\n\n  // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\n  ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{ 3, INPUT_H, INPUT_W });\n  assert(data);\n\n  std::map<std::string, Weights> weightMap = loadWeights(wts_path);\n  Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n\n  // build network\n  auto x1 = doubleConv(network, weightMap, *data, 64, 3, \"inc\", 64);\n  auto x2 = down(network, weightMap, *x1->getOutput(0), 128, 1, \"down1\");\n  auto x3 = down(network, weightMap, *x2->getOutput(0), 256, 1, \"down2\");\n  auto x4 = 
down(network, weightMap, *x3->getOutput(0), 512, 1, \"down3\");\n  auto channel = 512;\n  if (!BILINEAR) {\n    channel = 1024;\n  }\n  auto x5 = down(network, weightMap, *x4->getOutput(0), channel, 1, \"down4\");\n  ILayer* x6 = up(network, weightMap, *x5->getOutput(0), *x4->getOutput(0), 512, 512, 512, \"up1\");\n  ILayer* x7 = up(network, weightMap, *x6->getOutput(0), *x3->getOutput(0), 256, 256, 256, \"up2\");\n  ILayer* x8 = up(network, weightMap, *x7->getOutput(0), *x2->getOutput(0), 128, 128, 128, \"up3\");\n  ILayer* x9 = up(network, weightMap, *x8->getOutput(0), *x1->getOutput(0), 64, 64, 64, \"up4\");\n  ILayer* x10 = outConv(network, weightMap, *x9->getOutput(0), OUTPUT_SIZE, \"outc\");\n\n  x10->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n  network->markOutput(*x10->getOutput(0));\n\n  // Build engine\n  builder->setMaxBatchSize(maxBatchSize);\n  config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#ifdef USE_FP16\n  config->setFlag(BuilderFlag::kFP16);\n#endif\n  std::cout << \"Building engine, please wait for a while...\" << std::endl;\n  ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n  std::cout << \"Build engine successfully!\" << std::endl;\n\n  // Don't need the network any more\n  network->destroy();\n\n  // Release host memory\n  for (auto& mem : weightMap) {\n    free((void*)(mem.second.values));\n  }\n\n  return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** model_stream, std::string wts_path) {\n  // Create builder\n  IBuilder* builder = createInferBuilder(gLogger);\n  IBuilderConfig* config = builder->createBuilderConfig();\n\n  // Create model to populate the network, then set the outputs and create an engine\n  ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT, wts_path);\n  assert(engine != nullptr);\n\n  // Serialize the engine\n  (*model_stream) = engine->serialize();\n\n  // Close everything down\n  engine->destroy();\n  builder->destroy();\n}\n\nvoid 
doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n  const ICudaEngine& engine = context.getEngine();\n\n  // Pointers to input and output device buffers to pass to engine.\n  // Engine requires exactly IEngine::getNbBindings() number of buffers.\n  assert(engine.getNbBindings() == 2);\n  void* buffers[2];\n\n  // In order to bind the buffers, we need to know the names of the input and output tensors.\n  // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n  const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n  const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n  // Create GPU buffers on device\n  CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n  CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n  // Create stream\n  cudaStream_t stream;\n  CHECK(cudaStreamCreate(&stream));\n\n  // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n  CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n  context.enqueue(batchSize, buffers, stream, nullptr);\n  CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n\n  cudaStreamSynchronize(stream);\n\n  // Release stream and buffers\n  cudaStreamDestroy(stream);\n  CHECK(cudaFree(buffers[inputIndex]));\n  CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv) {\n  cudaSetDevice(DEVICE);\n\n  char* trt_model_stream = nullptr;\n  size_t size = 0;\n  std::string engine_name = \"unet.engine\";\n  std::string wts_path = \"unet.wts\";\n\n  if (argc == 2 && std::string(argv[1]) == \"-s\") {\n    // Create a TensorRT model and serialize it to a file\n    IHostMemory* model_stream{ nullptr };\n    APIToModel(BATCH_SIZE, &model_stream, wts_path);\n    
assert(model_stream != nullptr);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n      std::cerr << \"could not open plan output file\" << std::endl;\n      return -1;\n    }\n    p.write(reinterpret_cast<const char*>(model_stream->data()), model_stream->size());\n    model_stream->destroy();\n    return 0;\n  } else if (argc == 3 && std::string(argv[1]) == \"-d\") {\n    // Load engine file\n    std::ifstream file(engine_name, std::ios::binary);\n    if (file.good()) {\n      file.seekg(0, file.end);\n      size = file.tellg();\n      file.seekg(0, file.beg);\n      trt_model_stream = new char[size];\n      assert(trt_model_stream);\n      file.read(trt_model_stream, size);\n      file.close();\n    }\n  } else {\n    std::cerr << \"arguments not right!\" << std::endl;\n    std::cerr << \"./unet -s  // serialize model to plan file\" << std::endl;\n    std::cerr << \"./unet -d ../samples  // deserialize plan file and run inference\" << std::endl;\n    return -1;\n  }\n\n  // Prepare input output data\n  static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\n  static float prob[BATCH_SIZE * OUTPUT_SIZE];\n\n  // Deserialize engine\n  IRuntime* runtime = createInferRuntime(gLogger);\n  assert(runtime != nullptr);\n  ICudaEngine* engine = runtime->deserializeCudaEngine(trt_model_stream, size);\n  assert(engine != nullptr);\n  IExecutionContext* context = engine->createExecutionContext();\n  assert(context != nullptr);\n  delete[] trt_model_stream;\n\n  cv::Mat img = cv::imread(argv[2]);\n\n  // Preprocess\n  cv::resize(img, img, cv::Size(INPUT_W, INPUT_H));\n  for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n    data[i] = (img.at<cv::Vec3b>(i)[2]) / 255.0;\n    data[i + INPUT_H * INPUT_W] = (img.at<cv::Vec3b>(i)[1]) / 255.0;\n    data[i + 2 * INPUT_H * INPUT_W] = (img.at<cv::Vec3b>(i)[0]) / 255.0;\n  }\n\n  // Run inference\n  auto start = std::chrono::system_clock::now();\n  doInference(*context, data, prob, BATCH_SIZE);\n  auto end = 
std::chrono::system_clock::now();\n  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n  // Postprocess\n  cv::Mat result = cv::Mat::zeros(INPUT_H, INPUT_W, CV_8UC3);\n  for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n    float fmax = prob[i];\n    int index = 0;\n    for (int j = 1; j < cls; j++) {\n      if (prob[i + j * INPUT_H * INPUT_W] > fmax) {\n        index = j;\n        fmax = prob[i + j * INPUT_H * INPUT_W];\n      }\n    }\n\n    if (index == 1) {\n      result.at<cv::Vec3b>(i) = cv::Vec3b(255, 255, 255);\n    }\n  }\n\n  cv::imwrite(\"result.jpg\", result);\n\n  // Destroy the engine\n  context->destroy();\n  engine->destroy();\n  runtime->destroy();\n\n  return 0;\n}\n"
  },
  {
    "path": "vgg/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(vgg)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nadd_executable(vgg ${PROJECT_SOURCE_DIR}/vgg11.cpp)\ntarget_link_libraries(vgg nvinfer)\ntarget_link_libraries(vgg cudart)\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "vgg/README.md",
    "content": "# vgg\n\nVGG 11-layer model (configuration \"A\") from\n    \"Very Deep Convolutional Networks For Large-Scale Image Recognition\" <https://arxiv.org/pdf/1409.1556.pdf>\n\nFor the Pytorch implementation, you can refer to [pytorchx/vgg](https://github.com/wang-xinyu/pytorchx/tree/master/vgg)\n\nVGG's architecture is simple, just some conv, relu, maxpool, and fc layers.\n\n```\n// 1. generate vgg.wts from [pytorchx/vgg](https://github.com/wang-xinyu/pytorchx/tree/master/vgg)\n\n// 2. put vgg.wts into tensorrtx/vgg\n\n// 3. build and run\n\ncd tensorrtx/vgg\n\nmkdir build\n\ncd build\n\ncmake ..\n\nmake\n\nsudo ./vgg -s   // serialize model to plan file i.e. 'vgg.engine'\nsudo ./vgg -d   // deserialize plan file and run inference\n\n// 4. see if the output is same as pytorchx/vgg\n```\n\n\n"
  },
  {
    "path": "vgg/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "vgg/vgg11.cpp",
    "content": "#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include \"logging.h\"\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = 224;\nstatic const int INPUT_W = 224;\nstatic const int OUTPUT_SIZE = 1000;\n\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\n\n// Load weights from files shared with TensorRT samples.\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file)\n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\n// Creat the engine using only 
the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)\n{\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape { 3, INPUT_H, INPUT_W } with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../vgg.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 64, DimsHW{3, 3}, weightMap[\"features.0.weight\"], weightMap[\"features.0.bias\"]);\n    assert(conv1);\n    conv1->setPaddingNd(DimsHW{1, 1});\n    IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    assert(relu1);\n    IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    assert(pool1);\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    conv1 = network->addConvolutionNd(*pool1->getOutput(0), 128, DimsHW{3, 3}, weightMap[\"features.3.weight\"], weightMap[\"features.3.bias\"]);\n    conv1->setPaddingNd(DimsHW{1, 1});\n    relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    conv1 = network->addConvolutionNd(*pool1->getOutput(0), 256, DimsHW{3, 3}, weightMap[\"features.6.weight\"], weightMap[\"features.6.bias\"]);\n    conv1->setPaddingNd(DimsHW{1, 1});\n    relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    conv1 = network->addConvolutionNd(*relu1->getOutput(0), 256, DimsHW{3, 3}, weightMap[\"features.8.weight\"], weightMap[\"features.8.bias\"]);\n    conv1->setPaddingNd(DimsHW{1, 1});\n    relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    pool1 = 
network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    conv1 = network->addConvolutionNd(*pool1->getOutput(0), 512, DimsHW{3, 3}, weightMap[\"features.11.weight\"], weightMap[\"features.11.bias\"]);\n    conv1->setPaddingNd(DimsHW{1, 1});\n    relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    conv1 = network->addConvolutionNd(*relu1->getOutput(0), 512, DimsHW{3, 3}, weightMap[\"features.13.weight\"], weightMap[\"features.13.bias\"]);\n    conv1->setPaddingNd(DimsHW{1, 1});\n    relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    conv1 = network->addConvolutionNd(*pool1->getOutput(0), 512, DimsHW{3, 3}, weightMap[\"features.16.weight\"], weightMap[\"features.16.bias\"]);\n    conv1->setPaddingNd(DimsHW{1, 1});\n    relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    conv1 = network->addConvolutionNd(*relu1->getOutput(0), 512, DimsHW{3, 3}, weightMap[\"features.18.weight\"], weightMap[\"features.18.bias\"]);\n    conv1->setPaddingNd(DimsHW{1, 1});\n    relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);\n    pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    pool1->setStrideNd(DimsHW{2, 2});\n\n    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool1->getOutput(0), 4096, weightMap[\"classifier.0.weight\"], weightMap[\"classifier.0.bias\"]);\n    assert(fc1);\n    relu1 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);\n    fc1 = network->addFullyConnected(*relu1->getOutput(0), 4096, weightMap[\"classifier.3.weight\"], weightMap[\"classifier.3.bias\"]);\n    relu1 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);\n    fc1 = 
network->addFullyConnected(*relu1->getOutput(0), 1000, weightMap[\"classifier.6.weight\"], weightMap[\"classifier.6.bias\"]);\n\n    fc1->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    std::cout << \"set name out\" << std::endl;\n    network->markOutput(*fc1->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(1 << 20);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"build out\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)\n{\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize)\n{\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = 
engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    // Wait for the asynchronous copies and inference to finish before reading the output\n    CHECK(cudaStreamSynchronize(stream));\n\n    // Release stream and buffers\n    CHECK(cudaStreamDestroy(stream));\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./vgg -s   // serialize model to plan file\" << std::endl;\n        std::cerr << \"./vgg -d   // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n\n        std::ofstream p(\"vgg.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    } else if (std::string(argv[1]) == \"-d\") {\n 
       std::ifstream file(\"vgg.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        } else {\n            // Without the plan file there is nothing to deserialize\n            std::cerr << \"could not open vgg.engine, run ./vgg -s first\" << std::endl;\n            return -1;\n        }\n    } else {\n        return -1;\n    }\n\n    // Fill the input with a constant value as dummy data\n    static float data[3 * INPUT_H * INPUT_W];\n    for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n        data[i] = 1;\n\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    // Run inference 10 times and print the wall-clock time of each run\n    static float prob[OUTPUT_SIZE];\n    for (int i = 0; i < 10; i++) {\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n\n    // Print the raw output values\n    std::cout << \"\\nOutput:\\n\\n\";\n    for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\n    {\n        std::cout << prob[i] << \", \";\n        if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    }\n    std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "vit/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nproject(\n  vit\n  VERSION 0.1\n  LANGUAGES C CXX CUDA)\n\nif(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)\n  set(CMAKE_CUDA_ARCHITECTURES 80 86 89 90 100 120)\nendif()\n\nset(CMAKE_CXX_STANDARD 20)\nset(CMAKE_CXX_STANDARD_REQUIRED ON)\nset(CMAKE_CUDA_STANDARD 17)\nset(CMAKE_CUDA_STANDARD_REQUIRED ON)\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\nset(CMAKE_INCLUDE_CURRENT_DIR TRUE)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME \"Use static cudaruntime library\" OFF)\n\nfind_package(Threads REQUIRED)\nfind_package(CUDAToolkit REQUIRED)\nfind_package(OpenCV REQUIRED)\n\nif(NOT TARGET TensorRT::TensorRT)\n  include(FindTensorRT.cmake)\nelse()\n  message(\"TensorRT has been found, skipping for ${PROJECT_NAME}\")\nendif()\n\nadd_executable(${PROJECT_NAME} \"${PROJECT_NAME}.cc\" \"cuda_allocator.cc\"\n                               \"profiler.cc\")\ntarget_include_directories(${PROJECT_NAME} PRIVATE ${OpenCV_INCLUDE_DIRS})\ntarget_link_libraries(\n  ${PROJECT_NAME} PUBLIC Threads::Threads CUDA::cudart CUDA::cuda_driver\n                         TensorRT::TensorRT ${OpenCV_LIBS})\n\nif(WIN32)\n  set_target_properties(\n    ${PROJECT_NAME} PROPERTIES MSVC_RUNTIME_LIBRARY\n                               \"MultiThreaded$<$<CONFIG:Debug>:Debug>\")\nendif()\n"
  },
  {
    "path": "vit/FindTensorRT.cmake",
    "content": "cmake_minimum_required(VERSION 3.17.0)\n\nfunction(_guess_path var_name required_files)\n  set(_result \"\")\n\n  foreach(path_entry IN LISTS ARGN)\n    if(NOT EXISTS \"${path_entry}\")\n      message(DEBUG \"skip non-existing path '${path_entry}'\")\n      continue()\n    endif()\n\n    set(_ok TRUE)\n    foreach(required_file IN LISTS required_files)\n      if(NOT EXISTS \"${path_entry}/${required_file}\")\n        set(_ok FALSE)\n        message(DEBUG \"'${path_entry}' missing '${required_file}'\")\n        break()\n      endif()\n    endforeach()\n\n    if(_ok)\n      list(APPEND _result \"${path_entry}\")\n      message(DEBUG \"accept '${path_entry}'\")\n    else()\n      message(DEBUG \"reject '${path_entry}'\")\n    endif()\n  endforeach()\n\n  if(_result STREQUAL \"\")\n    message(\n      FATAL_ERROR\n        \"_guess_path(${var_name}) failed: no valid path found. required_files='${required_files}' candidates='${ARGN}'\"\n    )\n  endif()\n\n  set(${var_name}\n      \"${_result}\"\n      PARENT_SCOPE)\nendfunction()\n\n# add library\nadd_library(TensorRT IMPORTED INTERFACE)\nadd_library(TensorRT::TensorRT ALIAS TensorRT)\n\nset(TRT_VERSION\n    CACHE\n      STRING\n      \"TensorRT version, e.g. 
\\\"8.6.1.6\\\" or \\\"8.6.1.6+cuda12.0.1.011\\\", \\\"8.6.1.6.Windows10.x86_64.cuda-12.0\\\" etc\"\n)\n\nif(NOT TRT_VERSION STREQUAL \"\" AND NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  message(\n    WARNING\n      \"TRT_VERSION defined by cmake and environment variable both, using the later one\"\n  )\nendif()\n\nif(NOT $ENV{TRT_VERSION} STREQUAL \"\")\n  set(TRT_VERSION $ENV{TRT_VERSION})\nendif()\n\nstring(REGEX MATCH \"([0-9]+)\" _match ${TRT_VERSION})\nset(TRT_MAJOR_VERSION \"${_match}\")\nunset(_match)\n\nif(WIN32)\n  set(TensorRT_DIR \"C:/Program Files/TensorRT-${TRT_VERSION}\")\n  if(NOT EXISTS \"${TensorRT_DIR}\")\n    message(FATAL_ERROR \"TensorRT_DIR=${TensorRT_DIR} does not exist!\")\n  endif()\n\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 10)\n    set(_modules nvinfer_10 nvinfer_plugin_10 nvinfer_vc_plugin_10\n                 nvinfer_dispatch_10 nvinfer_lean_10)\n    message(DEBUG \"Using ${_modules}\")\n  else()\n    set(_modules nvinfer nvinfer_plugin nvinfer_vc_plugin nvinfer_dispatch\n                 nvinfer_lean)\n  endif()\n\n  set(TensorRT_LIBRARY_DIR \"${TensorRT_DIR}/lib\")\n  set(TensorRT_INCLUDE_DIR \"${TensorRT_DIR}/include\")\nelseif(UNIX)\n  string(TOLOWER \"${CMAKE_SYSTEM_PROCESSOR}\" _trt_arch)\n  set(_trt_include_candidates)\n  if(_trt_arch MATCHES \"^(aarch64|arm64|arch64)$\")\n    set(_trt_include_candidates \"/usr/include/aarch64-linux-gnu\" \"/usr/include\"\n                                \"/usr/local/cuda/targets/aarch64-linux/include\")\n    set(_trt_library_candidates\n        \"/usr/local/tensorrt/targets/aarch64-linux-gnu/lib\"\n        \"/usr/lib/aarch64-linux-gnu\" \"/usr/lib/aarch64-linux-gnu/tegra\"\n        \"/usr/lib\")\n  elseif(_trt_arch MATCHES \"^(x86_64|amd64)$\")\n    set(_trt_include_candidates\n        \"/usr/local/tensorrt/targets/x86_64-linux-gnu/include\"\n        \"/usr/include/x86_64-linux-gnu\" \"/usr/include\")\n    set(_trt_library_candidates\n        \"/usr/local/tensorrt/targets/x86_64-linux-gnu/lib\"\n   
     \"/usr/lib/x86_64-linux-gnu\" \"/usr/lib\")\n  else()\n    message(FATAL_ERROR \"Unknown architecture\")\n  endif()\n\n  set(_modules nvinfer nvinfer_plugin)\n  if(${TRT_MAJOR_VERSION} GREATER_EQUAL 8)\n    list(APPEND _modules nvinfer_vc_plugin nvinfer_dispatch nvinfer_lean)\n  endif()\n\n  _guess_path(TensorRT_LIBRARY_DIR \"libnvinfer.so;libnvinfer_plugin.so\"\n              ${_trt_library_candidates})\n  message(STATUS \"TensorRT libraries: ${TensorRT_LIBRARY_DIR}\")\n  _guess_path(TensorRT_INCLUDE_DIR \"NvInfer.h\" ${_trt_include_candidates})\n  message(STATUS \"TensorRT includes: ${TensorRT_INCLUDE_DIR}\")\nendif()\n\nforeach(lib IN LISTS _modules)\n  find_library(\n    TensorRT_${lib}_LIBRARY\n    NAMES ${lib}\n    HINTS ${TensorRT_LIBRARY_DIR})\n  list(APPEND TensorRT_LIBRARIES ${TensorRT_${lib}_LIBRARY})\nendforeach()\n\ntarget_link_libraries(TensorRT INTERFACE ${TensorRT_LIBRARIES})\n\nmessage(STATUS \"Found TensorRT libs: ${TensorRT_LIBRARIES}\")\n\nset_target_properties(\n  TensorRT\n  PROPERTIES C_STANDARD 17\n             CXX_STANDARD 17\n             POSITION_INDEPENDENT_CODE ON\n             SKIP_BUILD_RPATH TRUE\n             BUILD_WITH_INSTALL_RPATH TRUE\n             INSTALL_RPATH \"$ORIGIN\"\n             INTERFACE_INCLUDE_DIRECTORIES \"${TensorRT_INCLUDE_DIR}\")\n\nunset(TRT_MAJOR_VERSION)\nunset(_modules)\nunset(_trt_include_candidates)\nunset(_trt_library_candidates)\nunset(_trt_arch)\n"
  },
  {
    "path": "vit/README.md",
    "content": "# Vision Transformers (ViT)\n\n## 1. Overview\n\nThis is a handwritten TensorRT implementation of the Vision Transformer paper ([arXiv:2010.11929](https://arxiv.org/abs/2010.11929)).\n\n**Note**:\n\n- The Swi-GeLU activation layer is natively supported since TensorRT SDK **10.0**+; otherwise we can use an approximation, as TensorRT does. See below for details.\n\n## 2. Details\n\n### 2.1 Features\n\n- Support TensorRT SDK 8.5.1+ ~ 10.15.1+\n- Support Windows 11 OS\n- Support native or self-implemented Swi-GeLU\n- Support native or self-implemented multihead self-attention\n- Support a dummy profiler by default\n- Support a dummy output allocator by default\n- Use optimization profile by default\n\n### 2.2 Current limitations\n\n- cannot use `IAttention` with TensorRT SDK 10.14 ~ 10.15 because of bugs in TensorRT\n- TensorRT < 8 is not supported because some ops are not implemented in cuDNN\n- SM < 86, TensorRT < 10, CUDA < 12 cases are _NOT_ fully tested yet\n\n### 2.3 Usage\n\n1. use `gen_wts.py` to generate the `.wts` file.\n\n```bash\npython gen_wts.py\n```\n\n2. build the C++ code\n\n```bash\npushd tensorrtx/vit\ncmake -S . -B build -G Ninja --fresh\ncmake --build build\n```\n\n3. serialize the `.wts` model to an engine file.\n\n```bash\n./build/vit -s\n```\n\n4. 
run inference\n\n```bash\n./build/vit -d\n```\n\nOn **RTX 4080, TensorRT 10.15.1 SDK**, the output looks like:\n\n```bash\n...\n====\n1880us\n-1.125, 0.4623, -0.1215, -0.007384, -0.004307, -0.7021, -0.748, 0.2031, -0.4862, -0.008939, -1.151, -0.408, -0.3259, 0.2202, 0.04537, -2.008, -0.2832, 0.04394, 0.5326, 0.1724, 0.5655,\n====\nprediction result:\nTop: 0 idx: 285, logits: 8.262, label: Egyptian cat\nTop: 1 idx: 281, logits: 7.872, label: tabby, tabby cat\nTop: 2 idx: 282, logits: 6.477, label: tiger cat\n========== VisionTransformerProfiler ==========\n                                                                                                          TensorRT layer name    Runtime, %  Invocations Runtime, ms\n                                                                  Reformatting CopyNode for Input Tensor 0 to patch embedding          3.2%           20         0.95\n                                                                                                              patch embedding          1.5%           20         0.45\nReformatting CopyNode for Input Tensor 0 to {ForeignNode[(Unnamed Layer* 3) [Constant]...(Unnamed Layer* 518) [ElementWise]]}          0.2%           20         0.06\n                                                                                                        __myl_ReshTran_myl3_0          0.8%           20         0.24\n                                                                __myl_ConcAddCastMeanSubMulMeanAddSqrtDivMulCastMulAdd_myl3_1          0.3%           20         0.08\n                vit.encoder.layer.0.attentionvalue+vit.encoder.layer.0.attentionkey+vit.encoder.layer.0.attentionquery_myl3_2          1.4%           20         0.40\n                                                                                                    __myl_TranReshMove_myl3_3          0.2%           20         0.06\n                                                                                                    
__myl_TranReshMove_myl3_4          0.2%           20         0.07\n                                                                                                    __myl_TranReshMove_myl3_5          0.2%           20         0.06\n                                                                                                          _gemm_mha_v2_myl3_6          0.5%           20         0.14\n                                                                                                    __myl_MoveReshTran_myl3_7          0.2%           20         0.06\n...\n========== VisionTransformerProfiler total runtime = 29.67 ms ==========\n```\n\nas is shown above, we successfully triggered the internal MHA fused kernel fusion pass inside TensorRT (i.e., **\"Myelin\"** or **\"myl\"** in short), especially the MHA fused kernel: `_gemm_mha_v2_myl3_6`.\n\n## 3. transformer details\n\n`ViTLayer()` builds one ViT encoder block (Transformer encoder layer) using TensorRT primitives. The implementation corresponds to a **Pre-LayerNorm** Transformer layer (typical for ViT), including:\n\n- LayerNorm before attention\n- Multi-Head Self-Attention (MHSA): QKV projections → scaled dot-product attention → output projection\n- Residual connection\n- LayerNorm after attention\n- Feed-Forward Network (FFN / MLP): dense → GeLU → dense\n- Residual connection\n\nThe function returns the final residual output tensor.\n\n### 3.1 Notation and Tensor Shapes\n\nLet the input tensor (TensorRT `input`) be:\n\n$$\n\\mathbf{X} \\in \\mathbb{R}^{N \\times L \\times D}\n$$\n\nWhere:\n\n- (N): batch size (represented by `N` in your code)\n- (L): sequence length (number of tokens; dynamic in code via `-1`)\n- (D): hidden size, fixed at 768 in this implementation\n\nThe attention head configuration:\n\n$$\nH = \\tt{param.head\\_num}, \\qquad d = \\frac{D}{H}\n$$\n\n### 3.2 Weight shapes (conceptual)\n\nFor a standard Transformer block:\n\n- Q/K/V projection weights:\n  $$\n  \\mathbf{W}_Q, 
\\mathbf{W}_K, \\mathbf{W}_V \\in \\mathbb{R}^{D \\times D}\n  $$\n- Q/K/V biases (**NOTE**: not used by the native NVIDIA interface):\n  $$\n  \\mathbf{b}_Q, \\mathbf{b}_K, \\mathbf{b}_V \\in \\mathbb{R}^{D}\n  $$\n- Output projection:\n  $$\n  \\mathbf{W}_O \\in \\mathbb{R}^{D \\times D}, \\quad \\mathbf{b}_O \\in \\mathbb{R}^{D}\n  $$\n- FFN (MLP) with expansion ratio 4:\n  $$\n  \\mathbf{W}_1 \\in \\mathbb{R}^{4D \\times D}, \\ \\mathbf{b}_1 \\in \\mathbb{R}^{4D}\n  $$\n  $$\n  \\mathbf{W}_2 \\in \\mathbb{R}^{D \\times 4D}, \\ \\mathbf{b}_2 \\in \\mathbb{R}^{D}\n  $$\n  Here ($4 D = 3072$).\n\n### 3.3 High-Level Block Structure\n\n_Pre-LN Transformer Encoder Layer_ implements the following canonical computation:\n\n$$\n\\begin{aligned}\n\\mathbf{X}' &= \\mathrm{LN}_1(\\mathbf{X}) \\\\\n\\mathbf{A} &= \\mathrm{MHSA}(\\mathbf{X}') \\\\\n\\mathbf{Y} &= \\mathbf{X} + \\mathbf{A} \\\\\n\\mathbf{Y}' &= \\mathrm{LN}_2(\\mathbf{Y}) \\\\\n\\mathbf{F} &= \\mathrm{FFN}(\\mathbf{Y}') \\\\\n\\mathbf{Z} &= \\mathbf{Y} + \\mathbf{F}\n\\end{aligned}\n$$\n\nThe function returns ($\\mathbf{Z}$).\n\n### 3.4 LayerNorm Definition\n\nLayerNorm is applied over the **last dimension** (D) (hidden size), independently for each ($(n, \\ell)$) position.\n\nFor a token vector ($\\mathbf{x} \\in \\mathbb{R}^{D}$):\n\n$$\n\\mathrm{LN}(\\mathbf{x}) = \\gamma \\odot \\frac{\\mathbf{x} - \\mu}{\\sqrt{\\sigma^2 + \\varepsilon}} + \\beta\n$$\n\nWhere:\n\n$$\n\\mu = \\frac{1}{D}\\sum_{i=1}^{D} x_i,\n\\qquad\n\\sigma^2 = \\frac{1}{D}\\sum_{i=1}^{D}(x_i - \\mu)^2\n$$\n\n- ($\\gamma$) corresponds to `.weight`\n- ($\\beta$) corresponds to `.bias`\n- ($\\varepsilon = \\tt{param.lnorm\\_eps}$)\n\n### 3.5 QKV Projections (Code Section 2.1)\n\n#### 3.5.1 Linear projections\n\nLet:\n\n$$\n\\mathbf{X}' = \\mathrm{LN}_1(\\mathbf{X})\n$$\n\nCompute:\n\n$$\n\\begin{aligned}\n\\mathbf{Q} &= \\mathbf{X}' \\mathbf{W}_Q^\\top + \\mathbf{b}_Q \\\\\n\\mathbf{K} &= \\mathbf{X}' \\mathbf{W}_K^\\top + \\mathbf{b}_K 
\\\\\n\\mathbf{V} &= \\mathbf{X}' \\mathbf{W}_V^\\top + \\mathbf{b}_V\n\\end{aligned}\n\\qquad\n\\mathbf{Q},\\mathbf{K},\\mathbf{V} \\in \\mathbb{R}^{N \\times L \\times D}\n$$\n\n#### 3.5.2 Multi-Head Reshape + Transpose (Shuffle Layers)\n\nMulti-head attention splits the hidden dimension (D) into (H) heads of size (d).\n\n#### 3.5.3 Reshape and transpose\n\nStarting from:\n\n$$\n\\mathbf{Q} \\in \\mathbb{R}^{N \\times L \\times D}\n$$\n\nReshape:\n\n$$\n\\mathbf{Q}_r \\in \\mathbb{R}^{N \\times L \\times H \\times d}\n$$\n\nTranspose (swap axes to put heads first):\n\n$$\n\\mathbf{Q}_h \\in \\mathbb{R}^{N \\times H \\times L \\times d}\n$$\n\nSame for ($\\mathbf{K}$) and ($\\mathbf{V}$).\n\nCode:\n\n```cpp\nq_s->setReshapeDimensions(Dims4{N, -1, H, d});\nq_s->setSecondTranspose({0, 2, 1, 3}); // (N,H,L,d)\n```\n\n#### 3.5.4 SDPA (Scaled Dot-Product Attention)\n\nFor each batch (n) and head (h), define:\n\n$$\n\\mathbf{Q}^{(n,h)} \\in \\mathbb{R}^{L \\times d}, \\quad\n\\mathbf{K}^{(n,h)} \\in \\mathbb{R}^{L \\times d}, \\quad\n\\mathbf{V}^{(n,h)} \\in \\mathbb{R}^{L \\times d}\n$$\n\n#### 3.5.5 Attention logits ($QK^\\top$)\n\n$$\n\\mathbf{S}^{(n,h)} = \\mathbf{Q}^{(n,h)} \\left(\\mathbf{K}^{(n,h)}\\right)^\\top\n\\in \\mathbb{R}^{L \\times L}\n$$\n\nIn tensor form:\n\n$$\n\\mathbf{S} \\in \\mathbb{R}^{N \\times H \\times L \\times L}\n$$\n\nCode:\n\n```cpp\nqk = MatMul(q_s, NONE, k_s, TRANSPOSE); // (N,H,L,d) x (N,H,d,L) -> (N,H,L,L)\n```\n\n#### 3.5.6 Scaling\n\nScaled dot-product uses:\n\n$$\n\\alpha = \\frac{1}{\\sqrt{d}}\n$$\n\n$$\n\\tilde{\\mathbf{S}} = \\alpha \\mathbf{S}\n$$\n\nCode:\n\n```cpp\nscale_val = 1/sqrt(d);\nattn_qk = qk * scale; // ElementWise PROD\n```\n\n#### 3.5.7 Softmax normalization\n\nSoftmax is applied on the **last dimension** (keys index) for each query position, so:\n\n$$\n\\mathbf{P} = \\mathrm{softmax}(\\tilde{\\mathbf{S}}) \\in \\mathbb{R}^{N \\times H \\times L \\times L}\n$$\n\nCode:\n\n```cpp\nqk_softmax = SoftMax(attn_qk);\nqk_softmax->setAxes(1U << (nbDims-1)); // 
last axis\n```\n\n#### 3.5.8 Weighted sum of values\n\nEach head output:\n\n$$\n\\mathbf{O}^{(n,h)} = \\mathbf{P}^{(n,h)} \\mathbf{V}^{(n,h)}\n\\in \\mathbb{R}^{L \\times d}\n$$\n\nThus:\n\n$$\n\\mathbf{O} \\in \\mathbb{R}^{N \\times H \\times L \\times d}\n$$\n\nCode:\n\n```cpp\nattn_qkv = MatMul(qk_softmax, NONE, v_s, NONE); // (N,H,L,L)x(N,H,L,d)->(N,H,L,d)\n```\n\n### 3.6 Merge Heads + Output Projection\n\n#### 3.6.1 Merge heads\n\nTranspose back:\n\n$$\n\\mathbf{O} \\in \\mathbb{R}^{N \\times H \\times L \\times d}\n\\ \\xrightarrow{\\text{transpose}}\n\\mathbb{R}^{N \\times L \\times H \\times d}\n$$\n\nThen reshape:\n\n$$\n\\mathbb{R}^{N \\times L \\times (H\\cdot d)} = \\mathbb{R}^{N \\times L \\times D}\n$$\n\nCode:\n\n```cpp\nattn_out->setFirstTranspose({0, 2, 1, 3}); // (N,L,H,d)\nattn_out->setReshapeDimensions(Dims3{N, -1, 768}); // (N,L,D)\n```\n\n#### 3.6.2 Output projection\n\n$$\n\\mathbf{A} = \\mathbf{O}_{\\text{merged}} \\mathbf{W}_O^\\top + \\mathbf{b}_O\n\\quad\\in\\mathbb{R}^{N \\times L \\times D}\n$$\n\nCode:\n\n```cpp\nattn_fcw = MatMul(attn_out, out_proj_w^T);\nattn_fcb = attn_fcw + out_proj_b;\n```\n\n### 3.7 Residual Connection After Attention\n\n$$\n\\mathbf{Y} = \\mathbf{X} + \\mathbf{A}\n\\quad\\in\\mathbb{R}^{N \\times L \\times D}\n$$\n\nCode:\n\n```cpp\nattn_residual = input + attn_fcb;\n```\n\nThis identity path is crucial for gradient flow and stability; at inference time it preserves a “direct” signal path even if attention becomes sharp or noisy.\n\n### 3.8 Post-Attention LayerNorm\n\n$$\n\\mathbf{Y}' = \\mathrm{LN}_2(\\mathbf{Y})\n$$\n\nCode:\n\n```cpp\npost_lnorm = Normalization(attn_residual, post_ln_scale, post_ln_bias)\n```\n\n### 3.9 Feed-Forward Network (FFN / MLP)\n\nViT uses a 2-layer MLP with expansion ratio 4 and GeLU activation.\n\n#### 3.9.1 First dense layer (expand to 3072)\n\n$$\n\\mathbf{H} = \\mathbf{Y}' \\mathbf{W}_1^\\top + \\mathbf{b}_1\n\\quad\\in\\mathbb{R}^{N \\times L \\times 
4D}\n$$\n\nCode:\n\n```cpp\ninter0 = MatMul(post_lnorm, iw^T); // iw shape conceptually (4D, D)\ninter1 = inter0 + ib;\n```\n\n#### 3.9.2 GeLU activation\n\n$$\n\\mathrm{GeLU}(x) = x \\Phi(x)\n$$\n\nWhere ($\\Phi$) is the standard normal CDF.\n\nCommon tanh approximation (widely used in implementations):\n\n$$\n\\mathrm{GeLU}(x) \\approx \\frac\n{x\\times \\bigg(1+\\tanh\\Big(\\sqrt{\\frac{2}{\\pi}}\\times (x+0.044715\\times x^3)\\Big)\\bigg)}\n{2}\n$$\n\nCode calls:\n\n```cpp\ninter_act = addGeLU(net, inter1);\n```\n\n#### 3.9.3 Second dense layer (project back to 768)\n\n$$\n\\mathbf{F} = \\mathrm{GeLU}(\\mathbf{H}) \\mathbf{W}_2^\\top + \\mathbf{b}_2\n\\quad\\in\\mathbb{R}^{N \\times L \\times D}\n$$\n\nCode:\n\n```cpp\nout0 = MatMul(inter_act, ow^T); // ow conceptually (D, 4D)\nout1 = out0 + ob;\n```\n\n### 3.10 Final Residual Connection\n\n$$\n\\mathbf{Z} = \\mathbf{Y} + \\mathbf{F}\n\\quad\\in\\mathbb{R}^{N \\times L \\times D}\n$$\n\nCode:\n\n```cpp\noutput_residual = out1 + attn_residual;\nreturn output_residual;\n```\n\n## 4. Compact Step-by-Step Shape Trace\n\nBelow is a shape trace aligned with the main operations (assuming dynamic $L$):\n\nInput\n\n$$ \\mathbf{X}: (N, L, 768) $$\n\nPre-LN\n\n$$ \\mathbf{X}': (N, L, 768) $$\n\nQ/K/V projections\n\n$$ \\mathbf{Q},\\mathbf{K},\\mathbf{V}: (N, L, 768) $$\n\nReshape + transpose to heads\n\n$$ \\mathbf{Q}\\_h,\\mathbf{K}\\_h,\\mathbf{V}\\_h: (N, H, L, d) $$\n\nAttention logits\n\n$$ \\mathbf{S}: (N, H, L, L) $$\n\nSoftmax weights\n\n$$ \\mathbf{P}: (N, H, L, L) $$\n\nHead outputs\n\n$$ \\mathbf{O}: (N, H, L, d) $$\n\nMerge heads\n\n$$ \\mathbf{O}\\_{\\text{merged}}: (N, L, 768) $$\n\nOutput projection\n\n$$ \\mathbf{A}: (N, L, 768) $$\n\nResidual\n\n$$ \\mathbf{Y}: (N, L, 768) $$\n\nPost-LN\n\n$$ \\mathbf{Y}': (N, L, 768) $$\n\nFFN expand\n\n$$ \\mathbf{H}: (N, L, 3072) $$\n\nFFN project\n\n$$ \\mathbf{F}: (N, L, 768) $$\n\nFinal residual\n\n$$ \\mathbf{Z}: (N, L, 768) $$\n"
  },
  {
    "path": "vit/cuda_allocator.cc",
    "content": "#include \"cuda_allocator.h\"\n#include <cuda.h>\n#include <cstdint>\n#include <cstdlib>\n#include <memory>\n#include <mutex>\n#include \"macros.h\"\n#include \"utils.h\"\n\nnamespace {\nconstexpr int kCudaVersionAsyncMin = 11020;\nconstexpr int kCudaVersionCuMemMin = 12000;\n}  // namespace\n\nstruct CudaOutputAllocator::Allocation {\n    void* ptr{nullptr};\n    std::size_t size{0};\n    OutputAllocKind kind{OutputAllocKind::kCudaMallocManaged};\n    CUmemGenericAllocationHandle handle{};\n    CUdeviceptr addr{};\n    std::size_t mapped_size{0};\n};\n\nstatic auto getCudaRuntimeVersion() -> int {\n    int version = 0;\n    if (cudaRuntimeGetVersion(&version) != cudaSuccess) {\n        return 0;\n    }\n    return version;\n}\n\nstatic auto getCudaDriverVersion() -> int {\n    int version = 0;\n    if (cudaDriverGetVersion(&version) != cudaSuccess) {\n        return 0;\n    }\n    return version;\n}\n\nstd::unique_ptr<CudaOutputAllocator> CudaOutputAllocator::Create(cudaStream_t stream, int device) {\n    CHECK(cudaSetDevice(device));\n    const int rt = getCudaRuntimeVersion();\n    const int drv = getCudaDriverVersion();\n\n    OutputAllocKind kind = OutputAllocKind::kCudaMallocManaged;\n    if (rt >= kCudaVersionCuMemMin && drv >= kCudaVersionCuMemMin) {\n        kind = OutputAllocKind::kCuMem;\n    } else if (rt >= kCudaVersionAsyncMin) {\n        kind = OutputAllocKind::kCudaMallocAsync;\n    }\n    return std::make_unique<CudaOutputAllocator>(stream, kind, device);\n}\n\nCudaOutputAllocator::CudaOutputAllocator(cudaStream_t stream, OutputAllocKind kind, int device)\n    : stream_(stream), kind_(kind), device_(device) {}\n\nCudaOutputAllocator::~CudaOutputAllocator() {\n    std::lock_guard<std::mutex> lock(mutex_);\n    for (auto& entry : allocations_) {\n        release(entry.first, entry.second);\n    }\n}\n\n#if TRT_VERSION < 10000\n// NOLINTNEXTLINE(bugprone-easily-swappable-parameters)\nvoid* CudaOutputAllocator::reallocateOutput(const 
char* tensorName, void* currentMemory, uint64_t size,\n                                            uint64_t alignment) TRT_NOEXCEPT {\n    (void)alignment;\n    if (!tensorName) {\n        return nullptr;\n    }\n    std::lock_guard<std::mutex> lock(mutex_);\n    auto& alloc = allocations_[tensorName];\n    if (alloc.ptr && size <= alloc.size) {\n        return alloc.ptr;\n    }\n    if (alloc.ptr) {\n        release(tensorName, alloc);\n    } else if (currentMemory != nullptr && size == 0) {\n        return currentMemory;\n    }\n\n    Allocation fresh = allocate(static_cast<std::size_t>(size));\n    if (!fresh.ptr) {\n        return nullptr;\n    }\n    alloc = fresh;\n    return alloc.ptr;\n}\n#else\n// NOLINTNEXTLINE(bugprone-easily-swappable-parameters)\nvoid* CudaOutputAllocator::reallocateOutputAsync(const char* tensorName, void* currentMemory, uint64_t size,\n                                                 uint64_t alignment, cudaStream_t stream) TRT_NOEXCEPT {\n    (void)alignment;\n    if (!tensorName) {\n        return nullptr;\n    }\n    if (stream == nullptr) {\n        stream = stream_;\n    }\n    stream_ = stream;\n    std::lock_guard<std::mutex> lock(mutex_);\n    auto& alloc = allocations_[tensorName];\n    if (alloc.ptr && size <= alloc.size) {\n        return alloc.ptr;\n    }\n    if (alloc.ptr) {\n        release(tensorName, alloc);\n    } else if (currentMemory != nullptr && size == 0) {\n        return currentMemory;\n    }\n\n    Allocation fresh = allocate(static_cast<std::size_t>(size));\n    if (!fresh.ptr) {\n        return nullptr;\n    }\n    alloc = fresh;\n    return alloc.ptr;\n}\n#endif\n\nvoid CudaOutputAllocator::notifyShape(const char* /*tensorName*/, nvinfer1::Dims const& /*dims*/) TRT_NOEXCEPT {}\n\nCudaOutputAllocator::Allocation CudaOutputAllocator::allocate(std::size_t size) {\n    Allocation alloc{};\n    if (size == 0) {\n        return alloc;\n    }\n    if (kind_ == OutputAllocKind::kCudaMallocAsync) {\n        void* 
ptr = nullptr;\n        if (cudaMallocAsync(&ptr, size, stream_) != cudaSuccess) {\n            return alloc;\n        }\n        alloc.ptr = ptr;\n        alloc.size = size;\n        alloc.kind = OutputAllocKind::kCudaMallocAsync;\n        return alloc;\n    }\n    if (kind_ == OutputAllocKind::kCudaMallocManaged) {\n        void* ptr = nullptr;\n        if (cudaMallocManaged(&ptr, size, cudaMemAttachGlobal) != cudaSuccess) {\n            return alloc;\n        }\n        alloc.ptr = ptr;\n        alloc.size = size;\n        alloc.kind = OutputAllocKind::kCudaMallocManaged;\n        return alloc;\n    }\n\n    if (cudaSetDevice(device_) != cudaSuccess) {\n        return alloc;\n    }\n    if (cuInit(0) != CUDA_SUCCESS) {\n        return alloc;\n    }\n\n    CUmemAllocationProp prop{};\n    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;\n    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;\n    prop.location.id = device_;\n\n    std::size_t granularity = 0;\n    if (cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM) != CUDA_SUCCESS) {\n        return alloc;\n    }\n\n    const std::size_t alloc_size = ((size + granularity - 1) / granularity) * granularity;\n    CUmemGenericAllocationHandle handle{};\n    if (cuMemCreate(&handle, alloc_size, &prop, 0) != CUDA_SUCCESS) {\n        return alloc;\n    }\n\n    CUdeviceptr addr = 0;\n    if (cuMemAddressReserve(&addr, alloc_size, 0, 0, 0) != CUDA_SUCCESS) {\n        cuMemRelease(handle);\n        return alloc;\n    }\n\n    if (cuMemMap(addr, alloc_size, 0, handle, 0) != CUDA_SUCCESS) {\n        cuMemAddressFree(addr, alloc_size);\n        cuMemRelease(handle);\n        return alloc;\n    }\n\n    CUmemAccessDesc access_desc{};\n    access_desc.location = prop.location;\n    access_desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;\n    if (cuMemSetAccess(addr, alloc_size, &access_desc, 1) != CUDA_SUCCESS) {\n        cuMemUnmap(addr, alloc_size);\n        cuMemAddressFree(addr, alloc_size);\n 
       cuMemRelease(handle);\n        return alloc;\n    }\n    static_assert(sizeof(void*) == sizeof(CUdeviceptr));\n    alloc.ptr = reinterpret_cast<void*>(addr);  // NOLINT(performance-no-int-to-ptr)\n    alloc.size = size;\n    alloc.kind = OutputAllocKind::kCuMem;\n    alloc.handle = handle;\n    alloc.addr = addr;\n    alloc.mapped_size = alloc_size;\n    return alloc;\n}\n\nvoid CudaOutputAllocator::release(const std::string& /*tensorName*/, Allocation& alloc) {\n    if (!alloc.ptr) {\n        return;\n    }\n    if (alloc.kind == OutputAllocKind::kCudaMallocAsync) {\n        cudaFreeAsync(alloc.ptr, stream_);\n    } else if (alloc.kind == OutputAllocKind::kCudaMallocManaged) {\n        cudaFree(alloc.ptr);\n    } else if (alloc.kind == OutputAllocKind::kCuMem) {\n        cuMemUnmap(alloc.addr, alloc.mapped_size);\n        cuMemRelease(alloc.handle);\n        cuMemAddressFree(alloc.addr, alloc.mapped_size);\n    }\n    alloc = Allocation{};\n}\n\nvoid* CudaOutputAllocator::getBuffer(const std::string& tensorName) const {\n    std::lock_guard<std::mutex> lock(mutex_);\n    auto it = allocations_.find(tensorName);\n    if (it == allocations_.end()) {\n        return nullptr;\n    }\n    return it->second.ptr;\n}\n\nstd::size_t CudaOutputAllocator::getSize(const std::string& tensorName) const {\n    std::lock_guard<std::mutex> lock(mutex_);\n    auto it = allocations_.find(tensorName);\n    if (it == allocations_.end()) {\n        return 0;\n    }\n    return it->second.size;\n}\n\nOutputAllocKind CudaOutputAllocator::kind() const {\n    return kind_;\n}\n"
  },
  {
    "path": "vit/cuda_allocator.h",
    "content": "#pragma once\n#include <NvInfer.h>\n#include <cuda_runtime_api.h>\n#include <cstddef>\n#include <memory>\n#include <mutex>\n#include <string>\n#include <unordered_map>\n#include \"macros.h\"\n\nenum class OutputAllocKind : std::uint8_t { kCudaMallocAsync, kCudaMallocManaged, kCuMem };\n\nclass CudaOutputAllocator final : public nvinfer1::IOutputAllocator {\n   public:\n    static std::unique_ptr<CudaOutputAllocator> Create(cudaStream_t stream, int device = 0);\n\n    explicit CudaOutputAllocator(cudaStream_t stream, OutputAllocKind kind, int device = 0);\n    ~CudaOutputAllocator() override;\n\n#if TRT_VERSION < 10000\n    void* reallocateOutput(const char* tensorName, void* currentMemory, uint64_t size,\n                           uint64_t alignment) TRT_NOEXCEPT override;\n#else\n    void* reallocateOutputAsync(const char* tensorName, void* currentMemory, uint64_t size, uint64_t alignment,\n                                cudaStream_t stream) TRT_NOEXCEPT override;\n#endif\n    void notifyShape(const char* tensorName, nvinfer1::Dims const& dims) TRT_NOEXCEPT override;\n\n    void* getBuffer(const std::string& tensorName) const;\n    std::size_t getSize(const std::string& tensorName) const;\n    OutputAllocKind kind() const;\n\n   private:\n    struct Allocation;\n    Allocation allocate(std::size_t size);\n    void release(const std::string& tensorName, Allocation& alloc);\n\n    cudaStream_t stream_{};\n    OutputAllocKind kind_{OutputAllocKind::kCudaMallocManaged};\n    int device_{0};\n    mutable std::mutex mutex_;\n    std::unordered_map<std::string, Allocation> allocations_;\n};\n"
  },
  {
    "path": "vit/gen_wts.py",
    "content": "import struct\n\nimport cv2\nimport numpy as np\nimport torch\nfrom transformers import AutoConfig, AutoImageProcessor, AutoModelForImageClassification\n\n\ndef read_imagenet_labels() -> dict[int, str]:\n    \"\"\"\n    read ImageNet 1000 labels\n\n    Returns:\n        dict[int, str]: labels dict\n    \"\"\"\n    clsid2label = {}\n    with open(\"../assets/imagenet1000_clsidx_to_labels.txt\", \"r\") as f:\n        for i in f.readlines():\n            k, v = i.split(\": \")\n            clsid2label.setdefault(int(k), v[1:-3])\n    return clsid2label\n\n\nUSE_HF_PREPROCESS = False\n\nif __name__ == \"__main__\":\n    hub_model_id = \"google/vit-base-patch16-224\"\n    config = AutoConfig.from_pretrained(hub_model_id)\n    config._attn_implementation = \"eager\"\n    model = AutoModelForImageClassification.from_pretrained(\n        hub_model_id,\n        ignore_mismatched_sizes=False,\n        config=config,\n    )\n\n    model.eval()\n\n    img = cv2.imread(\"../assets/cats.jpg\", cv2.IMREAD_COLOR)\n\n    if USE_HF_PREPROCESS:\n        image_processor = AutoImageProcessor.from_pretrained(hub_model_id)\n        img = image_processor(img, return_tensors=\"pt\")\n        img = img[\"pixel_values\"]\n    else:\n        # the third positional argument of cv2.resize is `dst`, so pass the interpolation mode by keyword\n        img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)\n        # scalar mean/std of 0.5 per channel keeps the array in float32\n        img = (img.astype(np.float32) / 255.0 - 0.5) / 0.5\n        img = torch.from_numpy(np.transpose(img, (2, 0, 1))[None, ...])\n\n    output = model(img)\n    labels = read_imagenet_labels()\n    for i, j in enumerate(torch.topk(output.logits[0], k=3).indices):\n        print(f\"Top: {i} is {labels[int(j)]}\")\n\n    f = open(\"../models/vit.wts\", \"w\")\n    f.write(\"{}\\n\".format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        print(\"key: \", k)\n        print(\"value: \", v.shape)\n        vr = v.reshape(-1).cpu().numpy()\n        f.write(\"{} {}\".format(k, len(vr)))\n        for vv 
in vr:\n            f.write(\" \")\n            f.write(struct.pack(\">f\", float(vv)).hex())\n        f.write(\"\\n\")\n    # close explicitly so the weights file is fully flushed to disk\n    f.close()\n"
  },
  {
    "path": "vit/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <cstdint>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include <utility>\n#include \"NvInferRuntime.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(std::move(prefix)), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) noexcept\n        : mOutput(other.mOutput), mPrefix(std::move(other.mPrefix)), mShouldLog(other.mShouldLog) {}\n\n    ~LogStreamConsumerBuffer() override {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing 
the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    int sync() override {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, std::string prefix, bool shouldLog)\n        : mBuffer(stream, std::move(prefix), shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other) noexcept\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   private:\n    struct TestInfo;\n\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult : std::uint8_t {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer1::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << '\\n';\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! 
This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, TestInfo info)\n            : mStarted(started), mName(std::move(info.name)), mCmdline(std::move(info.cmdline)) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom{false, TestInfo{name, cmdline}};\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! 
\\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    [[nodiscard]] Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    struct TestInfo {\n        std::string name;\n        std::string cmdline;\n    };\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << '\\n';\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kVERBOSE};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINFO};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kWARNING};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kERROR};\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer{logger.getReportableSeverity(), Severity::kINTERNAL_ERROR};\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "vit/macros.h",
    "content": "#pragma once\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#define TRT_VERSION \\\n    ((NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + (NV_TENSORRT_PATCH * 10) + NV_TENSORRT_BUILD)\n\n#if TRT_VERSION < 7220\n#error \"TensorRT >= 7.2.2 is required for this demo.\"\n#endif\n\n#if TRT_VERSION >= 8000\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n"
  },
  {
    "path": "vit/profiler.cc",
    "content": "#include \"profiler.h\"\n#include <NvInfer.h>\n#include <algorithm>\n#include <iomanip>\n#include <string>\n\nvoid Profiler::reportLayerTime(const char* layerName, float ms) noexcept {\n    mProfile[layerName].count++;\n    mProfile[layerName].time += ms;\n    if (std::find(mLayerNames.begin(), mLayerNames.end(), layerName) == mLayerNames.end()) {\n        mLayerNames.emplace_back(layerName);\n    }\n}\n\nProfiler::Profiler(const char* name, const std::vector<Profiler>& srcProfilers) : mName(name) {\n    for (const auto& srcProfiler : srcProfilers) {\n        for (const auto& rec : srcProfiler.mProfile) {\n            auto it = mProfile.find(rec.first);\n            if (it == mProfile.end()) {\n                mProfile.insert(rec);\n            } else {\n                it->second.time += rec.second.time;\n                it->second.count += rec.second.count;\n            }\n        }\n    }\n}\n\nstd::ostream& operator<<(std::ostream& out, const Profiler& value) {\n    out << \"========== \" << value.mName << \" ==========\\n\";\n    float totalTime = 0;\n    std::string layerNameStr = \"TensorRT layer name\";\n    int maxLayerNameLength = std::max(static_cast<int>(layerNameStr.size()), 70);\n    for (const auto& elem : value.mProfile) {\n        totalTime += elem.second.time;\n        maxLayerNameLength = std::max(maxLayerNameLength, static_cast<int>(elem.first.size()));\n    }\n\n    auto old_settings = out.flags();\n    auto old_precision = out.precision();\n    // Output header\n    {\n        out << std::setfill(' ') << std::setw(maxLayerNameLength) << layerNameStr << \" \";\n        out << std::setw(12) << \"Runtime, \" << \"%\" << \" \";\n        out << std::setw(12) << \"Invocations\" << \" \";\n        out << std::setw(12) << \"Runtime, ms\\n\";\n    }\n    for (size_t i = 0; i < value.mLayerNames.size(); i++) {\n        const std::string layerName = value.mLayerNames[i];\n        auto elem = value.mProfile.at(layerName);\n        out << 
std::setw(maxLayerNameLength) << layerName << \" \";\n        out << std::setw(12) << std::fixed << std::setprecision(1) << (elem.time * 100.0F / totalTime) << \"%\" << \" \";\n        out << std::setw(12) << elem.count << \" \";\n        out << std::setw(12) << std::fixed << std::setprecision(2) << elem.time << \"\\n\";\n    }\n    out.flags(old_settings);\n    out.precision(old_precision);\n    out << \"========== \" << value.mName << \" total runtime = \" << totalTime << \" ms ==========\\n\";\n\n    return out;\n}"
  },
  {
    "path": "vit/profiler.h",
    "content": "#include <NvInfer.h>\n\n#include <iostream>\n#include <map>\n#include <string>\n#include <vector>\n\nclass Profiler final : public nvinfer1::IProfiler {\n   public:\n    struct Record {\n        float time{0};\n        int count{0};\n    };\n    Profiler(const char* name, const std::vector<Profiler>& srcProfilers = std::vector<Profiler>());\n    void reportLayerTime(const char* layerName, float ms) noexcept override;\n    friend std::ostream& operator<<(std::ostream& out, const Profiler& value);\n\n   private:\n    std::string mName;\n    std::vector<std::string> mLayerNames;\n    std::map<std::string, Record> mProfile;\n};\n"
  },
  {
    "path": "vit/utils.h",
    "content": "#pragma once\n#include <cuda_fp16.h>\n#include <cuda_runtime_api.h>\n#include <algorithm>\n#include <cassert>\n#include <cstddef>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\nconstexpr const std::size_t WORKSPACE_SIZE = 16 << 20;\nnamespace {\n#define CHECK(status)                                     \\\n    do {                                                  \\\n        auto ret = (status);                              \\\n        if (ret != cudaSuccess) {                         \\\n            std::cerr << \"Cuda failure: \" << ret << \"\\n\"; \\\n            std::abort();                                 \\\n        }                                                 \\\n    } while (0)\n\nstatic void checkTrtEnv(int device = 0) {\n#if TRT_VERSION < 8000\n    CHECK(cudaGetDevice(&device));\n    cudaDeviceProp prop{};\n    CHECK(cudaGetDeviceProperties(&prop, device));\n    const int sm = prop.major * 10 + prop.minor;\n    if (sm > 86) {\n        std::cerr << \"TensorRT < 8 does not support SM > 86 on this GPU.\";\n        std::abort();\n    }\n#endif\n}\n\n/**\n * @brief TensorRT weight files have a simple space delimited format:\n * [type] [size] <data x size in hex>\n * \n * @param file input weight file path\n * @return std::map<std::string, nvinfer1::Weights> \n */\nstatic auto loadWeights(const std::string& file) {\n    std::cout << \"Loading weights: \" << file << \"\\n\";\n    std::map<std::string, nvinfer1::Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{.type = nvinfer1::DataType::kFLOAT, .values = nullptr, .count = 0};\n\n        
// Read name and element count of blob\n        std::string name;\n        input >> name >> std::dec >> wt.count;\n\n        // Load blob\n        auto* val = new uint32_t[wt.count];\n        input >> std::hex;\n        for (auto x = 0ll; x < wt.count; ++x) {\n            input >> val[x];\n        }\n        wt.values = val;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\n/**\n * @brief Preprocess function matching the ImageNet preprocessing in torchvision; only supports 3-channel images\n *\n * @param img OpenCV image with BGR layout\n * @param bgr2rgb whether to convert BGR to RGB\n * @param mean per-channel mean, subtracted after scaling pixels to [0, 1]\n * @param std per-channel standard deviation, divided after subtracting the mean\n * @param n batch size\n * @param h resize height\n * @param w resize width\n * @return std::vector<half> contiguous flattened image data in fp16 (CHW, repeated n times)\n */\nstatic auto preprocess_img(cv::Mat& img, bool bgr2rgb, const std::array<const float, 3>& mean,\n                           const std::array<const float, 3>& std, int64_t n, int32_t h, int32_t w) {\n    const auto c = img.channels();\n    const auto size = c * h * w;\n    if (c != 3) {\n        std::cerr << \"this demo only supports 3 channel input image.\\n\";\n        std::abort();\n    }\n    if (bgr2rgb) {\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n    }\n    cv::resize(img, img, cv::Size(w, h), 0, 0, cv::INTER_LINEAR);\n\n    // Keep preprocessing in fp32 on CPU for correctness, then pack to fp16 CHW for TensorRT input.\n    img.convertTo(img, CV_32FC3, 1.f / 255.f);\n    img = (img - cv::Scalar(mean[0], mean[1], mean[2])) / cv::Scalar(std[0], std[1], std[2]);\n    std::vector<half> chw(static_cast<std::size_t>(n) * c * h * w);\n\n    // fill every batch element with the same input image\n    for (int i = 0; i < n; ++i) {\n        for (int y = 0; y < h; ++y) {\n            for (int x = 0; x < w; ++x) {\n                const cv::Vec3f v = img.at<cv::Vec3f>(y, x);\n                chw[i * size + 0 * h * w + y * w + x] = __float2half(v[0]);\n                
chw[i * size + 1 * h * w + y * w + x] = __float2half(v[1]);\n                chw[i * size + 2 * h * w + y * w + x] = __float2half(v[2]);\n            }\n        }\n    }\n    return chw;\n}\n\nstatic auto topk(const std::vector<float>& v, int k) -> std::vector<std::pair<int, float>> {\n    if (k <= 0)\n        return {};\n    auto stride = std::min<std::ptrdiff_t>(k, static_cast<std::ptrdiff_t>(v.size()));\n\n    std::vector<int> idx(v.size());\n    std::iota(idx.begin(), idx.end(), 0);\n\n    std::partial_sort(idx.begin(), idx.begin() + stride, idx.end(), [&](int a, int b) { return v[a] > v[b]; });\n\n    std::vector<std::pair<int, float>> out;\n    out.reserve(stride);\n    for (int i = 0; i < stride; ++i)\n        out.emplace_back(idx[i], v[idx[i]]);\n    return out;\n}\n\nstatic auto loadImagenetLabelMap(const std::string& path) {\n    std::map<int, std::string> labels;\n    std::ifstream in(path);\n    if (!in.is_open()) {\n        return labels;\n    }\n    std::string line;\n    while (std::getline(in, line)) {\n        auto colon = line.find(':');\n        if (colon == std::string::npos) {\n            continue;\n        }\n        auto first_quote = line.find('\\'', colon);\n        if (first_quote == std::string::npos) {\n            continue;\n        }\n        auto second_quote = line.find('\\'', first_quote + 1);\n        if (second_quote == std::string::npos) {\n            continue;\n        }\n        int idx = std::stoi(line.substr(0, colon));\n        labels[idx] = line.substr(first_quote + 1, second_quote - first_quote - 1);\n    }\n    return labels;\n}\n}  // namespace\n"
  },
  {
    "path": "vit/vit.cc",
    "content": "#include <NvInfer.h>\n#include <cuda_fp16.h>\n#include <cassert>\n#include <cstring>\n#include <fstream>\n#include <iostream>\n#include \"cuda_allocator.h\"\n#include \"logging.h\"\n#include \"macros.h\"\n#include \"profiler.h\"\n#include \"utils.h\"\n\nusing namespace nvinfer1;\nusing WeightMap = std::map<std::string, Weights>;\nusing M = nvinfer1::MatrixOperation;\nusing E = nvinfer1::ElementWiseOperation;\nusing NDCF = nvinfer1::NetworkDefinitionCreationFlag;\n\nstatic constexpr const int64_t N = 1;\nstatic constexpr const int64_t INPUT_H = 224;\nstatic constexpr const int64_t INPUT_W = 224;\n\nstatic constexpr const char* WTS_PATH = \"../models/vit.wts\";\nstatic constexpr const char* ENGINE_PATH = \"../models/vit.engine\";\nstatic constexpr const char* LABELS_PATH = \"../assets/imagenet1000_clsidx_to_labels.txt\";\nstatic constexpr const std::array<const char*, 2> NAMES = {\"input\", \"logits\"};\nstatic constexpr const std::array<int64_t, 2> SIZES = {3 * INPUT_H * INPUT_W, 1000};\nstatic constexpr const std::array<const float, 3> mean = {0.5f, 0.5f, 0.5f};\nstatic constexpr const std::array<const float, 3> stdv = {0.5f, 0.5f, 0.5f};\n\nstatic Logger gLogger;\n\nstatic auto bytesPerElement(DataType t) -> std::size_t {\n    switch (t) {\n        case DataType::kFLOAT:\n            return 4;\n        case DataType::kHALF:\n            return 2;\n        case DataType::kINT32:\n            return 4;\n#if TRT_VERSION >= 8000\n        case DataType::kBOOL:\n#endif\n#if TRT_VERSION >= 8500\n        case DataType::kUINT8:\n#endif\n        case DataType::kINT8:\n            return 1;\n        default:\n            std::cerr << \"Unsupported TensorRT DataType\\n\";\n            std::abort();\n    }\n}\n\nstatic void convertWeightMapToHalf(WeightMap& w) {\n    for (auto& kv : w) {\n        auto& wt = kv.second;\n        if (wt.type != DataType::kFLOAT || wt.values == nullptr || wt.count <= 0) {\n            continue;\n        }\n\n        auto* half_vals 
= new half[wt.count];\n        const auto* raw = reinterpret_cast<const uint32_t*>(wt.values);\n        for (int64_t i = 0; i < wt.count; ++i) {\n            float f;\n            std::memcpy(&f, &raw[i], sizeof(float));\n            half_vals[i] = __float2half(f);\n        }\n\n        delete[] raw;\n        wt.type = DataType::kHALF;\n        wt.values = half_vals;\n    }\n}\n\nstruct ViTParam {\n    uint32_t index;\n    uint32_t head_num;\n    float lnorm_eps = 1e-12f;\n};\n\nstatic auto addGeLU(INetworkDefinition* net, ITensor& input) -> ILayer* {\n#if TRT_VERSION < 10000\n    // tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))\n    const auto inputDims = input.getDimensions();\n\n    Dims scalarDims{};\n    scalarDims.nbDims = inputDims.nbDims;\n    for (int i = 0; i < scalarDims.nbDims; ++i) {\n        scalarDims.d[i] = 1;\n    }\n\n    static float _half = 0.5f;\n    static float _one = 1.0f;\n    static float _sqrt_2_div_pi = std::sqrt(2.0f / M_PI);\n    static float _coeff = 0.044715f;\n    auto* _w_half = net->addConstant(scalarDims, Weights{DataType::kFLOAT, &_half, 1});\n    auto* _w_one = net->addConstant(scalarDims, Weights{DataType::kFLOAT, &_one, 1});\n    auto* _w_sqrt_2_div_pi = net->addConstant(scalarDims, Weights{DataType::kFLOAT, &_sqrt_2_div_pi, 1});\n    auto* _w_coeff = net->addConstant(scalarDims, Weights{DataType::kFLOAT, &_coeff, 1});\n\n    auto* _x2 = net->addElementWise(input, input, E::kPROD);\n    auto* x3_0 = net->addElementWise(*_x2->getOutput(0), input, E::kPROD);\n    auto* x3_1 = net->addElementWise(*x3_0->getOutput(0), *_w_coeff->getOutput(0), E::kPROD);\n    auto* x3_2 = net->addElementWise(input, *x3_1->getOutput(0), E::kSUM);\n    auto* scaled = net->addElementWise(*x3_2->getOutput(0), *_w_sqrt_2_div_pi->getOutput(0), E::kPROD);\n\n    auto* t = net->addActivation(*scaled->getOutput(0), ActivationType::kTANH);\n    auto* one_plus = net->addElementWise(*t->getOutput(0), *_w_one->getOutput(0), 
E::kSUM);\n    auto* half_x = net->addElementWise(input, *_w_half->getOutput(0), E::kPROD);\n    return net->addElementWise(*half_x->getOutput(0), *one_plus->getOutput(0), E::kPROD);\n#else\n    // erf approximation\n    return net->addActivation(input, ActivationType::kGELU_ERF);\n#endif\n}\n\nstatic auto addLinearNorm(INetworkDefinition* net, ITensor& input, ITensor& scale, ITensor& bias,\n                          uint32_t axesMask) noexcept -> ILayer* {\n#if TRT_VERSION >= 11500\n    auto* ln = net->addNormalizationV2(input, scale, bias, axesMask);\n#else\n    auto* ln = net->addNormalization(input, scale, bias, axesMask);\n#endif\n    ln->setEpsilon(1e-12f);\n    return ln;\n}\n\nauto ViTLayer(INetworkDefinition* net, WeightMap& w, ITensor& input, const ViTParam& param) -> ITensor* {\n    std::string name = \"vit.encoder.layer.\" + std::to_string(param.index);\n    auto attn_name = name + \".attention\";\n    int64_t attn_head_size = 768LL / param.head_num;\n\n    auto* qw = net->addConstant(Dims3{1, 768, 768}, w.at(attn_name + \".attention.query.weight\"));\n    auto* kw = net->addConstant(Dims3{1, 768, 768}, w.at(attn_name + \".attention.key.weight\"));\n    auto* vw = net->addConstant(Dims3{1, 768, 768}, w.at(attn_name + \".attention.value.weight\"));\n    /* 1. layer norm before attention */\n    auto pre_ln_name = name + \".layernorm_before\";\n    auto dims = input.getDimensions();\n    uint32_t axes = 1U << static_cast<uint32_t>(dims.nbDims - 1);\n    auto* ln_scale = net->addConstant(Dims3{1, 1, dims.d[dims.nbDims - 1]}, w[pre_ln_name + \".weight\"]);\n    auto* ln_bias = net->addConstant(Dims3{1, 1, dims.d[dims.nbDims - 1]}, w[pre_ln_name + \".bias\"]);\n    auto* pre_lnorm = addLinearNorm(net, input, *ln_scale->getOutput(0), *ln_bias->getOutput(0), axes);\n\n    /** 2. 
multi-head self-attention */\n    auto* qb = net->addConstant(Dims3{1, 1, 768}, w.at(attn_name + \".attention.query.bias\"));\n    auto* kb = net->addConstant(Dims3{1, 1, 768}, w.at(attn_name + \".attention.key.bias\"));\n    auto* vb = net->addConstant(Dims3{1, 1, 768}, w.at(attn_name + \".attention.value.bias\"));\n    auto* _lno = pre_lnorm->getOutput(0);\n    // 2.1 Q, K attention matmul\n    auto* _q_attn = net->addMatrixMultiply(*_lno, M::kNONE, *qw->getOutput(0), M::kTRANSPOSE);\n    auto* _k_attn = net->addMatrixMultiply(*_lno, M::kNONE, *kw->getOutput(0), M::kTRANSPOSE);\n    auto* _v_attn = net->addMatrixMultiply(*_lno, M::kNONE, *vw->getOutput(0), M::kTRANSPOSE);\n    _q_attn->setName((attn_name + \"query\").c_str());\n    _k_attn->setName((attn_name + \"key\").c_str());\n    _v_attn->setName((attn_name + \"value\").c_str());\n    auto* q_attn = net->addElementWise(*_q_attn->getOutput(0), *qb->getOutput(0), E::kSUM);\n    auto* k_attn = net->addElementWise(*_k_attn->getOutput(0), *kb->getOutput(0), E::kSUM);\n    auto* v_attn = net->addElementWise(*_v_attn->getOutput(0), *vb->getOutput(0), E::kSUM);\n    auto* q_s = net->addShuffle(*q_attn->getOutput(0));\n    auto* k_s = net->addShuffle(*k_attn->getOutput(0));\n    auto* v_s = net->addShuffle(*v_attn->getOutput(0));\n    q_s->setReshapeDimensions(Dims4{0, 0, param.head_num, attn_head_size});\n    q_s->setSecondTranspose({0, 2, 1, 3});\n    k_s->setReshapeDimensions(Dims4{0, 0, param.head_num, attn_head_size});\n    k_s->setSecondTranspose({0, 2, 1, 3});\n    v_s->setReshapeDimensions(Dims4{0, 0, param.head_num, attn_head_size});\n    v_s->setSecondTranspose({0, 2, 1, 3});\n\n    // 2.2 Q, K scaling (and softmax / fused attention)\n    const float scale_f = 1.0f / std::sqrt(static_cast<float>(attn_head_size));\n    if (input.getType() == DataType::kHALF) {\n        auto* scale_val = new half[1];\n        scale_val[0] = __float2half(scale_f);\n        w[attn_name + \".scale\"] = Weights{.type = 
DataType::kHALF, .values = scale_val, .count = 1};\n    } else {\n        auto* scale_val = new uint32_t[1];\n        std::memcpy(scale_val, &scale_f, sizeof(float));\n        w[attn_name + \".scale\"] = Weights{.type = DataType::kFLOAT, .values = scale_val, .count = 1};\n    }\n    auto* qk_scale_w = net->addConstant(Dims4{1, 1, 1, 1}, w.at(attn_name + \".scale\"));\n\n    // 2.3 QKV attention output and reshape\n#if TRT_VERSION >= 11400 && TRT_VERSION < 11500\n    gLogger.log(Severity::kWARNING,\n                \"IAttention is available in TensorRT 10.14.1 SDK but has bugs; use 10.15.1+ to enable native fused \"\n                \"kernel\");\n#endif\n#if TRT_VERSION >= 11500\n    using ANO = AttentionNormalizationOp;\n    auto* q_scaled = net->addElementWise(*q_s->getOutput(0), *qk_scale_w->getOutput(0), E::kPROD)->getOutput(0);\n    auto* attn = net->addAttention(*q_scaled, *k_s->getOutput(0), *v_s->getOutput(0), ANO::kSOFTMAX, false);\n    assert(attn != nullptr);\n    auto status = attn->setDecomposable(false);\n    assert(status);\n    auto* attn_out = net->addShuffle(*attn->getOutput(0));\n#else\n    auto* qk = net->addMatrixMultiply(*q_s->getOutput(0), M::kNONE, *k_s->getOutput(0), M::kTRANSPOSE);\n    auto* attn_qk = net->addElementWise(*qk->getOutput(0), *qk_scale_w->getOutput(0), E::kPROD);\n    auto* qk_softmax = net->addSoftMax(*attn_qk->getOutput(0));\n    qk_softmax->setAxes(1U << static_cast<uint32_t>(attn_qk->getOutput(0)->getDimensions().nbDims - 1));\n    auto* attn_qkv = net->addMatrixMultiply(*qk_softmax->getOutput(0), M::kNONE, *v_s->getOutput(0), M::kNONE);\n    attn_qkv->setName((attn_name + \".attn_qkv\").c_str());\n    auto* attn_out = net->addShuffle(*attn_qkv->getOutput(0));\n#endif\n    attn_out->setFirstTranspose({0, 2, 1, 3});\n    attn_out->setReshapeDimensions(Dims3{0, 0, 768});\n    // 2.4 attention output projection\n    auto* out_proj_w = net->addConstant(Dims3{1, 768, 768}, w.at(attn_name + 
\".output.dense.weight\"))->getOutput(0);\n    auto* out_proj_b = net->addConstant(Dims3{1, 1, 768}, w.at(attn_name + \".output.dense.bias\"))->getOutput(0);\n    auto* attn_fcw = net->addMatrixMultiply(*attn_out->getOutput(0), M::kNONE, *out_proj_w, M::kTRANSPOSE);\n    auto* attn_fcb = net->addElementWise(*attn_fcw->getOutput(0), *out_proj_b, E::kSUM);\n    attn_fcb->setName((attn_name + \".out_proj\").c_str());\n\n    /** 3. attention and hidden state residual connection */\n    auto* attn_residual = net->addElementWise(input, *attn_fcb->getOutput(0), E::kSUM);\n    attn_residual->setName((name + \"attn_residual\").c_str());\n\n    /**  4. layer norm after attention */\n    auto post_ln_name = name + \".layernorm_after\";\n    ln_scale = net->addConstant(Dims3{1, 1, dims.d[dims.nbDims - 1]}, w[post_ln_name + \".weight\"]);\n    ln_bias = net->addConstant(Dims3{1, 1, dims.d[dims.nbDims - 1]}, w[post_ln_name + \".bias\"]);\n    auto* _res = attn_residual->getOutput(0);\n    axes = 1U << static_cast<uint32_t>(_res->getDimensions().nbDims - 1);\n    auto* post_lnorm = addLinearNorm(net, *_res, *ln_scale->getOutput(0), *ln_bias->getOutput(0), axes);\n\n    /** 6. intermediate (feed-forward) layer and activation */\n    auto intermediate_name = name + \".intermediate.dense\";\n    std::cout << \"Building: \" << intermediate_name << \"\\n\";\n    auto* iw = net->addConstant(Dims3{1, 3072, 768}, w[intermediate_name + \".weight\"]);\n    auto* ib = net->addConstant(Dims3{1, 1, 3072}, w[intermediate_name + \".bias\"]);\n    ib->setName((intermediate_name + \".bias\").c_str());\n    auto* inter0 = net->addMatrixMultiply(*post_lnorm->getOutput(0), M::kNONE, *iw->getOutput(0), M::kTRANSPOSE);\n    auto* inter1 = net->addElementWise(*inter0->getOutput(0), *ib->getOutput(0), E::kSUM);\n    auto* inter_act = addGeLU(net, *inter1->getOutput(0));\n\n    /** 7. 
output projection */\n    auto output_name = name + \".output.dense\";\n    std::cout << \"Building: \" << output_name << \"\\n\";\n    auto* ow = net->addConstant(Dims3{1, 768, 3072}, w[output_name + \".weight\"]);\n    auto* ob = net->addConstant(Dims3{1, 1, 768}, w[output_name + \".bias\"]);\n    ob->setName((output_name + \".bias\").c_str());\n    auto* out0 = net->addMatrixMultiply(*inter_act->getOutput(0), M::kNONE, *ow->getOutput(0), M::kTRANSPOSE);\n    auto* out1 = net->addElementWise(*out0->getOutput(0), *ob->getOutput(0), E::kSUM);\n\n    /** 8. residual */\n    auto* output_residual = net->addElementWise(*out1->getOutput(0), *attn_residual->getOutput(0), E::kSUM);\n    output_residual->setName((name + \".output_residual\").c_str());\n    return output_residual->getOutput(0);\n}\n\n// Create the engine using only the TensorRT API, without any parser.\nauto createEngine(int64_t N, IRuntime* runtime, IBuilder* builder, IBuilderConfig* config,\n                  DataType dt) -> ICudaEngine* {\n    WeightMap w = loadWeights(WTS_PATH);\n    if (dt == DataType::kHALF) {\n        convertWeightMapToHalf(w);\n    }\n\n#if TRT_VERSION >= 10000\n    auto* net = builder->createNetworkV2(1U << static_cast<uint32_t>(NDCF::kSTRONGLY_TYPED));\n#else\n    auto* net = builder->createNetworkV2(1U << static_cast<int>(NDCF::kEXPLICIT_BATCH));\n#endif\n\n    // 1. patch embedding\n    ITensor* data = net->addInput(NAMES[0], dt, Dims4{-1, 3, INPUT_H, INPUT_W});\n    std::string name = \"vit.embeddings.patch_embeddings.projection.\";\n    auto* embed = net->addConvolutionNd(*data, 768, DimsHW{16, 16}, w[name + \"weight\"], w[name + \"bias\"]);\n    embed->setName(\"patch embedding\");\n    embed->setStrideNd(DimsHW{16, 16});\n    auto* s = net->addShuffle(*embed->getOutput(0));\n    s->setReshapeDimensions(Dims3{0, 768, 14LL * 14});\n    s->setSecondTranspose({0, 2, 1});\n\n    // 2. 
add cls token and position embedding\n    auto* cls_token = net->addConstant(Dims3{1, 1, 768}, w[\"vit.embeddings.cls_token\"]);\n    auto* pos_embed = net->addConstant(Dims3{1, 197, 768}, w[\"vit.embeddings.position_embeddings\"]);\n    const std::array<ITensor*, 2> _cat = {cls_token->getOutput(0), s->getOutput(0)};\n    auto* cat = net->addConcatenation(_cat.data(), 2);\n    cat->setAxis(1);\n    cat->setName(\"cat_clstoken_embed\");\n    auto* pos_added = net->addElementWise(*cat->getOutput(0), *pos_embed->getOutput(0), ElementWiseOperation::kSUM);\n    pos_added->setName(\"position_embed\");\n\n    // 3. transformer encoder layers\n    ITensor* input = pos_added->getOutput(0);\n    for (auto i = 0u; i < 12; i++) {\n        auto* vit = ViTLayer(net, w, *input, {.index = i, .head_num = 12, .lnorm_eps = 1e-12f});\n        input = vit;\n    }\n\n    // 4. layer norm after transformer encoder\n    auto* ln_scale = net->addConstant(Dims3{1, 1, 768}, w[\"vit.layernorm.weight\"]);\n    auto* ln_bias = net->addConstant(Dims3{1, 1, 768}, w[\"vit.layernorm.bias\"]);\n    uint32_t axes = 1U << static_cast<uint32_t>(input->getDimensions().nbDims - 1);\n    auto* post_lnorm = addLinearNorm(net, *input, *ln_scale->getOutput(0), *ln_bias->getOutput(0), axes);\n    // 6. 
classifier head\n    auto* slice = net->addSlice(*post_lnorm->getOutput(0), Dims3{0, 0, 0}, Dims3{N, 1, 768}, Dims3{1, 1, 1});\n    auto* shuffle = net->addShuffle(*slice->getOutput(0));\n    shuffle->setReshapeDimensions(Dims2{N, 768});\n    auto* cls_w = net->addConstant(DimsHW{1000, 768}, w[\"classifier.weight\"]);\n    auto* cls_b = net->addConstant(DimsHW{1, 1000}, w[\"classifier.bias\"]);\n    auto* cls_0 = net->addMatrixMultiply(*shuffle->getOutput(0), M::kNONE, *cls_w->getOutput(0), M::kTRANSPOSE);\n    auto* cls_1 = net->addElementWise(*cls_b->getOutput(0), *cls_0->getOutput(0), E::kSUM);\n    net->markOutput(*cls_1->getOutput(0));\n\n    Dims4 _min{1, 3, INPUT_H, INPUT_W}, _opt{N, 3, INPUT_H, INPUT_W}, _max{2 * N, 3, INPUT_H, INPUT_W};\n#if TRT_VERSION >= 8000\n    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, WORKSPACE_SIZE);\n    config->setHardwareCompatibilityLevel(HardwareCompatibilityLevel::kAMPERE_PLUS);\n    auto* profile = builder->createOptimizationProfile();\n    profile->setDimensions(NAMES[0], OptProfileSelector::kMIN, _min);\n    profile->setDimensions(NAMES[0], OptProfileSelector::kOPT, _opt);\n    profile->setDimensions(NAMES[0], OptProfileSelector::kMAX, _max);\n    config->addOptimizationProfile(profile);\n    IHostMemory* mem = builder->buildSerializedNetwork(*net, *config);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(mem->data(), mem->size());\n    delete net;\n#else\n    builder->setMaxBatchSize(N);\n    config->setMaxWorkspaceSize(WORKSPACE_SIZE);\n    ICudaEngine* engine = builder->buildEngineWithConfig(*net, *config);\n    net->destroy();\n#endif\n\n    std::cout << \"build finished\\n\";\n    // Release host memory\n    for (auto& mem : w) {\n        if (mem.second.values == nullptr) {\n            continue;\n        }\n        if (mem.second.type == DataType::kHALF) {\n            delete[] reinterpret_cast<const half*>(mem.second.values);\n        } else {\n            // loadWeights() allocates with new 
uint32_t[]\n            delete[] reinterpret_cast<const uint32_t*>(mem.second.values);\n        }\n    }\n\n    return engine;\n}\n\nstd::vector<std::vector<float>> doInference(IExecutionContext& context, __half* input, std::size_t batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n    std::vector<void*> buffers;\n#if TRT_VERSION >= 10000\n    auto allocator = CudaOutputAllocator::Create(stream);\n#endif\n\n#if TRT_VERSION >= 8000\n    const int32_t nIO = engine.getNbIOTensors();\n#else\n    const int32_t nIO = engine.getNbBindings();\n#endif\n\n    buffers.resize(nIO, nullptr);\n    for (auto i = 0; i < nIO; ++i) {\n\n#if TRT_VERSION >= 8000\n        // TensorRT 8+ uses the name-based tensor API\n        auto* tensor_name = engine.getIOTensorName(i);\n        const auto dtype = engine.getTensorDataType(tensor_name);\n        std::size_t size = batchSize * SIZES[i] * bytesPerElement(dtype);\n#if TRT_VERSION >= 10000\n        // TensorRT 10+ uses an output allocator for output tensors\n        if (i == 0) {\n            CHECK(cudaMalloc(&buffers[i], size));\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n            context.setTensorAddress(tensor_name, buffers[i]);\n        } else {\n            context.setOutputAllocator(tensor_name, allocator.get());\n        }\n#else\n        if (i != 0) {\n            CHECK(cudaMalloc(&buffers[i], size));\n        } else {\n            CHECK(cudaMalloc(&buffers[i], size));\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, cudaMemcpyHostToDevice, stream));\n        }\n        context.setTensorAddress(tensor_name, buffers[i]);\n#endif\n#else\n        std::size_t size = batchSize * SIZES[i] * sizeof(float);\n        const int32_t idx = engine.getBindingIndex(NAMES[i]);\n        assert(idx == i);\n        CHECK(cudaMalloc(&buffers[i], size));\n        if (i == 0) {\n            CHECK(cudaMemcpyAsync(buffers[i], input, size, 
cudaMemcpyHostToDevice, stream));\n        }\n#endif\n    }\n\n    // Call enqueue outside assert() so inference still runs when NDEBUG is defined.\n#if TRT_VERSION >= 8000\n    const bool enqueued = context.enqueueV3(stream);\n#else\n    const bool enqueued = context.enqueueV2(buffers.data(), stream, nullptr);\n#endif\n    assert(enqueued);\n    (void)enqueued;\n\n    std::vector<std::vector<float>> prob;\n    for (int i = 1; i < nIO; ++i) {\n#if TRT_VERSION >= 10000\n        auto* tensor_name = engine.getIOTensorName(i);\n        const auto dtype = engine.getTensorDataType(tensor_name);\n        std::size_t size = batchSize * SIZES[i] * bytesPerElement(dtype);\n        void* out_ptr = allocator->getBuffer(tensor_name);\n        // D2H data transfer\n        if (dtype == DataType::kHALF) {\n            std::vector<__half> tmp_h(batchSize * SIZES[i]);\n            CHECK(cudaMemcpyAsync(tmp_h.data(), out_ptr, size, cudaMemcpyDeviceToHost, stream));\n            CHECK(cudaStreamSynchronize(stream));\n            std::vector<float> tmp(batchSize * SIZES[i]);\n            for (std::size_t j = 0; j < tmp.size(); ++j) {\n                tmp[j] = __half2float(tmp_h[j]);\n            }\n            prob.emplace_back(std::move(tmp));\n        } else {\n            std::vector<float> tmp(batchSize * SIZES[i], std::nanf(\"\"));\n            CHECK(cudaMemcpyAsync(tmp.data(), out_ptr, size, cudaMemcpyDeviceToHost, stream));\n            prob.emplace_back(std::move(tmp));\n        }\n#else\n        std::vector<float> tmp(batchSize * SIZES[i], std::nanf(\"\"));\n        std::size_t size = batchSize * SIZES[i] * sizeof(float);\n        CHECK(cudaMemcpyAsync(tmp.data(), buffers[i], size, cudaMemcpyDeviceToHost, stream));\n        prob.emplace_back(std::move(tmp));\n#endif\n    }\n    CHECK(cudaStreamSynchronize(stream));\n\n    for (auto& buffer : buffers) {\n        if (buffer != nullptr) {\n            CHECK(cudaFree(buffer));\n        }\n    }\n#if TRT_VERSION >= 10000\n    allocator.reset();\n#endif\n    CHECK(cudaStreamDestroy(stream));\n    return prob;\n}\n\nvoid APIToModel(int32_t N, IRuntime* runtime, IHostMemory** modelStream) 
{\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    ICudaEngine* engine = createEngine(N, runtime, builder, config, DataType::kHALF);\n    assert(engine != nullptr);\n\n    (*modelStream) = engine->serialize();\n\n#if TRT_VERSION >= 8000\n    delete engine;\n    delete config;\n    delete builder;\n#else\n    engine->destroy();\n    config->destroy();\n    builder->destroy();\n#endif\n}\n\nauto main(int argc, char** argv) -> int {\n    std::cout << \"TensorRT version: \" << TRT_VERSION << \"\\n\";\n    if (argc != 2) {\n        std::cerr << \"arguments not right!\\n\";\n        std::cerr << \"./vit -s  // serialize model to plan file\\n\";\n        std::cerr << \"./vit -d  // deserialize plan file and run inference\\n\";\n\n        return 1;\n    }\n#ifndef NDEBUG\n    gLogger.setReportableSeverity(nvinfer1::ILogger::Severity::kVERBOSE);\n#endif\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    char* trtModelStream{nullptr};\n    std::streamsize size{0};\n\n    if (std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(N, runtime, &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(ENGINE_PATH, std::ios::binary | std::ios::trunc);\n        if (!p) {\n            std::cerr << \"could not open plan output file\\n\";\n            return -1;\n        }\n        if (modelStream->size() > static_cast<std::size_t>(std::numeric_limits<std::streamsize>::max())) {\n            std::cerr << \"this model is too large to serialize\\n\";\n            return -1;\n        }\n        const auto* data_ptr = reinterpret_cast<const char*>(modelStream->data());\n        auto data_size = static_cast<std::streamsize>(modelStream->size());\n        p.write(data_ptr, data_size);\n#if TRT_VERSION >= 8000\n        delete modelStream;\n#else\n        modelStream->destroy();\n#endif\n        return 0;\n    } 
else if (std::string(argv[1]) == \"-d\") {\n        std::ifstream file(ENGINE_PATH, std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        } else {\n            std::cerr << \"read engine file error!\\n\";\n            return -1;\n        }\n\n#if TRT_VERSION >= 8000\n        ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n#else\n        ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);\n#endif\n        assert(engine != nullptr);\n        auto* context = engine->createExecutionContext();\n        assert(context != nullptr);\n\n        // VIT use default BGR order\n        auto img = cv::imread(\"../assets/cats.jpg\", cv::IMREAD_COLOR);\n        auto input = preprocess_img(img, false, mean, stdv, N, INPUT_H, INPUT_W);\n\n        Profiler profiler(\"VisionTransformerProfiler\");\n\n        // Warmup: run a few iterations without profiling.\n        for (int i = 0; i < 5; ++i) {\n            (void)doInference(*context, input.data(), N);\n        }\n\n        // Profiled runs\n        context->setProfiler(&profiler);\n        for (int i = 0; i < 20; ++i) {\n            auto start = std::chrono::system_clock::now();\n            auto prob = doInference(*context, input.data(), N);\n            auto end = std::chrono::system_clock::now();\n            auto period = std::chrono::duration_cast<std::chrono::microseconds>(end - start);\n            std::cout << period.count() << \"us\\n\";\n\n            for (const auto& vector : prob) {\n                int idx = 0;\n                for (auto v : vector) {\n                    std::cout << std::setprecision(4) << v << \", \" << std::flush;\n                    if (++idx > 20) {\n                      
  std::cout << \"\\n====\\n\";\n                        break;\n                    }\n                }\n            }\n\n            if (i == 19) {\n                std::cout << \"prediction result: \\n\";\n                auto labels = loadImagenetLabelMap(LABELS_PATH);\n                int _top = 0;\n                for (auto& [idx, logits] : topk(prob[0], 3)) {\n                    std::cout << \"Top: \" << _top++ << \" idx: \" << idx << \", logits: \" << logits\n                              << \", label: \" << labels[idx] << \"\\n\";\n                }\n                std::cout << profiler << \"\\n\";\n            }\n        }\n        return 0;\n    }\n    return 0;\n}\n"
  },
  {
    "path": "yolo11/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(yolov11)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nset(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)\nenable_language(CUDA)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\ninclude_directories(${PROJECT_SOURCE_DIR}/plugin)\n\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\nif(CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n  message(\"embed_platform on\")\n  include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n  link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n  message(\"embed_platform off\")\n\n  # cuda\n  include_directories(/usr/local/cuda/include)\n  link_directories(/usr/local/cuda/lib64)\n\n  # tensorrt\n  include_directories(/workspace/shared/TensorRT-8.6.1.6/include)\n  link_directories(/workspace/shared/TensorRT-8.6.1.6/lib)\nendif()\n\nadd_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/plugin/yololayer.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nfile(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/src/*.cpp ${PROJECT_SOURCE_DIR}/src/*.cu)\n\nadd_executable(yolo11_det ${PROJECT_SOURCE_DIR}/yolo11_det.cpp ${SRCS})\ntarget_link_libraries(yolo11_det nvinfer)\ntarget_link_libraries(yolo11_det cudart)\ntarget_link_libraries(yolo11_det myplugins)\ntarget_link_libraries(yolo11_det ${OpenCV_LIBS})\n\nadd_executable(yolo11_cls ${PROJECT_SOURCE_DIR}/yolo11_cls.cpp ${SRCS})\ntarget_link_libraries(yolo11_cls nvinfer)\ntarget_link_libraries(yolo11_cls cudart)\ntarget_link_libraries(yolo11_cls myplugins)\ntarget_link_libraries(yolo11_cls ${OpenCV_LIBS})\n\nadd_executable(yolo11_seg ${PROJECT_SOURCE_DIR}/yolo11_seg.cpp ${SRCS})\ntarget_link_libraries(yolo11_seg nvinfer)\ntarget_link_libraries(yolo11_seg cudart)\ntarget_link_libraries(yolo11_seg 
myplugins)\ntarget_link_libraries(yolo11_seg ${OpenCV_LIBS})\n\nadd_executable(yolo11_pose ${PROJECT_SOURCE_DIR}/yolo11_pose.cpp ${SRCS})\ntarget_link_libraries(yolo11_pose nvinfer)\ntarget_link_libraries(yolo11_pose cudart)\ntarget_link_libraries(yolo11_pose myplugins)\ntarget_link_libraries(yolo11_pose ${OpenCV_LIBS})\n\nadd_executable(yolo11_obb ${PROJECT_SOURCE_DIR}/yolo11_obb.cpp ${SRCS})\ntarget_link_libraries(yolo11_obb nvinfer)\ntarget_link_libraries(yolo11_obb cudart)\ntarget_link_libraries(yolo11_obb myplugins)\ntarget_link_libraries(yolo11_obb ${OpenCV_LIBS})\n"
  },
  {
    "path": "yolo11/gen_wts.py",
    "content": "import sys  # noqa: F401\nimport argparse\nimport os\nimport struct\nimport torch\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\n    parser.add_argument('-w', '--weights', required=True,\n                        help='Input weights (.pt) file path (required)')\n    parser.add_argument(\n        '-o', '--output', help='Output (.wts) file path (optional)')\n    parser.add_argument(\n        '-t', '--type', type=str, default='detect', choices=['detect', 'cls', 'seg', 'pose', 'obb'],\n        help='determines the model is detection/classification')\n    args = parser.parse_args()\n    if not os.path.isfile(args.weights):\n        raise SystemExit('Invalid input file')\n    if not args.output:\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\n    elif os.path.isdir(args.output):\n        args.output = os.path.join(\n            args.output,\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\n    return args.weights, args.output, args.type\n\n\npt_file, wts_file, m_type = parse_args()\n\nprint(f'Generating .wts for {m_type} model')\n\n# Load model\nprint(f'Loading {pt_file}')\n\n# Initialize\ndevice = 'cpu'\n\n# Load model\nmodel = torch.load(pt_file, map_location=device, weights_only=False)  # Load FP32 weights\nmodel = model['ema' if model.get('ema') else 'model'].float()\n\nif m_type in ['detect', 'seg', 'pose', 'obb']:\n    anchor_grid = model.model[-1].anchors * model.model[-1].stride[..., None, None]\n\n    delattr(model.model[-1], 'anchors')\n\nmodel.to(device).eval()\n\nwith open(wts_file, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolo11/include/block.h",
    "content": "#pragma once\n\n#include <map>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file);\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps);\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, std::vector<int> k, int s, std::string lname);\n\nnvinfer1::IElementWiseLayer* C2F(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                 int c2, int n, bool shortcut, float e, std::string lname);\n\nnvinfer1::IElementWiseLayer* C2(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                int c2, int n, bool shortcut, float e, std::string lname);\n\nnvinfer1::IElementWiseLayer* SPPF(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int k, std::string lname);\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname);\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> 
dets, const int* px_arry,\n                                       int px_arry_num, bool is_segmentation, bool is_pose, bool is_obb);\n\nnvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int n, bool c3k, bool shortcut, float e, std::string lname);\n\nnvinfer1::ILayer* C2PSA(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap,\n                        nvinfer1::ITensor& input, int c1, int c2, int n, float e, std::string lname);\n\nnvinfer1::ILayer* DWConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname);\n"
  },
  {
    "path": "yolo11/include/calibrator.h",
    "content": "#ifndef ENTROPY_CALIBRATOR_H\n#define ENTROPY_CALIBRATOR_H\n\n#include <NvInfer.h>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2 {\n   public:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name,\n                           const char* input_blob_name, bool read_cache = true);\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\n   private:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\n#endif  // ENTROPY_CALIBRATOR_H\n"
  },
  {
    "path": "yolo11/include/config.h",
    "content": "#define USE_FP16\n// #define USE_FP32\n// #define USE_INT8\n\nconst static char* kInputTensorName = \"images\";\nconst static char* kOutputTensorName = \"output\";\nconst static char* kProtoTensorName = \"proto\";\nconst static int kNumClass = 80;\nconst static int kPoseNumClass = 1;\nconst static int kNumberOfPoints = 17;  // number of keypoints total\n// obb model's number of classes\nconstexpr static int kObbNumClass = 15;\nconst static int kObbNe = 1;  // number of extra parameters\nconst static int kBatchSize = 1;\nconst static int kGpuId = 0;\nconst static int kInputH = 640;\nconst static int kInputW = 640;\nconst static int kObbInputH = 1024;\nconst static int kObbInputW = 1024;\nconst static float kNmsThresh = 0.45f;\nconst static float kConfThresh = 0.5f;\nconst static float kConfThreshKeypoints = 0.5f;  // keypoints confidence\nconst static int kMaxInputImageSize = 3000 * 3000;\nconst static int kMaxNumOutputBbox = 1000;\n//Quantization input image folder path\nconst static char* kInputQuantizationFolder = \"./coco_calib\";\n\n// Classfication model's number of classes\nconstexpr static int kClsNumClass = 1000;\n// Classfication model's input shape\nconstexpr static int kClsInputH = 224;\nconstexpr static int kClsInputW = 224;\n"
  },
  {
    "path": "yolo11/include/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n"
  },
  {
    "path": "yolo11/include/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n 
   virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0) {\n                ss << \" \";\n            }\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//!        (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolo11/include/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include \"NvInfer.h\"\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "yolo11/include/model.h",
    "content": "#pragma once\n\n#include <assert.h>\n#include <string>\n#include \"NvInfer.h\"\n\nnvinfer1::IHostMemory* buildEngineYolo11Cls(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            std::string& type, int max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolo11Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type);\n\nnvinfer1::IHostMemory* buildEngineYolo11Seg(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type);\n\nnvinfer1::IHostMemory* buildEngineYolo11Pose(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             int& max_channels, std::string& type);\n\nnvinfer1::IHostMemory* buildEngineYolo11Obb(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type);\n"
  },
  {
    "path": "yolo11/include/postprocess.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\n// Preprocessing functions\ncv::Rect get_rect(cv::Mat& img, float bbox[4]);\n\n// Processing functions\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch);\n\nvoid batch_process_obb(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                       int bbox_element, const std::vector<cv::Mat>& img_batch);\n\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count);\n\nvoid process_decode_ptr_host_obb(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element,\n                                 cv::Mat& img, int count);\n\n// NMS functions\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh = 0.5);\n\nvoid batch_nms(std::vector<std::vector<Detection>>& batch_res, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh = 0.5);\n\nvoid nms_obb(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh = 0.5);\n\nvoid batch_nms_obb(std::vector<std::vector<Detection>>& batch_res, float* output, int batch_size, int output_size,\n                   float conf_thresh, float nms_thresh = 0.5);\n\n// CUDA-related functions\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 cudaStream_t stream);\n\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream);\n\nvoid cuda_decode_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                     cudaStream_t stream);\n\nvoid cuda_nms_obb(float* parray, float 
nms_threshold, int max_objects, cudaStream_t stream);\n\n// Drawing functions\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_bbox_obb(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_bbox_keypoints_line(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks,\n                    std::unordered_map<int, std::string>& labels_map);\n"
  },
  {
    "path": "yolo11/include/preprocess.h",
    "content": "#pragma once\n\n#include <cstdint>\n#include <map>\n#include <opencv2/opencv.hpp>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\nvoid cuda_preprocess_init(int max_image_size);\n\nvoid cuda_preprocess_destroy();\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream);\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream);\n"
  },
  {
    "path": "yolo11/include/types.h",
    "content": "#pragma once\n#include \"config.h\"\n\nstruct alignas(float) Detection {\n    //center_x center_y w h\n    float bbox[4];\n    float conf;  // bbox_conf * cls_conf\n    float class_id;\n    float mask[32];\n    float keypoints[kNumberOfPoints * 3];  // 17*3 keypoints\n    float angle;                           // obb angle\n};\n\nstruct AffineMatrix {\n    float value[6];\n};\n\nconst int bbox_element =\n        sizeof(AffineMatrix) / sizeof(float) + 1;  // left, top, right, bottom, confidence, class, keepflag\n"
  },
  {
    "path": "yolo11/include/utils.h",
    "content": "#pragma once\n#include <dirent.h>\n#include <cstring>\n#include <fstream>\n#include <opencv2/opencv.hpp>\n#include <sstream>\n#include <string>\n#include <unordered_map>\n#include <vector>\n\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols * 1.0);\n    float r_h = input_h / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR* p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            //            std::cout << \"Found file: \" << cur_file_name << std::endl;\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n// Trims leading and trailing spaces (' ' only; tabs and newlines are kept) from a string\nstatic inline std::string trim_leading_whitespace(const std::string& str) {\n    size_t first = str.find_first_not_of(' ');\n    if (std::string::npos == first) {\n        return str;\n    }\n    size_t last = str.find_last_not_of(' ');\n    return str.substr(first, (last - first + 1));\n}\n\n// Src: https://stackoverflow.com/questions/16605967\nstatic inline std::string 
to_string_with_precision(const float a_value, const int n = 2) {\n    std::ostringstream out;\n    out.precision(n);\n    out << std::fixed << a_value;\n    return out.str();\n}\n\nstatic inline int read_labels(const std::string& labels_filename, std::unordered_map<int, std::string>& labels_map) {\n    std::ifstream file(labels_filename);\n    // Report failure instead of silently returning an empty map\n    if (!file.is_open()) {\n        return -1;\n    }\n    // Read each line of the file\n    std::string line;\n    int index = 0;\n    while (std::getline(file, line)) {\n        // Strip leading and trailing spaces from the line\n        line = trim_leading_whitespace(line);\n\n        // Add the stripped line to the labels_map, using the zero-based line index as the key\n        labels_map[index] = line;\n        index++;\n    }\n    // Close the file\n    file.close();\n\n    return 0;\n}\n"
  },
  {
    "path": "yolo11/plugin/yololayer.cu",
    "content": "#include <assert.h>\n#include <math.h>\n#include <iostream>\n#include <vector>\n#include \"cuda_utils.h\"\n#include \"types.h\"\n#include \"yololayer.h\"\n\nnamespace Tn {\ntemplate <typename T>\nvoid write(char*& buffer, const T& val) {\n    *reinterpret_cast<T*>(buffer) = val;\n    buffer += sizeof(T);\n}\n\ntemplate <typename T>\nvoid read(const char*& buffer, T& val) {\n    val = *reinterpret_cast<const T*>(buffer);\n    buffer += sizeof(T);\n}\n}  // namespace Tn\n\n__device__ float sigmoid(float x) {\n    return 1.0f / (1.0f + exp(-x));\n}\n\nnamespace nvinfer1 {\nYoloLayerPlugin::YoloLayerPlugin(int classCount, int numberofpoints, float confthreshkeypoints, int netWidth,\n                                 int netHeight, int maxOut, bool is_segmentation, bool is_pose, bool is_obb,\n                                 const int* strides, int stridesLength) {\n\n    mClassCount = classCount;\n    mNumberofpoints = numberofpoints;\n    mConfthreshkeypoints = confthreshkeypoints;\n    mYoloV8NetWidth = netWidth;\n    mYoloV8netHeight = netHeight;\n    mMaxOutObject = maxOut;\n    mStridesLength = stridesLength;\n    mStrides = new int[stridesLength];\n    memcpy(mStrides, strides, stridesLength * sizeof(int));\n    is_segmentation_ = is_segmentation;\n    is_pose_ = is_pose;\n    is_obb_ = is_obb;\n}\n\nYoloLayerPlugin::~YoloLayerPlugin() {\n    if (mStrides != nullptr) {\n        delete[] mStrides;\n        mStrides = nullptr;\n    }\n}\n\nYoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length) {\n    using namespace Tn;\n    const char *d = reinterpret_cast<const char*>(data), *a = d;\n    read(d, mClassCount);\n    read(d, mNumberofpoints);\n    read(d, mConfthreshkeypoints);\n    read(d, mThreadCount);\n    read(d, mYoloV8NetWidth);\n    read(d, mYoloV8netHeight);\n    read(d, mMaxOutObject);\n    read(d, mStridesLength);\n    mStrides = new int[mStridesLength];\n    for (int i = 0; i < mStridesLength; ++i) {\n        read(d, 
mStrides[i]);\n    }\n    read(d, is_segmentation_);\n    read(d, is_pose_);\n    read(d, is_obb_);\n\n    assert(d == a + length);\n}\n\nvoid YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT {\n\n    using namespace Tn;\n    char *d = static_cast<char*>(buffer), *a = d;\n    write(d, mClassCount);\n    write(d, mNumberofpoints);\n    write(d, mConfthreshkeypoints);\n    write(d, mThreadCount);\n    write(d, mYoloV8NetWidth);\n    write(d, mYoloV8netHeight);\n    write(d, mMaxOutObject);\n    write(d, mStridesLength);\n    for (int i = 0; i < mStridesLength; ++i) {\n        write(d, mStrides[i]);\n    }\n    write(d, is_segmentation_);\n    write(d, is_pose_);\n    write(d, is_obb_);\n\n    assert(d == a + getSerializationSize());\n}\n\nsize_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT {\n    return sizeof(mClassCount) + sizeof(mNumberofpoints) + sizeof(mConfthreshkeypoints) + sizeof(mThreadCount) +\n           sizeof(mYoloV8netHeight) + sizeof(mYoloV8NetWidth) + sizeof(mMaxOutObject) + sizeof(mStridesLength) +\n           sizeof(int) * mStridesLength + sizeof(is_segmentation_) + sizeof(is_pose_) + sizeof(is_obb_);\n}\n\nint YoloLayerPlugin::initialize() TRT_NOEXCEPT {\n    return 0;\n}\n\nnvinfer1::Dims YoloLayerPlugin::getOutputDimensions(int index, const nvinfer1::Dims* inputs,\n                                                    int nbInputDims) TRT_NOEXCEPT {\n    int total_size = mMaxOutObject * sizeof(Detection) / sizeof(float);\n    return nvinfer1::Dims3(total_size + 1, 1, 1);\n}\n\nvoid YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT {\n    mPluginNamespace = pluginNamespace;\n}\n\nconst char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT {\n    return mPluginNamespace;\n}\n\nnvinfer1::DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes,\n                                                      int nbInputs) const TRT_NOEXCEPT {\n    return 
nvinfer1::DataType::kFLOAT;\n}\n\nbool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                                   int nbInputs) const TRT_NOEXCEPT {\n\n    return false;\n}\n\nbool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT {\n\n    return false;\n}\n\nvoid YoloLayerPlugin::configurePlugin(nvinfer1::PluginTensorDesc const* in, int nbInput,\n                                      nvinfer1::PluginTensorDesc const* out, int nbOutput) TRT_NOEXCEPT {}\n\nvoid YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                                      IGpuAllocator* gpuAllocator) TRT_NOEXCEPT {}\n\nvoid YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\nconst char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT {\n\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nvoid YoloLayerPlugin::destroy() TRT_NOEXCEPT {\n    delete this;\n}\n\nnvinfer1::IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT {\n\n    YoloLayerPlugin* p =\n            new YoloLayerPlugin(mClassCount, mNumberofpoints, mConfthreshkeypoints, mYoloV8NetWidth, mYoloV8netHeight,\n                                mMaxOutObject, is_segmentation_, is_pose_, is_obb_, mStrides, mStridesLength);\n    p->setPluginNamespace(mPluginNamespace);\n    return p;\n}\n\n// The parameter qualifiers must match the declaration in yololayer.h (inputs are always const;\n// TRT_CONST_ENQUEUE applies to outputs), otherwise the definition does not match when NV_TENSORRT_MAJOR < 8.\nint YoloLayerPlugin::enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs,\n                             void* workspace, cudaStream_t stream) TRT_NOEXCEPT {\n    forwardGpu((const float* const*)inputs, (float*)outputs[0], stream, mYoloV8netHeight, mYoloV8NetWidth, batchSize);\n    return 0;\n}\n\n__device__ float Logist(float data) {\n    return 1.0f / (1.0f + expf(-data));\n}\n\n__global__ void CalDetection(const float* input, float* output, int numElements, int maxoutobject, const int 
grid_h,\n                             int grid_w, const int stride, int classes, int nk, float confkeypoints, int outputElem,\n                             bool is_segmentation, bool is_pose, bool is_obb) {\n    int idx = threadIdx.x + blockDim.x * blockIdx.x;\n    if (idx >= numElements)\n        return;\n\n    const int N_kpts = nk;\n    int total_grid = grid_h * grid_w;\n    int info_len = 4 + classes + (is_segmentation ? 32 : 0) + (is_pose ? N_kpts * 3 : 0) + (is_obb ? 1 : 0);\n    int batchIdx = idx / total_grid;\n    int elemIdx = idx % total_grid;\n    const float* curInput = input + batchIdx * total_grid * info_len;\n    int outputIdx = batchIdx * outputElem;\n\n    int class_id = 0;\n    float max_cls_prob = 0.0;\n    for (int i = 4; i < 4 + classes; i++) {\n        float p = Logist(curInput[elemIdx + i * total_grid]);\n        if (p > max_cls_prob) {\n            max_cls_prob = p;\n            class_id = i - 4;\n        }\n    }\n\n    if (max_cls_prob < 0.1)\n        return;\n\n    int count = (int)atomicAdd(output + outputIdx, 1);\n    if (count >= maxoutobject)\n        return;\n    char* data = (char*)(output + outputIdx) + sizeof(float) + count * sizeof(Detection);\n    Detection* det = (Detection*)(data);\n\n    int row = elemIdx / grid_w;\n    int col = elemIdx % grid_w;\n\n    det->conf = max_cls_prob;\n    det->class_id = class_id;\n    det->bbox[0] = (col + 0.5f - curInput[elemIdx + 0 * total_grid]) * stride;\n    det->bbox[1] = (row + 0.5f - curInput[elemIdx + 1 * total_grid]) * stride;\n    det->bbox[2] = (col + 0.5f + curInput[elemIdx + 2 * total_grid]) * stride;\n    det->bbox[3] = (row + 0.5f + curInput[elemIdx + 3 * total_grid]) * stride;\n\n    if (is_segmentation) {\n        for (int k = 0; k < 32; ++k) {\n            det->mask[k] =\n                    curInput[elemIdx + (4 + classes + (is_pose ? N_kpts * 3 : 0) + (is_obb ? 
1 : 0) + k) * total_grid];\n        }\n    }\n\n    if (is_pose) {\n        for (int kpt = 0; kpt < N_kpts; kpt++) {\n            int kpt_x_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3) * total_grid;\n            int kpt_y_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3 + 1) * total_grid;\n            int kpt_conf_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3 + 2) * total_grid;\n\n            float kpt_confidence = sigmoid(curInput[elemIdx + kpt_conf_idx]);\n\n            float kpt_x = (curInput[elemIdx + kpt_x_idx] * 2.0 + col) * stride;\n            float kpt_y = (curInput[elemIdx + kpt_y_idx] * 2.0 + row) * stride;\n\n            bool is_within_bbox =\n                    kpt_x >= det->bbox[0] && kpt_x <= det->bbox[2] && kpt_y >= det->bbox[1] && kpt_y <= det->bbox[3];\n\n            if (kpt_confidence < confkeypoints || !is_within_bbox) {\n                det->keypoints[kpt * 3] = -1;\n                det->keypoints[kpt * 3 + 1] = -1;\n                det->keypoints[kpt * 3 + 2] = -1;\n            } else {\n                det->keypoints[kpt * 3] = kpt_x;\n                det->keypoints[kpt * 3 + 1] = kpt_y;\n                det->keypoints[kpt * 3 + 2] = kpt_confidence;\n            }\n        }\n    }\n\n    if (is_obb) {\n        double pi = CV_PI;\n        auto angle_inx = curInput[elemIdx + (4 + classes + (is_segmentation ? 32 : 0) + (is_pose ? 
N_kpts * 3 : 0) +\n                                             0) * total_grid];\n        auto angle = (sigmoid(angle_inx) - 0.25f) * pi;\n\n        auto cos1 = cos(angle);\n        auto sin1 = sin(angle);\n        auto xf = (curInput[elemIdx + 2 * total_grid] - curInput[elemIdx + 0 * total_grid]) / 2;\n        auto yf = (curInput[elemIdx + 3 * total_grid] - curInput[elemIdx + 1 * total_grid]) / 2;\n\n        auto x = xf * cos1 - yf * sin1;\n        auto y = xf * sin1 + yf * cos1;\n\n        float cx = (col + 0.5f + x) * stride;\n        float cy = (row + 0.5f + y) * stride;\n\n        float w1 = (curInput[elemIdx + 0 * total_grid] + curInput[elemIdx + 2 * total_grid]) * stride;\n        float h1 = (curInput[elemIdx + 1 * total_grid] + curInput[elemIdx + 3 * total_grid]) * stride;\n        det->bbox[0] = cx;\n        det->bbox[1] = cy;\n        det->bbox[2] = w1;\n        det->bbox[3] = h1;\n        det->angle = angle;\n    }\n}\n\nvoid YoloLayerPlugin::forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV8netHeight,\n                                 int mYoloV8NetWidth, int batchSize) {\n    int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n    cudaMemsetAsync(output, 0, sizeof(float), stream);\n    for (int idx = 0; idx < batchSize; ++idx) {\n        CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n    }\n    int numElem = 0;\n\n    //    const int maxGrids = mStridesLength;\n    //    int grids[maxGrids][2];\n    //    for (int i = 0; i < maxGrids; ++i) {\n    //        grids[i][0] = mYoloV8netHeight / mStrides[i];\n    //        grids[i][1] = mYoloV8NetWidth / mStrides[i];\n    //    }\n\n    int maxGrids = mStridesLength;\n    int flatGridsLen = 2 * maxGrids;\n    int* flatGrids = new int[flatGridsLen];\n\n    for (int i = 0; i < maxGrids; ++i) {\n        flatGrids[2 * i] = mYoloV8netHeight / mStrides[i];\n        flatGrids[2 * i + 1] = mYoloV8NetWidth / mStrides[i];\n    
}\n\n    for (int i = 0; i < maxGrids; i++) {\n        // Access the elements of the original 2D array from the flattened 1D array\n        int grid_h = flatGrids[2 * i];      // Corresponds to the access of grids[i][0]\n        int grid_w = flatGrids[2 * i + 1];  // Corresponds to the access of grids[i][1]\n        int stride = mStrides[i];\n        numElem = grid_h * grid_w * batchSize;  // Calculate the total number of elements\n        if (numElem <= 0) {                     // Nothing to launch; also avoids a zero-sized block\n            continue;\n        }\n        // Clamp the block size for this launch only, so the serialized mThreadCount member is not overwritten\n        int threadCount = numElem < mThreadCount ? numElem : mThreadCount;\n\n        CalDetection<<<(numElem + threadCount - 1) / threadCount, threadCount, 0, stream>>>(\n                inputs[i], output, numElem, mMaxOutObject, grid_h, grid_w, stride, mClassCount, mNumberofpoints,\n                mConfthreshkeypoints, outputElem, is_segmentation_, is_pose_, is_obb_);\n    }\n\n    delete[] flatGrids;\n}\n\nPluginFieldCollection YoloPluginCreator::mFC{};\nstd::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\nYoloPluginCreator::YoloPluginCreator() {\n    mPluginAttributes.clear();\n    mFC.nbFields = mPluginAttributes.size();\n    mFC.fields = mPluginAttributes.data();\n}\n\nconst char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT {\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nconst PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT {\n    return &mFC;\n}\n\nIPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT {\n    assert(fc->nbFields == 1);\n    assert(strcmp(fc->fields[0].name, \"combinedInfo\") == 0);\n    const int* combinedInfo = static_cast<const int*>(fc->fields[0].data);\n    int netinfo_count = 9;\n    int class_count = combinedInfo[0];\n    int numberofpoints = combinedInfo[1];\n    float confthreshkeypoints = combinedInfo[2];\n    
int input_w = combinedInfo[3];\n    int input_h = combinedInfo[4];\n    int max_output_object_count = combinedInfo[5];\n    bool is_segmentation = combinedInfo[6];\n    bool is_pose = combinedInfo[7];\n    bool is_obb = combinedInfo[8];\n    const int* px_arry = combinedInfo + netinfo_count;\n    int px_arry_length = fc->fields[0].length - netinfo_count;\n    YoloLayerPlugin* obj =\n            new YoloLayerPlugin(class_count, numberofpoints, confthreshkeypoints, input_w, input_h,\n                                max_output_object_count, is_segmentation, is_pose, is_obb, px_arry, px_arry_length);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\nIPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData,\n                                                     size_t serialLength) TRT_NOEXCEPT {\n    // This object will be deleted when the network is destroyed, which will\n    // call YoloLayerPlugin::destroy()\n    YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolo11/plugin/yololayer.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\nnamespace nvinfer1 {\nclass API YoloLayerPlugin : public IPluginV2IOExt {\n   public:\n    YoloLayerPlugin(int classCount, int numberofpoints, float confthreshkeypoints, int netWidth, int netHeight,\n                    int maxOut, bool is_segmentation, bool is_pose, bool is_obb, const int* strides, int stridesLength);\n\n    YoloLayerPlugin(const void* data, size_t length);\n\n    ~YoloLayerPlugin();\n\n    int getNbOutputs() const TRT_NOEXCEPT override { return 1; }\n\n    nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n    int initialize() TRT_NOEXCEPT override;\n\n    virtual void terminate() TRT_NOEXCEPT override {}\n\n    virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n    virtual int enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace,\n                        cudaStream_t stream) TRT_NOEXCEPT override;\n\n    virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n    virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n    bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs,\n                                   int nbOutputs) const TRT_NOEXCEPT override {\n        return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n    }\n\n    const char* getPluginType() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    void destroy() TRT_NOEXCEPT override;\n\n    IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n    nvinfer1::DataType 
getOutputDataType(int32_t index, nvinfer1::DataType const* inputTypes,\n                                         int32_t nbInputs) const TRT_NOEXCEPT;\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                      int nbInputs) const TRT_NOEXCEPT override;\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n    void attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                         IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n    void configurePlugin(PluginTensorDesc const* in, int32_t nbInput, PluginTensorDesc const* out,\n                         int32_t nbOutput) TRT_NOEXCEPT override;\n\n    void detachFromContext() TRT_NOEXCEPT override;\n\n   private:\n    void forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV8netHeight,\n                    int mYoloV8NetWidth, int batchSize);\n\n    int mThreadCount = 256;\n    const char* mPluginNamespace;\n    int mClassCount;\n    int mNumberofpoints;\n    float mConfthreshkeypoints;\n    int mYoloV8NetWidth;\n    int mYoloV8netHeight;\n    int mMaxOutObject;\n    bool is_segmentation_;\n    bool is_pose_;\n    bool is_obb_;\n    int* mStrides;\n    int mStridesLength;\n};\n\nclass API YoloPluginCreator : public IPluginCreator {\n   public:\n    YoloPluginCreator();\n\n    ~YoloPluginCreator() override = default;\n\n    const char* getPluginName() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    const nvinfer1::PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* createPlugin(const char* name,\n                                           const nvinfer1::PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData,\n                                                size_t 
serialLength) TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override { mNamespace = libNamespace; }\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override { return mNamespace.c_str(); }\n\n   private:\n    std::string mNamespace;\n    static PluginFieldCollection mFC;\n    static std::vector<PluginField> mPluginAttributes;\n};\n\nREGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolo11/readme.md",
    "content": "## Introduction\r\n\r\nYolo11 model supports TensorRT-8.\r\n\r\nTraining code [link](https://github.com/ultralytics/ultralytics/archive/refs/tags/v8.3.38.zip)\r\n\r\n## Environment\r\n\r\n* cuda 11.8\r\n* cudnn 8.9.1.23\r\n* tensorrt 8.6.1.6\r\n* opencv 4.8.0\r\n* ultralytics 8.3.0\r\n\r\n## Support\r\n\r\n* [x] YOLO11-det support FP32/FP16/INT8 and Python/C++ API\r\n* [x] YOLO11-cls support FP32/FP16/INT8 and Python/C++ API\r\n* [x] YOLO11-seg support FP32/FP16/INT8 and Python/C++ API\r\n* [x] YOLO11-pose support FP32/FP16/INT8 and Python/C++ API\r\n* [x] YOLO11-obb support FP32/FP16/INT8 and Python/C++ API\r\n\r\n## Config\r\n\r\n* Choose the YOLO11 sub-model n/s/m/l/x from command line arguments.\r\n* Other configs please check [src/config.h](src/config.h)\r\n\r\n## Build and Run\r\n\r\n1. generate .wts from pytorch with .pt, or download .wts from model zoo\r\n\r\n```shell\r\n# Download ultralytics\r\nwget https://github.com/ultralytics/ultralytics/archive/refs/tags/v8.3.0.zip -O ultralytics-8.3.0.zip\r\n# Unzip ultralytics\r\nunzip ultralytics-8.3.0.zip\r\ncd ultralytics-8.3.0\r\n# Download models\r\nwget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11n.pt -O yolo11n.pt\r\nwget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11n-cls.pt -O yolo11n-cls.pt\r\nwget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11n-seg.pt -O yolo11n-seg.pt\r\nwget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11n-pose.pt -O yolo11n-pose.pt\r\nwget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11n-obb.pt -O yolo11n-obb.pt\r\n# Generate .wts\r\ncp [PATH-TO-TENSORRTX]/yolo11/gen_wts.py .\r\npython gen_wts.py -w yolo11n.pt -o yolo11n.wts -t detect\r\npython gen_wts.py -w yolo11n-cls.pt -o yolo11n-cls.wts -t cls\r\npython gen_wts.py -w yolo11n-seg.pt -o yolo11n-seg.wts -t seg\r\npython gen_wts.py -w yolo11n-pose.pt -o yolo11n-pose.wts -t pose\r\npython gen_wts.py -w 
yolo11n-obb.pt -o yolo11n-obb.wts -t obb\r\n# One .wts file is generated per model, e.g. 'yolo11n.wts'.\r\n```\r\n\r\n2. build tensorrtx/yolo11 and run\r\n```shell\r\ncd [PATH-TO-TENSORRTX]/yolo11\r\nmkdir build\r\ncd build\r\ncmake ..\r\nmake\r\n```\r\n\r\n### Detection\r\n```shell\r\ncp [PATH-TO-ultralytics]/yolo11n.wts .\r\n# Build and serialize TensorRT engine\r\n./yolo11_det -s yolo11n.wts yolo11n.engine [n/s/m/l/x]\r\n# Run inference\r\n./yolo11_det -d yolo11n.engine ../images [c/g]\r\n# Results are saved in the build directory\r\n```\r\n\r\n### Classification\r\n```shell\r\ncp [PATH-TO-ultralytics]/yolo11n-cls.wts .\r\n# Build and serialize TensorRT engine\r\n./yolo11_cls -s yolo11n-cls.wts yolo11n-cls.engine [n/s/m/l/x]\r\n# Download ImageNet labels (use the raw file URL; the /blob/ page URL returns HTML)\r\nwget https://raw.githubusercontent.com/joannzhang00/ImageNet-dataset-classes-labels/main/imagenet_classes.txt\r\n# Run inference\r\n./yolo11_cls -d yolo11n-cls.engine ../images\r\n```\r\n\r\n### Segmentation\r\n```shell\r\ncp [PATH-TO-ultralytics]/yolo11n-seg.wts .\r\n# Build and serialize TensorRT engine\r\n./yolo11_seg -s yolo11n-seg.wts yolo11n-seg.engine [n/s/m/l/x]\r\n# Download the labels file\r\nwget -O coco.txt https://raw.githubusercontent.com/amikelive/coco-labels/master/coco-labels-2014_2017.txt\r\n# Run inference\r\n./yolo11_seg -d yolo11n-seg.engine ../images c coco.txt\r\n```\r\n\r\n### Pose\r\n```shell\r\ncp [PATH-TO-ultralytics]/yolo11n-pose.wts .\r\n# Build and serialize TensorRT engine\r\n./yolo11_pose -s yolo11n-pose.wts yolo11n-pose.engine [n/s/m/l/x]\r\n# Run inference\r\n./yolo11_pose -d yolo11n-pose.engine ../images\r\n```\r\n\r\n### Obb\r\n```shell\r\ncp [PATH-TO-ultralytics]/yolo11n-obb.wts .\r\n# Build and serialize TensorRT engine\r\n./yolo11_obb -s yolo11n-obb.wts yolo11n-obb.engine [n/s/m/l/x]\r\n# Download the image\r\nwget -O P0015.png https://github.com/mpj1234/YOLO11-series-TensorRT8/releases/download/images/P0015.png\r\nmv P0015.png ../images\r\n# Run inference\r\n./yolo11_obb -d yolo11n-obb.engine 
../images\r\n```\r\n\r\n3. Optional: load and run the TensorRT model in Python\r\n```shell\r\n# Install python-tensorrt, pycuda, etc.\r\n# Make sure yolo11n.engine has been built\r\npython yolo11_det_trt.py ./build/yolo11n.engine ./build/libmyplugins.so\r\n# FAQ: pycuda._driver.LogicError on Windows, or Segmentation fault on Linux?\r\n# Add the following imports to the .py file:\r\n# import pycuda.autoinit\r\n# import pycuda.driver as cuda\r\n```\r\n\r\n## INT8 Quantization\r\n1. Prepare calibration images. You can randomly select about 1000 images from your training set. For COCO, you can also download my calibration images `coco_calib` from [GoogleDrive](https://drive.google.com/drive/folders/1s7jE9DtOngZMzJC1uL307J2MiaGwdRSI?usp=sharing) or [BaiduPan](https://pan.baidu.com/s/1GOm_-JobpyLMAqZWCDUhKg) pwd: a9wh\r\n2. Unzip it in yolo11/build\r\n3. Set the macro `USE_INT8` in src/config.h and run make again\r\n4. Serialize the model and test\r\n\r\n## More Information\r\nSee the readme on the [home page](https://github.com/wang-xinyu/tensorrtx).\r\n"
  },
  {
    "path": "yolo11/src/block.cpp",
    "content": "#include \"block.h\"\n#include <assert.h>\n#include <math.h>\n#include <fstream>\n#include <iostream>\n#include \"config.h\"\n#include \"model.h\"\n#include \"yololayer.h\"\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, nvinfer1::Weights> WeightMap;\n\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = nvinfer1::DataType::kFLOAT;\n\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; x++) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        wt.count = size;\n        WeightMap[name] = wt;\n    }\n    return WeightMap;\n}\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights 
scale{nvinfer1::DataType::kFLOAT, scval, len};\n\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, shval, len};\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, pval, len};\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    nvinfer1::IScaleLayer* output = network->addScale(input, nvinfer1::ScaleMode::kCHANNEL, shift, scale, power);\n    assert(output);\n    return output;\n}\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, std::vector<int> k, int s, std::string lname) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    // auto pad\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    
assert(ew);\n    return ew;\n}\n\nstatic nvinfer1::ILayer* bottleneck(nvinfer1::INetworkDefinition* network,\n                                    std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int c1, int c2, bool shortcut, std::vector<int> k1, std::vector<int> k2, float e,\n                                    std::string lname) {\n    int c_ = (int)((float)c2 * e);\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c_, k1, 1, lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *conv1->getOutput(0), c2, k2, 1, lname + \".cv2\");\n\n    if (shortcut && c1 == c2) {\n        nvinfer1::IElementWiseLayer* ew =\n                network->addElementWise(input, *conv2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        return ew;\n    }\n    return conv2;\n}\n\nnvinfer1::IElementWiseLayer* SPPF(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int k, std::string lname) {\n    int c_ = c1 / 2;\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv1\");\n    nvinfer1::IPoolingLayer* pool1 =\n            network->addPoolingNd(*conv1->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool1->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool1->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::IPoolingLayer* pool2 =\n            network->addPoolingNd(*pool1->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool2->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::IPoolingLayer* pool3 =\n            network->addPoolingNd(*pool2->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    
pool3->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool3->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::ITensor* inputTensors[] = {conv1->getOutput(0), pool1->getOutput(0), pool2->getOutput(0),\n                                         pool3->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensors, 4);\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n    return conv2;\n}\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname) {\n\n    nvinfer1::IShuffleLayer* shuffle1 = network->addShuffle(input);\n    shuffle1->setReshapeDimensions(nvinfer1::Dims4{kBatchSize, 4, 16, grid});\n    shuffle1->setSecondTranspose(nvinfer1::Permutation{0, 2, 1, 3});\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*shuffle1->getOutput(0));\n    softmax->setAxes(1 << 1);\n\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(*softmax->getOutput(0), 1, nvinfer1::DimsHW{1, 1}, weightMap[lname], bias_empty);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n\n    nvinfer1::IShuffleLayer* shuffle2 = network->addShuffle(*conv->getOutput(0));\n    shuffle2->setReshapeDimensions(nvinfer1::Dims3{kBatchSize, 4, grid});\n\n    return shuffle2;\n}\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> dets, const int* px_arry,\n                                       int px_arry_num, bool is_segmentation, bool is_pose, bool is_obb) {\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", 
\"1\");\n    const int netinfo_count = 9;  // Assuming the first 5 elements are for netinfo as per existing code.\n    const int total_count = netinfo_count + px_arry_num;  // Total number of elements for netinfo and px_arry combined.\n\n    std::vector<int> combinedInfo(total_count);\n    int class_num = kNumClass;\n    if (is_pose)\n        class_num = kPoseNumClass;\n    else if (is_obb)\n        class_num = kObbNumClass;\n    int input_w = kInputW;\n    if (is_obb)\n        input_w = kObbInputW;\n    int input_h = kInputH;\n    if (is_obb)\n        input_h = kObbInputH;\n    // Fill in the first 5 elements as per existing netinfo.\n    combinedInfo[0] = class_num;\n    combinedInfo[1] = kNumberOfPoints;\n    combinedInfo[2] = kConfThreshKeypoints;\n    combinedInfo[3] = input_w;\n    combinedInfo[4] = input_h;\n    combinedInfo[5] = kMaxNumOutputBbox;\n    combinedInfo[6] = is_segmentation;\n    combinedInfo[7] = is_pose;\n    combinedInfo[8] = is_obb;\n\n    // Copy the contents of px_arry into the combinedInfo vector after the initial 5 elements.\n    std::copy(px_arry, px_arry + px_arry_num, combinedInfo.begin() + netinfo_count);\n\n    // Now let's create the PluginField object to hold this combined information.\n    nvinfer1::PluginField pluginField;\n    pluginField.name = \"combinedInfo\";  // This can be any name that the plugin will recognize\n    pluginField.data = combinedInfo.data();\n    pluginField.type = nvinfer1::PluginFieldType::kINT32;\n    pluginField.length = combinedInfo.size();\n\n    // Create the PluginFieldCollection to hold the PluginField object.\n    nvinfer1::PluginFieldCollection pluginFieldCollection;\n    pluginFieldCollection.nbFields = 1;  // We have just one field, but it's a combined array\n    pluginFieldCollection.fields = &pluginField;\n\n    // Create the plugin object using the PluginFieldCollection.\n    nvinfer1::IPluginV2* pluginObject = creator->createPlugin(\"yololayer\", &pluginFieldCollection);\n\n    // We assume 
that the plugin is to be added onto the network.\n    // Prepare input tensors for the YOLO Layer.\n    std::vector<nvinfer1::ITensor*> inputTensors;\n    for (auto det : dets) {\n        inputTensors.push_back(det->getOutput(0));  // Assuming each IConcatenationLayer has one output tensor.\n    }\n\n    // Add the plugin to the network using the prepared input tensors.\n    nvinfer1::IPluginV2Layer* yoloLayer = network->addPluginV2(inputTensors.data(), inputTensors.size(), *pluginObject);\n\n    return yoloLayer;  // Return the added YOLO layer.\n}\n\nstatic nvinfer1::ILayer* C3k(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int c1, int c2, int n, bool shortcut, std::vector<int> k1,\n                             std::vector<int> k2, float e, std::string lname) {\n    int c_ = (int)((float)c2 * e);\n    auto cv1 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv1\");\n    auto cv2 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv2\");\n    nvinfer1::ITensor* y1 = cv1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        auto b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, k1, k2, 1.0, lname + \".m.\" + std::to_string(i));\n        y1 = b->getOutput(0);\n    }\n\n    nvinfer1::ITensor* inputTensors[] = {y1, cv2->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 2);\n\n    auto cv3 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv3\");\n    return cv3;\n}\n\nnvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int n, bool c3k, bool shortcut, float e, std::string lname) {\n    int c_ = (float)c2 * e;\n\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, 2 * 
c_, {1, 1}, 1, lname + \".cv1\");\n    nvinfer1::Dims d = conv1->getOutput(0)->getDimensions();\n\n    nvinfer1::ISliceLayer* split1 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n                              nvinfer1::Dims4{d.d[0], d.d[1] / 2, d.d[2], d.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* split2 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, d.d[1] / 2, 0, 0},\n                              nvinfer1::Dims4{d.d[0], d.d[1] / 2, d.d[2], d.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ITensor* inputTensor0[] = {split1->getOutput(0), split2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor0, 2);\n    nvinfer1::ITensor* y1 = split2->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        nvinfer1::ILayer* b;\n        if (c3k) {\n            b = C3k(network, weightMap, *y1, c_, c_, 2, shortcut, {3, 3}, {3, 3}, 0.5,\n                    lname + \".m.\" + std::to_string(i));\n        } else {\n            b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, {3, 3}, {3, 3}, 0.5,\n                           lname + \".m.\" + std::to_string(i));\n        }\n        y1 = b->getOutput(0);\n\n        nvinfer1::ITensor* inputTensors[] = {cat->getOutput(0), b->getOutput(0)};\n        cat = network->addConcatenation(inputTensors, 2);\n    }\n\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n\n    return conv2;\n}\n\nstatic nvinfer1::ILayer* convBn(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int ch,\n                                int k, int s, std::string lname, int g = 1) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(input, 
ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    int p = k / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n    conv->setNbGroups(g);\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n    return bn;\n}\n\nstatic nvinfer1::ILayer* Attention(nvinfer1::INetworkDefinition* network,\n                                   std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                   int dim, int num_heads, float attn_ratio, std::string lname) {\n    int head_dim = dim / num_heads;\n    int key_dim = head_dim * attn_ratio;\n    float scale = pow(key_dim, -0.5);\n    int nh_kd = key_dim * num_heads;\n    int h = dim + nh_kd * 2;\n\n    auto d = input.getDimensions();\n    int B = d.d[0];\n    int H = d.d[2];\n    int W = d.d[3];\n    int N = H * W;\n    auto* qkv = convBn(network, weightMap, input, h, 1, 1, lname + \".qkv\");\n    // qkv.view(B, self.num_heads, -1, N)\n    auto shuffle = network->addShuffle(*qkv->getOutput(0));\n    shuffle->setReshapeDimensions(nvinfer1::Dims4{B, num_heads, -1, N});\n    // q, k, v = .split([self.key_dim, self.key_dim, self.head_dim], dim=2)\n    auto d1 = shuffle->getOutput(0)->getDimensions();\n    auto q = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], key_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    auto k = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, key_dim, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], key_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    auto v = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, key_dim * 2, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], head_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    // attn = 
((q.transpose(-2, -1) @ k) * self.scale)\n    auto qT = network->addShuffle(*q->getOutput(0));\n    qT->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n    auto matmul = network->addMatrixMultiply(*qT->getOutput(0), nvinfer1::MatrixOperation::kNONE, *k->getOutput(0),\n                                             nvinfer1::MatrixOperation::kNONE);\n    // There are not many memory leaks, and I will change it when I have time\n    float* scale_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    scale_val[0] = scale;\n    nvinfer1::Weights s_w{nvinfer1::DataType::kFLOAT, scale_val, 1};\n    float* shift_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    shift_val[0] = 0;\n    nvinfer1::Weights sh_w{nvinfer1::DataType::kFLOAT, shift_val, 1};\n    float* power_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    power_val[0] = 1;\n    nvinfer1::Weights p_w{nvinfer1::DataType::kFLOAT, power_val, 1};\n    nvinfer1::IScaleLayer* scaleLayer =\n            network->addScale(*matmul->getOutput(0), nvinfer1::ScaleMode::kUNIFORM, sh_w, s_w, p_w);\n    // attn = attn.softmax(dim=-1)\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*scaleLayer->getOutput(0));\n    softmax->setAxes(1 << 3);\n    // x = (v @ attn.transpose(-2, -1)).view(B, -1, H, W) + self.pe(v.reshape(B, -1, H, W))\n    auto attnT = network->addShuffle(*softmax->getOutput(0));\n    attnT->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n    auto matmul2 = network->addMatrixMultiply(*v->getOutput(0), nvinfer1::MatrixOperation::kNONE, *attnT->getOutput(0),\n                                              nvinfer1::MatrixOperation::kNONE);\n    auto reshape = network->addShuffle(*matmul2->getOutput(0));\n    reshape->setReshapeDimensions(nvinfer1::Dims4{B, -1, H, W});\n    auto v_reshape = network->addShuffle(*v->getOutput(0));\n    v_reshape->setReshapeDimensions(nvinfer1::Dims4{B, -1, H, W});\n    // self.pe = Conv(dim, dim, 3, 1, g=dim, act=False)\n    auto 
pe = convBn(network, weightMap, *v_reshape->getOutput(0), dim, 3, 1, lname + \".pe\", dim);\n    auto sum = network->addElementWise(*reshape->getOutput(0), *pe->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    // x = self.proj(x)\n    // self.proj = Conv(dim, dim, 1, act=False)\n    auto proj = convBn(network, weightMap, *sum->getOutput(0), dim, 1, 1, lname + \".proj\");\n    return proj;\n}\n\nstatic nvinfer1::ILayer* PSABlock(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int dim,\n                                  float attn_ratio, int num_heads, bool shortcut, std::string lname) {\n    // x = x + self.attn(x) if self.add else self.attn(x)\n    auto attn = Attention(network, weightMap, input, dim, num_heads, attn_ratio, lname + \".attn\");\n    nvinfer1::ILayer* shortcut_layer = nullptr;\n    if (shortcut) {\n        shortcut_layer = network->addElementWise(input, *attn->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    } else {\n        shortcut_layer = attn;\n    }\n    // self.ffn = nn.Sequential(Conv(c, c * 2, 1), Conv(c * 2, c, 1, act=False))\n    // x = x + self.ffn(x) if self.add else self.ffn(x)\n    auto ffn0 = convBnSiLU(network, weightMap, *shortcut_layer->getOutput(0), dim * 2, {1, 1}, 1, lname + \".ffn.0\");\n    auto ffn1 = convBn(network, weightMap, *ffn0->getOutput(0), dim, 1, 1, lname + \".ffn.1\");\n    if (shortcut) {\n        return network->addElementWise(*shortcut_layer->getOutput(0), *ffn1->getOutput(0),\n                                       nvinfer1::ElementWiseOperation::kSUM);\n    } else {\n        return ffn1;\n    }\n}\n\nnvinfer1::ILayer* C2PSA(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap,\n                        nvinfer1::ITensor& input, int c1, int c2, int n, float e, std::string lname) {\n    assert(network != nullptr);\n    int c = c1 * e;\n\n    // cv1 
branch\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, 2 * c, {1, 1}, 1, lname + \".cv1\");\n    nvinfer1::ITensor* cv1_out = conv1->getOutput(0);\n\n    // Split the output of cv1 into two tensors\n    nvinfer1::Dims dims = cv1_out->getDimensions();\n    nvinfer1::ISliceLayer* split1 = network->addSlice(*cv1_out, nvinfer1::Dims4{0, 0, 0, 0},\n                                                      nvinfer1::Dims4{dims.d[0], dims.d[1] / 2, dims.d[2], dims.d[3]},\n                                                      nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* split2 = network->addSlice(*cv1_out, nvinfer1::Dims4{0, dims.d[1] / 2, 0, 0},\n                                                      nvinfer1::Dims4{dims.d[0], dims.d[1] / 2, dims.d[2], dims.d[3]},\n                                                      nvinfer1::Dims4{1, 1, 1, 1});\n\n    // Pass the second split through a sequence of n PSABlocks\n    nvinfer1::ITensor* y = split2->getOutput(0);\n    for (int i = 0; i < n; ++i) {\n        auto* bottleneck_layer =\n                PSABlock(network, weightMap, *y, c, 0.5, c / 64, true, lname + \".m.\" + std::to_string(i));\n        y = bottleneck_layer->getOutput(0);  // update 'y' to be the output of the current PSABlock\n    }\n\n    // Concatenate the first split of cv1 with y\n    nvinfer1::ITensor* concatInputs[2] = {split1->getOutput(0), y};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(concatInputs, 2);\n\n    // cv2 to produce the final output\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n\n    return conv2;\n}\n\nnvinfer1::ILayer* DWConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    
nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setNbGroups(ch);\n    // auto pad\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n"
  },
  {
    "path": "yolo11/src/calibrator.cpp",
    "content": "#include \"calibrator.h\"\n#include <fstream>\n#include <iostream>\n#include <iterator>\n#include <opencv2/dnn/dnn.hpp>\n#include \"cuda_utils.h\"\n#include \"utils.h\"\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir,\n                                               const char* calib_table_name, const char* input_blob_name,\n                                               bool read_cache)\n    : batchsize_(batchsize),\n      input_w_(input_w),\n      input_h_(input_h),\n      img_idx_(0),\n      img_dir_(img_dir),\n      calib_table_name_(calib_table_name),\n      input_blob_name_(input_blob_name),\n      read_cache_(read_cache) {\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2() {\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT {\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT {\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + \"/\" + img_files_[i]);\n        if (temp.empty()) {\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(pr_img);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0),\n                                           true, false);\n    
CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT {\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good()) {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT {\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n"
  },
  {
    "path": "yolo11/src/model.cpp",
    "content": "#include <math.h>\n#include <iostream>\n\n#include \"block.h\"\n#include \"calibrator.h\"\n#include \"config.h\"\n#include \"model.h\"\n\nstatic int get_width(int x, float gw, int max_channels, int divisor = 8) {\n    auto channel = std::min(x, max_channels);\n    channel = int(ceil((channel * gw) / divisor)) * divisor;\n    return channel;\n}\n\nstatic int get_depth(int x, float gd) {\n    if (x == 1)\n        return 1;\n    int r = round(x * gd);\n    if (x * gd - int(x * gd) == 0.5 && (int(x * gd) % 2) == 0)\n        --r;\n    return std::max<int>(r, 1);\n}\n\nvoid calculateStrides(nvinfer1::IElementWiseLayer* conv_layers[], int size, int reference_size, int strides[]) {\n    for (int i = 0; i < size; ++i) {\n        nvinfer1::ILayer* layer = conv_layers[i];\n        nvinfer1::Dims dims = layer->getOutput(0)->getDimensions();\n        int feature_map_size = dims.d[2];\n        strides[i] = reference_size / feature_map_size;\n    }\n}\n\nnvinfer1::IHostMemory* buildEngineYolo11Cls(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            std::string& type, int max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    // ****************************************** YOLO11 INPUT **********************************************\n    nvinfer1::ITensor* data =\n            network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kClsInputH, kClsInputW});\n    assert(data);\n\n    // ***************************************** YOLO11 BACKBONE ********************************************\n 
   nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, *conv0->getOutput(0),\n                                                    get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\");\n    bool c3k = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k = true;\n    }\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                    get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 =\n            C3K2(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0),\n                                                    get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 =\n            C3K2(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0),\n                                                    get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv8 =\n            C3K2(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                 
get_width(1024, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.8\");\n    auto* conv9 = C2PSA(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                        get_width(1024, gw, max_channels), get_depth(2, gd), 0.5, \"model.9\");\n\n    // ********************************************* YOLO11 HEAD *********************************************\n\n    auto conv_class = convBnSiLU(network, weightMap, *conv9->getOutput(0), 1280, {1, 1}, 1, \"model.10.conv\");\n    // Global average pooling: the window covers the full spatial extent of conv_class\n    nvinfer1::Dims dims =\n            conv_class->getOutput(0)->getDimensions();  // Obtain the dimensions of the output of conv_class\n    assert(dims.nbDims == 4);  // Expect exactly 4 dimensions (batch, channels, height, width)\n\n    nvinfer1::IPoolingLayer* pool2 = network->addPoolingNd(*conv_class->getOutput(0), nvinfer1::PoolingType::kAVERAGE,\n                                                           nvinfer1::DimsHW{dims.d[2], dims.d[3]});\n    assert(pool2);\n\n    // Fully connected layer declaration\n    auto shuffle_0 = network->addShuffle(*pool2->getOutput(0));\n    shuffle_0->setReshapeDimensions(nvinfer1::Dims2{kBatchSize, 1280});\n    auto linear_weight = weightMap[\"model.10.linear.weight\"];\n    auto constant_weight = network->addConstant(nvinfer1::Dims2{kClsNumClass, 1280}, linear_weight);\n    // The bias holds kClsNumClass values, so shape it {1, kClsNumClass} and let kSUM broadcast over the batch\n    auto constant_bias =\n            network->addConstant(nvinfer1::Dims2{1, kClsNumClass}, weightMap[\"model.10.linear.bias\"]);\n    auto linear_matrix_multiply =\n            network->addMatrixMultiply(*shuffle_0->getOutput(0), nvinfer1::MatrixOperation::kNONE,\n                                       *constant_weight->getOutput(0), nvinfer1::MatrixOperation::kTRANSPOSE);\n    auto yolo = network->addElementWise(*linear_matrix_multiply->getOutput(0), *constant_bias->getOutput(0),\n                                        nvinfer1::ElementWiseOperation::kSUM);\n    assert(yolo);\n\n    // Set the name for the output 
tensor and mark it as network output\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    // Set the workspace memory pool limit (16 MiB)\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n    // Configuration according to the precision mode being used\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kClsInputW, kClsInputH, kInputQuantizationFolder,\n                                                  \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    // Begin building the engine; this may take a while\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Clean up the network definition and allocated weights\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolo11Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << 
static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n    ******************************************  YOLO11 INPUT  **********************************************\n    *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLO11 BACKBONE  ********************************************\n    *******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, *conv0->getOutput(0),\n                                                    get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\");\n    // 11233\n    bool c3k = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k = true;\n    }\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                    get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 =\n            C3K2(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(512, 
gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0),\n                                                    get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 =\n            C3K2(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0),\n                                                    get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv8 =\n            C3K2(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.8\");\n    nvinfer1::IElementWiseLayer* conv9 =\n            SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.9\");\n    auto* conv10 = C2PSA(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels),\n                         get_width(1024, gw, max_channels), get_depth(2, gd), 0.5, \"model.10\");\n    /*******************************************************************************************************\n    *********************************************  YOLO11 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*conv10->getOutput(0));\n    assert(upsample11);\n    upsample11->setResizeMode(nvinfer1::InterpolationMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n\n    
nvinfer1::ITensor* inputTensor12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensor12, 2);\n\n    nvinfer1::IElementWiseLayer* conv13 =\n            C3K2(network, weightMap, *cat12->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 = network->addResize(*conv13->getOutput(0));\n    assert(upsample14);\n    upsample14->setResizeMode(nvinfer1::InterpolationMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor15[] = {upsample14->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensor15, 2);\n\n    nvinfer1::IElementWiseLayer* conv16 =\n            C3K2(network, weightMap, *cat15->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.16\");\n\n    nvinfer1::IElementWiseLayer* conv17 = convBnSiLU(network, weightMap, *conv16->getOutput(0),\n                                                     get_width(256, gw, max_channels), {3, 3}, 2, \"model.17\");\n    nvinfer1::ITensor* inputTensor18[] = {conv17->getOutput(0), conv13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensor18, 2);\n    nvinfer1::IElementWiseLayer* conv19 =\n            C3K2(network, weightMap, *cat18->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.19\");\n\n    nvinfer1::IElementWiseLayer* conv20 = convBnSiLU(network, weightMap, *conv19->getOutput(0),\n                                                     get_width(512, gw, max_channels), {3, 3}, 2, \"model.20\");\n    nvinfer1::ITensor* inputTensor21[] = {conv20->getOutput(0), conv10->getOutput(0)};\n    
nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensor21, 2);\n    nvinfer1::IElementWiseLayer* conv22 =\n            C3K2(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.22\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLO11 OUTPUT  ******************************************\n    *******************************************************************************************************/\n    // c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels\n    int c2 = std::max(std::max(16, get_width(256, gw, max_channels) / 4), 16 * 4);\n    int c3 = std::max(get_width(256, gw, max_channels), std::min(kNumClass, 100));\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv23_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_0_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_0_2 =\n            network->addConvolutionNd(*conv23_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.0.2.weight\"], weightMap[\"model.23.cv2.0.2.bias\"]);\n    conv23_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_0_0_0 = DWConv(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), {3, 3},\n                                    1, \"model.23.cv3.0.0.0\");\n    auto* conv23_cv3_0_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_0_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.0.0.1\");\n    auto* 
conv23_cv3_0_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_0_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.0.1.0\");\n    auto* conv23_cv3_0_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_0_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.0.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_0_2 =\n            network->addConvolutionNd(*conv23_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.0.2.weight\"], weightMap[\"model.23.cv3.0.2.bias\"]);\n    conv23_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_0[] = {conv23_cv2_0_2->getOutput(0), conv23_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_0 = network->addConcatenation(inputTensor23_0, 2);\n\n    // output1\n    nvinfer1::IElementWiseLayer* conv23_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv19->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.1.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_1_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_1_2 =\n            network->addConvolutionNd(*conv23_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.1.2.weight\"], weightMap[\"model.23.cv2.1.2.bias\"]);\n    conv23_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_1_0_0 = DWConv(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), {3, 3},\n                                    1, \"model.23.cv3.1.0.0\");\n    auto* conv23_cv3_1_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_1_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.1.0.1\");\n    auto* conv23_cv3_1_1_0 =\n            DWConv(network, weightMap, 
*conv23_cv3_1_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.1.1.0\");\n    auto* conv23_cv3_1_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_1_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.1.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_1_2 =\n            network->addConvolutionNd(*conv23_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.1.2.weight\"], weightMap[\"model.23.cv3.1.2.bias\"]);\n    conv23_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_1[] = {conv23_cv2_1_2->getOutput(0), conv23_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(inputTensor23_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv23_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv22->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_2_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_2_2 =\n            network->addConvolutionNd(*conv23_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.2.2.weight\"], weightMap[\"model.23.cv2.2.2.bias\"]);\n    conv23_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_2_0_0 = DWConv(network, weightMap, *conv22->getOutput(0), get_width(1024, gw, max_channels),\n                                    {3, 3}, 1, \"model.23.cv3.2.0.0\");\n    auto* conv23_cv3_2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_2_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.2.0.1\");\n    auto* conv23_cv3_2_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_2_0_1->getOutput(0), c3, {3, 3}, 1, 
\"model.23.cv3.2.1.0\");\n    auto* conv23_cv3_2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_2_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_2_2 =\n            network->addConvolutionNd(*conv23_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.2.2.weight\"], weightMap[\"model.23.cv3.2.2.bias\"]);\n    conv23_cv3_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_2[] = {conv23_cv2_2_2->getOutput(0), conv23_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(inputTensor23_2, 2);\n\n    /*******************************************************************************************************\n    *********************************************  YOLO11 DETECT  ******************************************\n    *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle23_0 = network->addShuffle(*cat23_0->getOutput(0));\n    shuffle23_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split23_0_0 = network->addSlice(\n            *shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_0_1 =\n            network->addSlice(*shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 
64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl23_0 =\n            DFL(network, weightMap, *split23_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_0[] = {dfl23_0->getOutput(0), split23_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_0 = network->addConcatenation(inputTensor22_dfl_0, 2);\n    cat22_dfl_0->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_1 = network->addShuffle(*cat23_1->getOutput(0));\n    shuffle23_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split23_1_0 = network->addSlice(\n            *shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_1_1 =\n            network->addSlice(*shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_1 =\n            DFL(network, weightMap, *split23_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_1[] = {dfl23_1->getOutput(0), split23_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_1 = network->addConcatenation(inputTensor22_dfl_1, 2);\n    cat22_dfl_1->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_2 = network->addShuffle(*cat23_2->getOutput(0));\n    
shuffle23_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split23_2_0 = network->addSlice(\n            *shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_2_1 =\n            network->addSlice(*shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_2 =\n            DFL(network, weightMap, *split23_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_2[] = {dfl23_2->getOutput(0), split23_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_2 = network->addConcatenation(inputTensor22_dfl_2, 2);\n    cat22_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat22_dfl_0, cat22_dfl_1, cat22_dfl_2},\n                         strides, stridesLength, false, false, false);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nstatic nvinfer1::IElementWiseLayer* convBnSiLUProto(nvinfer1::INetworkDefinition* network,\n                                                    std::map<std::string, nvinfer1::Weights> weightMap,\n                                                    nvinfer1::ITensor& input, int ch, int k, int s, int p,\n                                                    std::string lname) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n    conv->setName((lname + \".conv\").c_str());\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n    bn->setName((lname + \".bn\").c_str());\n    // This concat operator is not used for calculation, in order to prevent the operator fusion unrealized error when int8 is quantized.\n    // Error Code 10: Internal Error (Could not find any implementation for node\n    // model.22.proto.cv3.conv + model.22.proto.cv3.sigmoid + 
PWN(PWN((Unnamed Layer* 353) [Activation]), PWN(model.22.proto.cv3.silu)).)\n\n#if defined(USE_INT8)\n    nvinfer1::ITensor* inputTensors[] = {bn->getOutput(0)};\n    auto concat = network->addConcatenation(inputTensors, 1);\n    nvinfer1::IActivationLayer* sigmoid =\n            network->addActivation(*concat->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    assert(sigmoid);\n    sigmoid->setName((lname + \".sigmoid\").c_str());\n    nvinfer1::IElementWiseLayer* ew = network->addElementWise(*concat->getOutput(0), *sigmoid->getOutput(0),\n                                                              nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    ew->setName((lname + \".silu\").c_str());\n#else\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    assert(sigmoid);\n    sigmoid->setName((lname + \".sigmoid\").c_str());\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    ew->setName((lname + \".silu\").c_str());\n#endif\n    return ew;\n}\n\nstatic nvinfer1::IElementWiseLayer* Proto(nvinfer1::INetworkDefinition* network,\n                                          std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input,\n                                          std::string lname, float gw, int max_channels) {\n    int mid_channel = get_width(256, gw, max_channels);\n    auto cv1 = convBnSiLU(network, weightMap, input, mid_channel, {3, 3}, 1, \"model.23.proto.cv1\");\n    //    float *convTranpsose_bais = (float *) weightMap[\"model.23.proto.upsample.bias\"].values;\n    //    int convTranpsose_bais_len = weightMap[\"model.23.proto.upsample.bias\"].count;\n    //    nvinfer1::Weights bias{nvinfer1::DataType::kFLOAT, convTranpsose_bais, convTranpsose_bais_len};\n    auto convTranpsose = network->addDeconvolutionNd(*cv1->getOutput(0), 
mid_channel, nvinfer1::DimsHW{2, 2},\n                                                     weightMap[\"model.23.proto.upsample.weight\"],\n                                                     weightMap[\"model.23.proto.upsample.bias\"]);\n    assert(convTranpsose);\n    convTranpsose->setStrideNd(nvinfer1::DimsHW{2, 2});\n    convTranpsose->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto cv2 =\n            convBnSiLU(network, weightMap, *convTranpsose->getOutput(0), mid_channel, {3, 3}, 1, \"model.23.proto.cv2\");\n    auto cv3 = convBnSiLUProto(network, weightMap, *cv2->getOutput(0), 32, 1, 1, 0, \"model.23.proto.cv3\");\n    assert(cv3);\n    return cv3;\n}\n\nstatic nvinfer1::IShuffleLayer* cv4_conv_combined(nvinfer1::INetworkDefinition* network,\n                                                  std::map<std::string, nvinfer1::Weights>& weightMap,\n                                                  nvinfer1::ITensor& input, std::string lname, int grid_shape, float gw,\n                                                  const std::string& algo_type, int max_channels) {\n    int nm_nk = 0;\n    int c4 = 0;\n\n    if (algo_type == \"seg\") {\n        nm_nk = 32;\n        c4 = std::max(get_width(256, gw, max_channels) / 4, nm_nk);\n    } else if (algo_type == \"pose\") {\n        nm_nk = kNumberOfPoints * 3;\n        c4 = std::max(get_width(256, gw, max_channels) / 4, kNumberOfPoints * 3);\n    } else if (algo_type == \"obb\") {\n        nm_nk = kObbNe;\n        c4 = std::max(get_width(256, gw, max_channels) / 4, nm_nk);\n    } else {\n        std::cerr << \"Unknown algo type: \" << algo_type << std::endl;\n        return nullptr;\n    }\n\n    auto cv0 = convBnSiLU(network, weightMap, input, c4, {3, 3}, 1, lname + \".0\");\n    auto cv1 = convBnSiLU(network, weightMap, *cv0->getOutput(0), c4, {3, 3}, 1, lname + \".1\");\n    float* cv2_bais_value = (float*)weightMap[lname + \".2\" + \".bias\"].values;\n    int cv2_bais_len = weightMap[lname + \".2\" + 
\".bias\"].count;\n    nvinfer1::Weights cv2_bais{nvinfer1::DataType::kFLOAT, cv2_bais_value, cv2_bais_len};\n    auto cv2 = network->addConvolutionNd(*cv1->getOutput(0), nm_nk, nvinfer1::DimsHW{1, 1},\n                                         weightMap[lname + \".2\" + \".weight\"], cv2_bais);\n    cv2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    nvinfer1::IShuffleLayer* cv2_shuffle = network->addShuffle(*cv2->getOutput(0));\n    cv2_shuffle->setReshapeDimensions(nvinfer1::Dims3{kBatchSize, nm_nk, grid_shape});\n\n    return cv2_shuffle;\n}\n\nnvinfer1::IHostMemory* buildEngineYolo11Seg(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n    ******************************************  YOLO11 INPUT  **********************************************\n    *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLO11 BACKBONE  ********************************************\n    *******************************************************************************************************/\n  
  nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, *conv0->getOutput(0),\n                                                    get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\");\n    bool c3k = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k = true;\n    }\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                    get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 =\n            C3K2(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0),\n                                                    get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 =\n            C3K2(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0),\n                                                    get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv8 =\n            C3K2(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                 
get_width(1024, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.8\");\n    nvinfer1::IElementWiseLayer* conv9 =\n            SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.9\");\n    auto* conv10 = C2PSA(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels),\n                         get_width(1024, gw, max_channels), get_depth(2, gd), 0.5, \"model.10\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLO11 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*conv10->getOutput(0));\n    assert(upsample11);\n    upsample11->setResizeMode(nvinfer1::InterpolationMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensor12, 2);\n\n    nvinfer1::IElementWiseLayer* conv13 =\n            C3K2(network, weightMap, *cat12->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 = network->addResize(*conv13->getOutput(0));\n    assert(upsample14);\n    upsample14->setResizeMode(nvinfer1::InterpolationMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor15[] = {upsample14->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensor15, 2);\n\n    nvinfer1::IElementWiseLayer* conv16 =\n            C3K2(network, weightMap, 
*cat15->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.16\");\n\n    nvinfer1::IElementWiseLayer* conv17 = convBnSiLU(network, weightMap, *conv16->getOutput(0),\n                                                     get_width(256, gw, max_channels), {3, 3}, 2, \"model.17\");\n    nvinfer1::ITensor* inputTensor18[] = {conv17->getOutput(0), conv13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensor18, 2);\n    nvinfer1::IElementWiseLayer* conv19 =\n            C3K2(network, weightMap, *cat18->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.19\");\n\n    nvinfer1::IElementWiseLayer* conv20 = convBnSiLU(network, weightMap, *conv19->getOutput(0),\n                                                     get_width(512, gw, max_channels), {3, 3}, 2, \"model.20\");\n    nvinfer1::ITensor* inputTensor21[] = {conv20->getOutput(0), conv10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensor21, 2);\n    nvinfer1::IElementWiseLayer* conv22 =\n            C3K2(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.22\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLO11 OUTPUT  ******************************************\n    *******************************************************************************************************/\n    // c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels\n    int c2 = std::max(std::max(16, get_width(256, gw, max_channels) / 4), 16 * 4);\n    int c3 = std::max(get_width(256, gw, max_channels), 
std::min(kNumClass, 100));\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv23_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_0_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_0_2 =\n            network->addConvolutionNd(*conv23_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.0.2.weight\"], weightMap[\"model.23.cv2.0.2.bias\"]);\n    conv23_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_0_0_0 = DWConv(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), {3, 3},\n                                    1, \"model.23.cv3.0.0.0\");\n    auto* conv23_cv3_0_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_0_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.0.0.1\");\n    auto* conv23_cv3_0_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_0_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.0.1.0\");\n    auto* conv23_cv3_0_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_0_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.0.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_0_2 =\n            network->addConvolutionNd(*conv23_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.0.2.weight\"], weightMap[\"model.23.cv3.0.2.bias\"]);\n    conv23_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_0[] = {conv23_cv2_0_2->getOutput(0), conv23_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_0 = network->addConcatenation(inputTensor23_0, 2);\n\n    // output1\n    
nvinfer1::IElementWiseLayer* conv23_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv19->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.1.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_1_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_1_2 =\n            network->addConvolutionNd(*conv23_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.1.2.weight\"], weightMap[\"model.23.cv2.1.2.bias\"]);\n    conv23_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_1_0_0 = DWConv(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), {3, 3},\n                                    1, \"model.23.cv3.1.0.0\");\n    auto* conv23_cv3_1_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_1_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.1.0.1\");\n    auto* conv23_cv3_1_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_1_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.1.1.0\");\n    auto* conv23_cv3_1_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_1_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.1.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_1_2 =\n            network->addConvolutionNd(*conv23_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.1.2.weight\"], weightMap[\"model.23.cv3.1.2.bias\"]);\n    conv23_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_1[] = {conv23_cv2_1_2->getOutput(0), conv23_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(inputTensor23_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv23_cv2_2_0 =\n            
convBnSiLU(network, weightMap, *conv22->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_2_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_2_2 =\n            network->addConvolutionNd(*conv23_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.2.2.weight\"], weightMap[\"model.23.cv2.2.2.bias\"]);\n    conv23_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_2_0_0 = DWConv(network, weightMap, *conv22->getOutput(0), get_width(1024, gw, max_channels),\n                                    {3, 3}, 1, \"model.23.cv3.2.0.0\");\n    auto* conv23_cv3_2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_2_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.2.0.1\");\n    auto* conv23_cv3_2_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_2_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.2.1.0\");\n    auto* conv23_cv3_2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_2_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_2_2 =\n            network->addConvolutionNd(*conv23_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.2.2.weight\"], weightMap[\"model.23.cv3.2.2.bias\"]);\n    conv23_cv3_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_2[] = {conv23_cv2_2_2->getOutput(0), conv23_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(inputTensor23_2, 2);\n\n    /*******************************************************************************************************\n    
*********************************************  YOLO11 DETECT  ******************************************\n    *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle23_0 = network->addShuffle(*cat23_0->getOutput(0));\n    shuffle23_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split23_0_0 = network->addSlice(\n            *shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_0_1 =\n            network->addSlice(*shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl23_0 =\n            DFL(network, weightMap, *split23_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n\n    nvinfer1::IShuffleLayer* shuffle23_1 = network->addShuffle(*cat23_1->getOutput(0));\n    shuffle23_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split23_1_0 = network->addSlice(\n            *shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, 
nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_1_1 =\n            network->addSlice(*shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_1 =\n            DFL(network, weightMap, *split23_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n\n    nvinfer1::IShuffleLayer* shuffle23_2 = network->addShuffle(*cat23_2->getOutput(0));\n    shuffle23_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split23_2_0 = network->addSlice(\n            *shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_2_1 =\n            network->addSlice(*shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_2 =\n            DFL(network, weightMap, *split23_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n\n    // det0\n    auto proto_coef_0 = cv4_conv_combined(network, weightMap, *conv16->getOutput(0), \"model.23.cv4.0\",\n                                          (kInputH / strides[0]) * (kInputW / strides[0]), gw, \"seg\", max_channels);\n    nvinfer1::ITensor* inputTensor23_dfl_0[] = {dfl23_0->getOutput(0), split23_0_1->getOutput(0),\n                                                proto_coef_0->getOutput(0)};\n    
nvinfer1::IConcatenationLayer* cat23_dfl_0 = network->addConcatenation(inputTensor23_dfl_0, 3);\n    cat23_dfl_0->setAxis(1);\n\n    // det1\n    auto proto_coef_1 = cv4_conv_combined(network, weightMap, *conv19->getOutput(0), \"model.23.cv4.1\",\n                                          (kInputH / strides[1]) * (kInputW / strides[1]), gw, \"seg\", max_channels);\n    nvinfer1::ITensor* inputTensor23_dfl_1[] = {dfl23_1->getOutput(0), split23_1_1->getOutput(0),\n                                                proto_coef_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_1 = network->addConcatenation(inputTensor23_dfl_1, 3);\n    cat23_dfl_1->setAxis(1);\n\n    // det2\n    auto proto_coef_2 = cv4_conv_combined(network, weightMap, *conv22->getOutput(0), \"model.23.cv4.2\",\n                                          (kInputH / strides[2]) * (kInputW / strides[2]), gw, \"seg\", max_channels);\n    nvinfer1::ITensor* inputTensor23_dfl_2[] = {dfl23_2->getOutput(0), split23_2_1->getOutput(0),\n                                                proto_coef_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_2 = network->addConcatenation(inputTensor23_dfl_2, 3);\n    cat23_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat23_dfl_0, cat23_dfl_1, cat23_dfl_2},\n                         strides, stridesLength, true, false, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    auto proto = Proto(network, weightMap, *conv16->getOutput(0), \"model.23.proto\", gw, max_channels);\n    proto->getOutput(0)->setName(kProtoTensorName);\n    network->markOutput(*proto->getOutput(0));\n\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support 
int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolo11Pose(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             int& max_channels, std::string& type) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n    ******************************************  YOLO11 INPUT  **********************************************\n    *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    
/*******************************************************************************************************\n    *****************************************  YOLO11 BACKBONE  ********************************************\n    *******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, *conv0->getOutput(0),\n                                                    get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\");\n    bool c3k = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k = true;\n    }\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                    get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 =\n            C3K2(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0),\n                                                    get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 =\n            C3K2(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 = 
convBnSiLU(network, weightMap, *conv6->getOutput(0),\n                                                    get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv8 =\n            C3K2(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.8\");\n    nvinfer1::IElementWiseLayer* conv9 =\n            SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.9\");\n    auto* conv10 = C2PSA(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels),\n                         get_width(1024, gw, max_channels), get_depth(2, gd), 0.5, \"model.10\");\n    /*******************************************************************************************************\n    *********************************************  YOLO11 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*conv10->getOutput(0));\n    assert(upsample11);\n    upsample11->setResizeMode(nvinfer1::InterpolationMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensor12, 2);\n\n    nvinfer1::IElementWiseLayer* conv13 =\n            C3K2(network, weightMap, *cat12->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 = network->addResize(*conv13->getOutput(0));\n    assert(upsample14);\n    
upsample14->setResizeMode(nvinfer1::InterpolationMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor15[] = {upsample14->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensor15, 2);\n\n    nvinfer1::IElementWiseLayer* conv16 =\n            C3K2(network, weightMap, *cat15->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.16\");\n\n    nvinfer1::IElementWiseLayer* conv17 = convBnSiLU(network, weightMap, *conv16->getOutput(0),\n                                                     get_width(256, gw, max_channels), {3, 3}, 2, \"model.17\");\n    nvinfer1::ITensor* inputTensor18[] = {conv17->getOutput(0), conv13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensor18, 2);\n    nvinfer1::IElementWiseLayer* conv19 =\n            C3K2(network, weightMap, *cat18->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.19\");\n\n    nvinfer1::IElementWiseLayer* conv20 = convBnSiLU(network, weightMap, *conv19->getOutput(0),\n                                                     get_width(512, gw, max_channels), {3, 3}, 2, \"model.20\");\n    nvinfer1::ITensor* inputTensor21[] = {conv20->getOutput(0), conv10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensor21, 2);\n    nvinfer1::IElementWiseLayer* conv22 =\n            C3K2(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.22\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLO11 OUTPUT  
******************************************\n    *******************************************************************************************************/\n    // c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels\n    int c2 = std::max(std::max(16, get_width(256, gw, max_channels) / 4), 16 * 4);\n    int c3 = std::max(get_width(256, gw, max_channels), std::min(kPoseNumClass, 100));\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv23_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_0_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_0_2 =\n            network->addConvolutionNd(*conv23_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.0.2.weight\"], weightMap[\"model.23.cv2.0.2.bias\"]);\n    conv23_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_0_0_0 = DWConv(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), {3, 3},\n                                    1, \"model.23.cv3.0.0.0\");\n    auto* conv23_cv3_0_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_0_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.0.0.1\");\n    auto* conv23_cv3_0_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_0_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.0.1.0\");\n    auto* conv23_cv3_0_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_0_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.0.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_0_2 =\n            network->addConvolutionNd(*conv23_cv3_0_1_1->getOutput(0), kPoseNumClass, nvinfer1::DimsHW{1, 1},\n                                      
weightMap[\"model.23.cv3.0.2.weight\"], weightMap[\"model.23.cv3.0.2.bias\"]);\n    conv23_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_0[] = {conv23_cv2_0_2->getOutput(0), conv23_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_0 = network->addConcatenation(inputTensor23_0, 2);\n\n    // output1\n    nvinfer1::IElementWiseLayer* conv23_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv19->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.1.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_1_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_1_2 =\n            network->addConvolutionNd(*conv23_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.1.2.weight\"], weightMap[\"model.23.cv2.1.2.bias\"]);\n    conv23_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_1_0_0 = DWConv(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), {3, 3},\n                                    1, \"model.23.cv3.1.0.0\");\n    auto* conv23_cv3_1_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_1_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.1.0.1\");\n    auto* conv23_cv3_1_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_1_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.1.1.0\");\n    auto* conv23_cv3_1_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_1_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.1.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_1_2 =\n            network->addConvolutionNd(*conv23_cv3_1_1_1->getOutput(0), kPoseNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.1.2.weight\"], 
weightMap[\"model.23.cv3.1.2.bias\"]);\n    conv23_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_1[] = {conv23_cv2_1_2->getOutput(0), conv23_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(inputTensor23_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv23_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv22->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_2_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_2_2 =\n            network->addConvolutionNd(*conv23_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.2.2.weight\"], weightMap[\"model.23.cv2.2.2.bias\"]);\n    conv23_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_2_0_0 = DWConv(network, weightMap, *conv22->getOutput(0), get_width(1024, gw, max_channels),\n                                    {3, 3}, 1, \"model.23.cv3.2.0.0\");\n    auto* conv23_cv3_2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_2_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.2.0.1\");\n    auto* conv23_cv3_2_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_2_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.2.1.0\");\n    auto* conv23_cv3_2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_2_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_2_2 =\n            network->addConvolutionNd(*conv23_cv3_2_1_1->getOutput(0), kPoseNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.2.2.weight\"], weightMap[\"model.23.cv3.2.2.bias\"]);\n    
conv23_cv3_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_2[] = {conv23_cv2_2_2->getOutput(0), conv23_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(inputTensor23_2, 2);\n    /*******************************************************************************************************\n    *********************************************  YOLO11 DETECT  ******************************************\n    *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    /**************************************************************************************P3****************************************************************************************************************************************/\n    nvinfer1::IShuffleLayer* shuffle23_0 = network->addShuffle(*cat23_0->getOutput(0));\n    shuffle23_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kPoseNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split23_0_0 = network->addSlice(\n            *shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_0_1 = network->addSlice(\n            *shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n            nvinfer1::Dims3{kBatchSize, kPoseNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n            nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl23_0 =\n            
DFL(network, weightMap, *split23_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n\n    // det0\n    auto shuffle_conv16 = cv4_conv_combined(network, weightMap, *conv16->getOutput(0), \"model.23.cv4.0\",\n                                            (kInputH / strides[0]) * (kInputW / strides[0]), gw, \"pose\", max_channels);\n\n    nvinfer1::ITensor* inputTensor23_dfl_0[] = {dfl23_0->getOutput(0), split23_0_1->getOutput(0),\n                                                shuffle_conv16->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_0 = network->addConcatenation(inputTensor23_dfl_0, 3);\n    cat23_dfl_0->setAxis(1);\n\n    /********************************************************************************************P4**********************************************************************************************************************************/\n    nvinfer1::IShuffleLayer* shuffle23_1 = network->addShuffle(*cat23_1->getOutput(0));\n    shuffle23_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kPoseNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split23_1_0 = network->addSlice(\n            *shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_1_1 = network->addSlice(\n            *shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n            nvinfer1::Dims3{kBatchSize, kPoseNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n            nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_1 =\n            DFL(network, weightMap, *split23_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n\n    // det1\n    auto shuffle_conv19 = 
cv4_conv_combined(network, weightMap, *conv19->getOutput(0), \"model.23.cv4.1\",\n                                            (kInputH / strides[1]) * (kInputW / strides[1]), gw, \"pose\", max_channels);\n\n    nvinfer1::ITensor* inputTensor23_dfl_1[] = {dfl23_1->getOutput(0), split23_1_1->getOutput(0),\n                                                shuffle_conv19->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_1 = network->addConcatenation(inputTensor23_dfl_1, 3);\n    cat23_dfl_1->setAxis(1);\n\n    /********************************************************************************************P5**********************************************************************************************************************************/\n    nvinfer1::IShuffleLayer* shuffle23_2 = network->addShuffle(*cat23_2->getOutput(0));\n    shuffle23_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kPoseNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split23_2_0 = network->addSlice(\n            *shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_2_1 = network->addSlice(\n            *shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n            nvinfer1::Dims3{kBatchSize, kPoseNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n            nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_2 =\n            DFL(network, weightMap, *split23_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n\n    // det2\n    auto shuffle_conv22 = cv4_conv_combined(network, weightMap, *conv22->getOutput(0), \"model.23.cv4.2\",\n                                            (kInputH / strides[2]) * (kInputW / strides[2]), gw, \"pose\", max_channels);\n    nvinfer1::ITensor* 
inputTensor23_dfl_2[] = {dfl23_2->getOutput(0), split23_2_1->getOutput(0),\n                                                shuffle_conv22->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_2 = network->addConcatenation(inputTensor23_dfl_2, 3);\n    cat23_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat23_dfl_0, cat23_dfl_1, cat23_dfl_2},\n                         strides, stridesLength, false, true, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Engine built successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolo11Obb(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type) {\n    std::map<std::string, 
nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n    ******************************************  YOLO11 INPUT  **********************************************\n    *******************************************************************************************************/\n    nvinfer1::ITensor* data =\n            network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kObbInputH, kObbInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLO11 BACKBONE  ********************************************\n    *******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, *conv0->getOutput(0),\n                                                    get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\");\n    // 11233\n    bool c3k = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k = true;\n    }\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                    
get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 =\n            C3K2(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0),\n                                                    get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 =\n            C3K2(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0),\n                                                    get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv8 =\n            C3K2(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.8\");\n    nvinfer1::IElementWiseLayer* conv9 =\n            SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.9\");\n    auto* conv10 = C2PSA(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels),\n                         get_width(1024, gw, max_channels), get_depth(2, gd), 0.5, \"model.10\");\n    /*******************************************************************************************************\n    *********************************************  YOLO11 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 1.0, 
2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*conv10->getOutput(0));\n    assert(upsample11);\n    upsample11->setResizeMode(nvinfer1::InterpolationMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensor12, 2);\n\n    nvinfer1::IElementWiseLayer* conv13 =\n            C3K2(network, weightMap, *cat12->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 = network->addResize(*conv13->getOutput(0));\n    assert(upsample14);\n    upsample14->setResizeMode(nvinfer1::InterpolationMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor15[] = {upsample14->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensor15, 2);\n\n    nvinfer1::IElementWiseLayer* conv16 =\n            C3K2(network, weightMap, *cat15->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.16\");\n\n    nvinfer1::IElementWiseLayer* conv17 = convBnSiLU(network, weightMap, *conv16->getOutput(0),\n                                                     get_width(256, gw, max_channels), {3, 3}, 2, \"model.17\");\n    nvinfer1::ITensor* inputTensor18[] = {conv17->getOutput(0), conv13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensor18, 2);\n    nvinfer1::IElementWiseLayer* conv19 =\n            C3K2(network, weightMap, *cat18->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.5, \"model.19\");\n\n    nvinfer1::IElementWiseLayer* conv20 = convBnSiLU(network, weightMap, 
*conv19->getOutput(0),\n                                                     get_width(512, gw, max_channels), {3, 3}, 2, \"model.20\");\n    nvinfer1::ITensor* inputTensor21[] = {conv20->getOutput(0), conv10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensor21, 2);\n    nvinfer1::IElementWiseLayer* conv22 =\n            C3K2(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.22\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLO11 OUTPUT  ******************************************\n    *******************************************************************************************************/\n    // c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels\n    // c4 = max(ch[0] // 4, self.ne)\n    int c2 = std::max(std::max(16, get_width(256, gw, max_channels) / 4), 16 * 4);\n    int c3 = std::max(get_width(256, gw, max_channels), std::min(kObbNumClass, 100));\n    int c4 = std::max(get_width(256, gw, max_channels) / 4, kObbNe);\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv23_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_0_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_0_2 =\n            network->addConvolutionNd(*conv23_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.0.2.weight\"], weightMap[\"model.23.cv2.0.2.bias\"]);\n    conv23_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* 
conv23_cv3_0_0_0 = DWConv(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), {3, 3},\n                                    1, \"model.23.cv3.0.0.0\");\n    auto* conv23_cv3_0_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_0_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.0.0.1\");\n    auto* conv23_cv3_0_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_0_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.0.1.0\");\n    auto* conv23_cv3_0_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_0_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.0.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_0_2 =\n            network->addConvolutionNd(*conv23_cv3_0_1_1->getOutput(0), kObbNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.0.2.weight\"], weightMap[\"model.23.cv3.0.2.bias\"]);\n    conv23_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_0[] = {conv23_cv2_0_2->getOutput(0), conv23_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_0 = network->addConcatenation(inputTensor23_0, 2);\n\n    // output1\n    nvinfer1::IElementWiseLayer* conv23_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv19->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.1.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_1_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_1_2 =\n            network->addConvolutionNd(*conv23_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.1.2.weight\"], weightMap[\"model.23.cv2.1.2.bias\"]);\n    conv23_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_1_0_0 = DWConv(network, weightMap, 
*conv19->getOutput(0), get_width(512, gw, max_channels), {3, 3},\n                                    1, \"model.23.cv3.1.0.0\");\n    auto* conv23_cv3_1_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_1_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.1.0.1\");\n    auto* conv23_cv3_1_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_1_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.1.1.0\");\n    auto* conv23_cv3_1_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_1_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.1.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_1_2 =\n            network->addConvolutionNd(*conv23_cv3_1_1_1->getOutput(0), kObbNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.1.2.weight\"], weightMap[\"model.23.cv3.1.2.bias\"]);\n    conv23_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_1[] = {conv23_cv2_1_2->getOutput(0), conv23_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(inputTensor23_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv23_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv22->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv23_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv23_cv2_2_0->getOutput(0), c2, {3, 3}, 1, \"model.23.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_2_2 =\n            network->addConvolutionNd(*conv23_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv2.2.2.weight\"], weightMap[\"model.23.cv2.2.2.bias\"]);\n    conv23_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_2_0_0 = DWConv(network, weightMap, *conv22->getOutput(0), get_width(1024, gw, 
max_channels),\n                                    {3, 3}, 1, \"model.23.cv3.2.0.0\");\n    auto* conv23_cv3_2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_2_0_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.2.0.1\");\n    auto* conv23_cv3_2_1_0 =\n            DWConv(network, weightMap, *conv23_cv3_2_0_1->getOutput(0), c3, {3, 3}, 1, \"model.23.cv3.2.1.0\");\n    auto* conv23_cv3_2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_cv3_2_1_0->getOutput(0), c3, {1, 1}, 1, \"model.23.cv3.2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv3_2_2 =\n            network->addConvolutionNd(*conv23_cv3_2_1_1->getOutput(0), kObbNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.23.cv3.2.2.weight\"], weightMap[\"model.23.cv3.2.2.bias\"]);\n    conv23_cv3_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_2[] = {conv23_cv2_2_2->getOutput(0), conv23_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(inputTensor23_2, 2);\n\n    /*******************************************************************************************************\n    *********************************************  YOLO11 DETECT  ******************************************\n    *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kObbInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle23_0 = network->addShuffle(*cat23_0->getOutput(0));\n    shuffle23_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kObbNumClass, (kObbInputH / strides[0]) * (kObbInputW / strides[0])});\n 
   nvinfer1::ISliceLayer* split23_0_0 =\n            network->addSlice(*shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n                              nvinfer1::Dims3{kBatchSize, 64, (kObbInputH / strides[0]) * (kObbInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_0_1 = network->addSlice(\n            *shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n            nvinfer1::Dims3{kBatchSize, kObbNumClass, (kObbInputH / strides[0]) * (kObbInputW / strides[0])},\n            nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_0 =\n            DFL(network, weightMap, *split23_0_0->getOutput(0), 4,\n                (kObbInputH / strides[0]) * (kObbInputW / strides[0]), 1, 1, 0, \"model.23.dfl.conv.weight\");\n\n    nvinfer1::IShuffleLayer* shuffle23_1 = network->addShuffle(*cat23_1->getOutput(0));\n    shuffle23_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kObbNumClass, (kObbInputH / strides[1]) * (kObbInputW / strides[1])});\n    nvinfer1::ISliceLayer* split23_1_0 =\n            network->addSlice(*shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n                              nvinfer1::Dims3{kBatchSize, 64, (kObbInputH / strides[1]) * (kObbInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_1_1 = network->addSlice(\n            *shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n            nvinfer1::Dims3{kBatchSize, kObbNumClass, (kObbInputH / strides[1]) * (kObbInputW / strides[1])},\n            nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_1 =\n            DFL(network, weightMap, *split23_1_0->getOutput(0), 4,\n                (kObbInputH / strides[1]) * (kObbInputW / strides[1]), 1, 1, 0, \"model.23.dfl.conv.weight\");\n\n    nvinfer1::IShuffleLayer* shuffle23_2 = network->addShuffle(*cat23_2->getOutput(0));\n    shuffle23_2->setReshapeDimensions(\n            
nvinfer1::Dims3{kBatchSize, 64 + kObbNumClass, (kObbInputH / strides[2]) * (kObbInputW / strides[2])});\n    nvinfer1::ISliceLayer* split23_2_0 =\n            network->addSlice(*shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n                              nvinfer1::Dims3{kBatchSize, 64, (kObbInputH / strides[2]) * (kObbInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_2_1 = network->addSlice(\n            *shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n            nvinfer1::Dims3{kBatchSize, kObbNumClass, (kObbInputH / strides[2]) * (kObbInputW / strides[2])},\n            nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_2 =\n            DFL(network, weightMap, *split23_2_0->getOutput(0), 4,\n                (kObbInputH / strides[2]) * (kObbInputW / strides[2]), 1, 1, 0, \"model.23.dfl.conv.weight\");\n\n    // det0\n    auto shuffle_conv16 =\n            cv4_conv_combined(network, weightMap, *conv16->getOutput(0), \"model.23.cv4.0\",\n                              (kObbInputH / strides[0]) * (kObbInputW / strides[0]), gw, \"obb\", max_channels);\n\n    nvinfer1::ITensor* inputTensor23_dfl_0[] = {dfl23_0->getOutput(0), split23_0_1->getOutput(0),\n                                                shuffle_conv16->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_0 = network->addConcatenation(inputTensor23_dfl_0, 3);\n    cat23_dfl_0->setAxis(1);\n\n    // det1\n    auto shuffle_conv19 =\n            cv4_conv_combined(network, weightMap, *conv19->getOutput(0), \"model.23.cv4.1\",\n                              (kObbInputH / strides[1]) * (kObbInputW / strides[1]), gw, \"obb\", max_channels);\n    nvinfer1::ITensor* inputTensor23_dfl_1[] = {dfl23_1->getOutput(0), split23_1_1->getOutput(0),\n                                                shuffle_conv19->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_1 = network->addConcatenation(inputTensor23_dfl_1, 
3);\n    cat23_dfl_1->setAxis(1);\n\n    // det2\n    auto shuffle_conv22 =\n            cv4_conv_combined(network, weightMap, *conv22->getOutput(0), \"model.23.cv4.2\",\n                              (kObbInputH / strides[2]) * (kObbInputW / strides[2]), gw, \"obb\", max_channels);\n    nvinfer1::ITensor* inputTensor23_dfl_2[] = {dfl23_2->getOutput(0), split23_2_1->getOutput(0),\n                                                shuffle_conv22->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_2 = network->addConcatenation(inputTensor23_dfl_2, 3);\n    cat23_dfl_2->setAxis(1);\n\n    // yolo layer\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat23_dfl_0, cat23_dfl_1, cat23_dfl_2},\n                         strides, stridesLength, false, false, true);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kObbInputW, kObbInputH, kInputQuantizationFolder,\n                                                  \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Engine built successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n"
  },
  {
    "path": "yolo11/src/postprocess.cpp",
    "content": "#include \"postprocess.h\"\n#include \"utils.h\"\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\ncv::Rect get_rect_obb(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kObbInputW / (img.cols * 1.0);\n    float r_h = kObbInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kObbInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kObbInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kObbInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kObbInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - 
int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\ncv::Rect get_rect_adapt_landmark(cv::Mat& img, float bbox[4], float lmk[kNumberOfPoints * 3]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] / r_w;\n        r = bbox[2] / r_w;\n        t = (bbox[1] - (kInputH - r_w * img.rows) / 2) / r_w;\n        b = (bbox[3] - (kInputH - r_w * img.rows) / 2) / r_w;\n        for (int i = 0; i < kNumberOfPoints * 3; i += 3) {\n            lmk[i] /= r_w;\n            lmk[i + 1] = (lmk[i + 1] - (kInputH - r_w * img.rows) / 2) / r_w;\n            // lmk[i + 2]\n        }\n    } else {\n        l = (bbox[0] - (kInputW - r_h * img.cols) / 2) / r_h;\n        r = (bbox[2] - (kInputW - r_h * img.cols) / 2) / r_h;\n        t = bbox[1] / r_h;\n        b = bbox[3] / r_h;\n        for (int i = 0; i < kNumberOfPoints * 3; i += 3) {\n            lmk[i] = (lmk[i] - (kInputW - r_h * img.cols) / 2) / r_h;\n            lmk[i + 1] /= r_h;\n            // lmk[i + 2]\n        }\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\nstatic float iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n            (std::max)(lbox[0], rbox[0]),\n            (std::min)(lbox[2], rbox[2]),\n            (std::max)(lbox[1], rbox[1]),\n            (std::min)(lbox[3], rbox[3]),\n    };\n\n    if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS = (interBox[1] - interBox[0]) * (interBox[3] - interBox[2]);\n    float unionBoxS = (lbox[2] - lbox[0]) * (lbox[3] - lbox[1]) + (rbox[2] - rbox[0]) * (rbox[3] - rbox[1]) - interBoxS;\n    return interBoxS / 
unionBoxS;\n}\n\nstatic bool cmp(const Detection& a, const Detection& b) {\n    if (a.conf == b.conf) {\n        return a.bbox[0] < b.bbox[0];\n    }\n    return a.conf > b.conf;\n}\n\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n        if (output[1 + det_size * i + 4] <= conf_thresh || isnan(output[1 + det_size * i + 4]))\n            continue;\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        if (m.count(det.class_id) == 0)\n            m.emplace(det.class_id, std::vector<Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\nvoid batch_nms(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        nms(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n    }\n}\n\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count) {\n    Detection det;\n    for (int i = 0; i < count; i++) {\n        int basic_pos = 1 + i * bbox_element;\n        int keep_flag = decode_ptr_host[basic_pos + 6];\n        if 
(keep_flag == 1) {\n            det.bbox[0] = decode_ptr_host[basic_pos + 0];\n            det.bbox[1] = decode_ptr_host[basic_pos + 1];\n            det.bbox[2] = decode_ptr_host[basic_pos + 2];\n            det.bbox[3] = decode_ptr_host[basic_pos + 3];\n            det.conf = decode_ptr_host[basic_pos + 4];\n            det.class_id = decode_ptr_host[basic_pos + 5];\n            res.push_back(det);\n        }\n    }\n}\n\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch) {\n    res_batch.resize(batch_size);\n    int count = static_cast<int>(*decode_ptr_host);\n    count = std::min(count, kMaxNumOutputBbox);\n    for (int i = 0; i < batch_size; i++) {\n        auto& img = const_cast<cv::Mat&>(img_batch[i]);\n        process_decode_ptr_host(res_batch[i], &decode_ptr_host[i * count], bbox_element, img, count);\n    }\n}\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect(img, res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n    }\n}\n\nvoid draw_bbox_keypoints_line(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    const std::vector<std::pair<int, int>> skeleton_pairs = {\n            {0, 1}, {0, 2},  {0, 5}, {0, 6},  {1, 2},   {1, 3},   {2, 4},   {5, 6},   {5, 7},  {5, 11},\n            {6, 8}, {6, 12}, {7, 9}, {8, 10}, {11, 12}, {11, 13}, {12, 14}, {13, 15}, {14, 16}};\n\n    for (size_t i = 0; i < 
img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect_adapt_landmark(img, res[j].bbox, res[j].keypoints);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n\n            for (int k = 0; k < kNumberOfPoints * 3; k += 3) {\n                if (res[j].keypoints[k + 2] > 0.5) {\n                    cv::circle(img, cv::Point((int)res[j].keypoints[k], (int)res[j].keypoints[k + 1]), 3,\n                               cv::Scalar(0, 0x27, 0xC1), -1);\n                }\n            }\n\n            for (const auto& bone : skeleton_pairs) {\n                int kp1_idx = bone.first * 3;\n                int kp2_idx = bone.second * 3;\n                if (res[j].keypoints[kp1_idx + 2] > 0.5 && res[j].keypoints[kp2_idx + 2] > 0.5) {\n                    cv::Point p1((int)res[j].keypoints[kp1_idx], (int)res[j].keypoints[kp1_idx + 1]);\n                    cv::Point p2((int)res[j].keypoints[kp2_idx], (int)res[j].keypoints[kp2_idx + 1]);\n                    cv::line(img, p1, p2, cv::Scalar(0, 0x27, 0xC1), 2);\n                }\n            }\n        }\n    }\n}\n\ncv::Mat scale_mask(cv::Mat mask, cv::Mat img) {\n    int x, y, w, h;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = kInputW;\n        h = r_w * img.rows;\n        x = 0;\n        y = (kInputH - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = kInputH;\n        x = (kInputW - w) / 2;\n        y = 0;\n    }\n    cv::Rect r(x, y, w, h);\n    cv::Mat res;\n    cv::resize(mask(r), res, img.size());\n    return res;\n}\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& 
masks,\n                    std::unordered_map<int, std::string>& labels_map) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A, 0x92CC17,\n                                           0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < dets.size(); i++) {\n        cv::Mat img_mask = scale_mask(masks[i], img);\n        auto color = colors[(int)dets[i].class_id % colors.size()];\n        auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n\n        cv::Rect r = get_rect(img, dets[i].bbox);\n        for (int x = r.x; x < r.x + r.width; x++) {\n            for (int y = r.y; y < r.y + r.height; y++) {\n                float val = img_mask.at<float>(y, x);\n                if (val <= 0.5)\n                    continue;\n                img.at<cv::Vec3b>(y, x)[0] = img.at<cv::Vec3b>(y, x)[0] / 2 + bgr[0] / 2;\n                img.at<cv::Vec3b>(y, x)[1] = img.at<cv::Vec3b>(y, x)[1] / 2 + bgr[1] / 2;\n                img.at<cv::Vec3b>(y, x)[2] = img.at<cv::Vec3b>(y, x)[2] / 2 + bgr[2] / 2;\n            }\n        }\n\n        cv::rectangle(img, r, bgr, 2);\n\n        // Get the size of the text\n        cv::Size textSize =\n                cv::getTextSize(labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                                cv::FONT_HERSHEY_PLAIN, 1.2, 2, NULL);\n        // Set the top left corner of the rectangle\n        cv::Point topLeft(r.x, r.y - textSize.height);\n\n        // Set the bottom right corner of the rectangle\n        cv::Point bottomRight(r.x + textSize.width, r.y + textSize.height);\n\n        // Set the thickness of the rectangle lines\n        int lineThickness = 2;\n\n        // Draw the rectangle on the image\n        cv::rectangle(img, topLeft, bottomRight, bgr, -1);\n\n        
cv::putText(img, labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                    cv::Point(r.x, r.y + 4), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar::all(0xFF), 2);\n    }\n}\n\nvoid process_decode_ptr_host_obb(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element,\n                                 cv::Mat& img, int count) {\n    Detection det;\n    for (int i = 0; i < count; i++) {\n        int basic_pos = 1 + i * bbox_element;\n        int keep_flag = decode_ptr_host[basic_pos + 6];\n        if (keep_flag == 1) {\n            det.bbox[0] = decode_ptr_host[basic_pos + 0];\n            det.bbox[1] = decode_ptr_host[basic_pos + 1];\n            det.bbox[2] = decode_ptr_host[basic_pos + 2];\n            det.bbox[3] = decode_ptr_host[basic_pos + 3];\n            det.conf = decode_ptr_host[basic_pos + 4];\n            det.class_id = decode_ptr_host[basic_pos + 5];\n            det.angle = decode_ptr_host[basic_pos + 7];\n            res.push_back(det);\n        }\n    }\n}\n\nvoid batch_process_obb(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                       int bbox_element, const std::vector<cv::Mat>& img_batch) {\n    res_batch.resize(batch_size);\n    int count = static_cast<int>(*decode_ptr_host);\n    count = std::min(count, kMaxNumOutputBbox);\n    for (int i = 0; i < batch_size; i++) {\n        auto& img = const_cast<cv::Mat&>(img_batch[i]);\n        process_decode_ptr_host_obb(res_batch[i], &decode_ptr_host[i * count], bbox_element, img, count);\n    }\n}\n\nstd::tuple<float, float, float> convariance_matrix(Detection res) {\n    float w = res.bbox[2];\n    float h = res.bbox[3];\n\n    float a = w * w / 12.0;\n    float b = h * h / 12.0;\n    float c = res.angle;\n\n    float cos_r = std::cos(c);\n    float sin_r = std::sin(c);\n\n    float cos_r2 = cos_r * cos_r;\n    float sin_r2 = sin_r * sin_r;\n\n    float a_val = a * cos_r2 + b * 
sin_r2;\n    float b_val = a * sin_r2 + b * cos_r2;\n    float c_val = (a - b) * cos_r * sin_r;\n\n    return std::make_tuple(a_val, b_val, c_val);\n}\n\nstatic float probiou(const Detection& res1, const Detection& res2, float eps = 1e-7) {\n    // Calculate the prob iou between oriented bounding boxes, https://arxiv.org/pdf/2106.06072v1.pdf.\n    float a1, b1, c1, a2, b2, c2;\n    // Initialize directly from the covariance terms instead of from uninitialized floats.\n    std::tuple<float, float, float> matrix1 = convariance_matrix(res1);\n    std::tuple<float, float, float> matrix2 = convariance_matrix(res2);\n    a1 = std::get<0>(matrix1);\n    b1 = std::get<1>(matrix1);\n    c1 = std::get<2>(matrix1);\n    a2 = std::get<0>(matrix2);\n    b2 = std::get<1>(matrix2);\n    c2 = std::get<2>(matrix2);\n\n    float x1 = res1.bbox[0], y1 = res1.bbox[1];\n    float x2 = res2.bbox[0], y2 = res2.bbox[1];\n\n    float t1 = ((a1 + a2) * std::pow(y1 - y2, 2) + (b1 + b2) * std::pow(x1 - x2, 2)) /\n               ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2) + eps);\n    float t2 = ((c1 + c2) * (x2 - x1) * (y1 - y2)) / ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2) + eps);\n    float t3 = std::log(\n            ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2)) /\n                    (4 * std::sqrt(std::max(a1 * b1 - c1 * c1, 0.0f)) * std::sqrt(std::max(a2 * b2 - c2 * c2, 0.0f)) +\n                     eps) +\n            eps);\n\n    float bd = 0.25f * t1 + 0.5f * t2 + 0.5f * t3;\n    bd = std::max(std::min(bd, 100.0f), eps);\n    float hd = std::sqrt(1.0 - std::exp(-bd) + eps);\n\n    return 1 - hd;\n}\n\nvoid nms_obb(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n\n        if (output[1 + det_size * i + 4] <= conf_thresh)\n            continue;\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * 
sizeof(float));\n        if (m.count(det.class_id) == 0)\n            m.emplace(det.class_id, std::vector<Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (probiou(item, dets[n]) >= nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\nvoid batch_nms_obb(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n                   float conf_thresh, float nms_thresh) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        nms_obb(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n    }\n}\n\nstatic std::vector<cv::Point> get_corner(cv::Mat& img, const Detection& box) {\n    float cos_value, sin_value;\n\n    // Calculate center point and width/height\n    float x1 = box.bbox[0];\n    float y1 = box.bbox[1];\n    float w = box.bbox[2];\n    float h = box.bbox[3];\n    float angle = box.angle * 180.0f / CV_PI;  // Convert radians to degrees\n\n    // Print original angle\n    std::cout << \"Original angle: \" << angle << std::endl;\n\n    // Swap width and height if height is greater than or equal to width\n    if (h >= w) {\n        std::swap(w, h);\n        angle = fmod(angle + 90.0f, 180.0f);  // Adjust angle to be within [0, 180)\n    }\n\n    // Ensure the angle is between 0 and 180 degrees\n    if (angle < 0) {\n        angle += 360.0f;  // Convert to positive value\n    }\n    if (angle > 180.0f) {\n        angle -= 180.0f;  // Subtract 180 from angles greater than 180\n    }\n\n    // Print adjusted angle\n    std::cout << 
\"Adjusted angle: \" << angle << std::endl;\n\n    // Convert to normal angle value\n    float normal_angle = fmod(angle, 180.0f);\n    if (normal_angle < 0) {\n        normal_angle += 180.0f;  // Ensure it's a positive value\n    }\n\n    // Print normal angle value\n    std::cout << \"Normal angle: \" << normal_angle << std::endl;\n\n    cos_value = std::cos(angle * CV_PI / 180.0f);  // Convert to radians\n    sin_value = std::sin(angle * CV_PI / 180.0f);\n\n    // Calculate each corner point\n    float l = x1 - w / 2;  // Left boundary\n    float r = x1 + w / 2;  // Right boundary\n    float t = y1 - h / 2;  // Top boundary\n    float b = y1 + h / 2;  // Bottom boundary\n\n    // Use get_rect function to scale the coordinates\n    float bbox[4] = {l, t, r, b};\n    cv::Rect rect = get_rect_obb(img, bbox);\n\n    float x_ = (rect.x + rect.x + rect.width) / 2;   // Center x\n    float y_ = (rect.y + rect.y + rect.height) / 2;  // Center y\n    float width = rect.width;                        // Width\n    float height = rect.height;                      // Height\n\n    // Calculate each corner point\n    std::vector<cv::Point> corner_points(4);\n    float vec1x = width / 2 * cos_value;\n    float vec1y = width / 2 * sin_value;\n    float vec2x = -height / 2 * sin_value;\n    float vec2y = height / 2 * cos_value;\n\n    corner_points[0] = cv::Point(int(round(x_ + vec1x + vec2x)), int(round(y_ + vec1y + vec2y)));  // Top-left corner\n    corner_points[1] = cv::Point(int(round(x_ + vec1x - vec2x)), int(round(y_ + vec1y - vec2y)));  // Top-right corner\n    corner_points[2] =\n            cv::Point(int(round(x_ - vec1x - vec2x)), int(round(y_ - vec1y - vec2y)));  // Bottom-right corner\n    corner_points[3] = cv::Point(int(round(x_ - vec1x + vec2x)), int(round(y_ - vec1y + vec2y)));  // Bottom-left corner\n\n    // Check and adjust corner points to ensure the rectangle is parallel to image boundaries\n    for (auto& point : corner_points) {\n        point.x = 
std::max(0, std::min(point.x, img.cols - 1));\n        point.y = std::max(0, std::min(point.y, img.rows - 1));\n    }\n\n    return corner_points;\n}\n\nvoid draw_bbox_obb(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A, 0x92CC17,\n                                           0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        auto& img = img_batch[i];\n        for (auto& obj : res) {\n            auto color = colors[(int)obj.class_id % colors.size()];\n            auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n            auto corner_points = get_corner(img, obj);\n            cv::polylines(img, std::vector<std::vector<cv::Point>>{corner_points}, true, bgr, 1);\n\n            auto text = (std::to_string((int)(obj.class_id)) + \":\" + to_string_with_precision(obj.conf));\n            cv::Size textsize = cv::getTextSize(text, 0, 0.3, 1, nullptr);\n\n            int width = textsize.width;\n            int height = textsize.height;\n            bool outside = (corner_points[0].y - height >= 3) ? true : false;\n            cv::Point p1(corner_points[0].x, corner_points[0].y), p2;\n            p2.x = corner_points[0].x + width;\n            if (outside) {\n                p2.y = corner_points[0].y - height - 3;\n            } else {\n                p2.y = corner_points[0].y + height + 3;\n            }\n            cv::rectangle(img, p1, p2, bgr, -1, cv::LINE_AA);\n            cv::putText(\n                    img, text,\n                    cv::Point(corner_points[0].x, (outside ? 
corner_points[0].y - 2 : corner_points[0].y + height + 2)),\n                    0, 0.3, cv::Scalar::all(255), 1, cv::LINE_AA);\n        }\n    }\n}\n"
  },
  {
    "path": "yolo11/src/postprocess.cu",
    "content": "//\n// Created by lindsay on 23-7-17.\n//\n#include \"postprocess.h\"\n#include \"types.h\"\n\nstatic __global__ void decode_kernel_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray,\n                                         int max_objects) {\n    float count = predict[0];\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    if (position >= count)\n        return;\n\n    float* pitem = predict + 1 + position * (sizeof(Detection) / sizeof(float));\n    int index = atomicAdd(parray, 1);\n    if (index >= max_objects)\n        return;\n\n    float confidence = pitem[4];\n\n    if (confidence < confidence_threshold)\n        return;\n    //[center_x center_y w h conf class_id  mask[32] keypoints[51] angle]\n    float cx = pitem[0];\n    float cy = pitem[1];\n    float width = pitem[2];\n    float height = pitem[3];\n    float label = pitem[5];\n    float angle = pitem[89];\n\n    float* pout_item = parray + 1 + index * bbox_element;\n    *pout_item++ = cx;\n    *pout_item++ = cy;\n    *pout_item++ = width;\n    *pout_item++ = height;\n    *pout_item++ = confidence;\n    *pout_item++ = label;\n    *pout_item++ = 1;  // 1 = keep, 0 = ignore\n    *pout_item++ = angle;\n}\n\nstatic __global__ void decode_kernel(float* predict, int num_bboxes, float confidence_threshold, float* parray,\n                                     int max_objects) {\n    float count = predict[0];\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    if (position >= count)\n        return;\n\n    float* pitem = predict + 1 + position * (sizeof(Detection) / sizeof(float));\n    int index = atomicAdd(parray, 1);\n    if (index >= max_objects)\n        return;\n\n    float confidence = pitem[4];\n    if (confidence < confidence_threshold)\n        return;\n\n    float left = pitem[0];\n    float top = pitem[1];\n    float right = pitem[2];\n    float bottom = pitem[3];\n    float label = pitem[5];\n\n    float* pout_item = parray + 1 
+ index * bbox_element;\n    *pout_item++ = left;\n    *pout_item++ = top;\n    *pout_item++ = right;\n    *pout_item++ = bottom;\n    *pout_item++ = confidence;\n    *pout_item++ = label;\n    *pout_item++ = 1;  // 1 = keep, 0 = ignore\n}\n\nstatic __device__ float box_iou(float aleft, float atop, float aright, float abottom, float bleft, float btop,\n                                float bright, float bbottom) {\n    float cleft = max(aleft, bleft);\n    float ctop = max(atop, btop);\n    float cright = min(aright, bright);\n    float cbottom = min(abottom, bbottom);\n    float c_area = max(cright - cleft, 0.0f) * max(cbottom - ctop, 0.0f);\n    if (c_area == 0.0f)\n        return 0.0f;\n\n    float a_area = max(0.0f, aright - aleft) * max(0.0f, abottom - atop);\n    float b_area = max(0.0f, bright - bleft) * max(0.0f, bbottom - btop);\n    return c_area / (a_area + b_area - c_area);\n}\n\nstatic __global__ void nms_kernel(float* bboxes, int max_objects, float threshold) {\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    int count = min(static_cast<int>(bboxes[0]), max_objects);\n    if (position >= count)\n        return;\n\n    float* pcurrent = bboxes + 1 + position * bbox_element;\n    for (int i = 0; i < count; ++i) {\n        float* pitem = bboxes + 1 + i * bbox_element;\n        if (i == position || pcurrent[5] != pitem[5])\n            continue;\n        if (pitem[4] >= pcurrent[4]) {\n            if (pitem[4] == pcurrent[4] && i < position)\n                continue;\n            float iou =\n                    box_iou(pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3], pitem[0], pitem[1], pitem[2], pitem[3]);\n            if (iou > threshold) {\n                pcurrent[6] = 0;\n                return;\n            }\n        }\n    }\n}\n\nstatic __device__ void convariance_matrix(float w, float h, float r, float& a, float& b, float& c) {\n    float a_val = w * w / 12.0f;\n    float b_val = h * h / 12.0f;\n    float cos_r = cosf(r);\n 
   float sin_r = sinf(r);\n\n    a = a_val * cos_r * cos_r + b_val * sin_r * sin_r;\n    b = a_val * sin_r * sin_r + b_val * cos_r * cos_r;\n    c = (a_val - b_val) * sin_r * cos_r;\n}\n\nstatic __device__ float box_probiou(float cx1, float cy1, float w1, float h1, float r1, float cx2, float cy2, float w2,\n                                    float h2, float r2, float eps = 1e-7) {\n\n    // Calculate the prob iou between oriented bounding boxes, https://arxiv.org/pdf/2106.06072v1.pdf.\n    float a1, b1, c1, a2, b2, c2;\n    convariance_matrix(w1, h1, r1, a1, b1, c1);\n    convariance_matrix(w2, h2, r2, a2, b2, c2);\n\n    float t1 = ((a1 + a2) * powf(cy1 - cy2, 2) + (b1 + b2) * powf(cx1 - cx2, 2)) /\n               ((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2) + eps);\n    float t2 = ((c1 + c2) * (cx2 - cx1) * (cy1 - cy2)) / ((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2) + eps);\n    float t3 = logf(((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2)) /\n                            (4 * sqrtf(fmaxf(a1 * b1 - c1 * c1, 0.0f)) * sqrtf(fmaxf(a2 * b2 - c2 * c2, 0.0f)) + eps) +\n                    eps);\n    float bd = 0.25f * t1 + 0.5f * t2 + 0.5f * t3;\n    bd = fmaxf(fminf(bd, 100.0f), eps);\n    float hd = sqrtf(1.0f - expf(-bd) + eps);\n    return 1 - hd;\n}\n\nstatic __global__ void nms_kernel_obb(float* bboxes, int max_objects, float threshold) {\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    int count = min(static_cast<int>(bboxes[0]), max_objects);\n    if (position >= count)\n        return;\n\n    float* pcurrent = bboxes + 1 + position * bbox_element;\n    for (int i = 0; i < count; ++i) {\n        float* pitem = bboxes + 1 + i * bbox_element;\n        if (i == position || pcurrent[5] != pitem[5])\n            continue;\n        if (pitem[4] >= pcurrent[4]) {\n            if (pitem[4] == pcurrent[4] && i < position)\n                continue;\n            float iou = box_probiou(pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3], pcurrent[7], pitem[0], 
pitem[1],\n                                    pitem[2], pitem[3], pitem[7]);\n            if (iou > threshold) {\n                pcurrent[6] = 0;\n                return;\n            }\n        }\n    }\n}\n\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 cudaStream_t stream) {\n    int block = 256;\n    int grid = ceil(num_bboxes / (float)block);\n    decode_kernel<<<grid, block, 0, stream>>>((float*)predict, num_bboxes, confidence_threshold, parray, max_objects);\n}\n\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream) {\n    int block = max_objects < 256 ? max_objects : 256;\n    int grid = ceil(max_objects / (float)block);\n    nms_kernel<<<grid, block, 0, stream>>>(parray, max_objects, nms_threshold);\n}\n\nvoid cuda_decode_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                     cudaStream_t stream) {\n    int block = 256;\n    int grid = ceil(num_bboxes / (float)block);\n    decode_kernel_obb<<<grid, block, 0, stream>>>((float*)predict, num_bboxes, confidence_threshold, parray,\n                                                  max_objects);\n}\n\nvoid cuda_nms_obb(float* parray, float nms_threshold, int max_objects, cudaStream_t stream) {\n    int block = max_objects < 256 ? max_objects : 256;\n    int grid = ceil(max_objects / (float)block);\n    nms_kernel_obb<<<grid, block, 0, stream>>>(parray, max_objects, nms_threshold);\n}\n"
  },
  {
    "path": "yolo11/src/preprocess.cu",
    "content": "#include \"cuda_utils.h\"\n#include \"preprocess.h\"\n\nstatic uint8_t* img_buffer_host = nullptr;\nstatic uint8_t* img_buffer_device = nullptr;\n\n__global__ void warpaffine_kernel(uint8_t* src, int src_line_size, int src_width, int src_height, float* dst,\n                                  int dst_width, int dst_height, uint8_t const_value_st, AffineMatrix d2s, int edge) {\n    int position = blockDim.x * blockIdx.x + threadIdx.x;\n    if (position >= edge)\n        return;\n\n    float m_x1 = d2s.value[0];\n    float m_y1 = d2s.value[1];\n    float m_z1 = d2s.value[2];\n    float m_x2 = d2s.value[3];\n    float m_y2 = d2s.value[4];\n    float m_z2 = d2s.value[5];\n\n    int dx = position % dst_width;\n    int dy = position / dst_width;\n    float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;\n    float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;\n    float c0, c1, c2;\n\n    if (src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height) {\n        // out of range\n        c0 = const_value_st;\n        c1 = const_value_st;\n        c2 = const_value_st;\n    } else {\n        int y_low = floorf(src_y);\n        int x_low = floorf(src_x);\n        int y_high = y_low + 1;\n        int x_high = x_low + 1;\n\n        uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};\n        float ly = src_y - y_low;\n        float lx = src_x - x_low;\n        float hy = 1 - ly;\n        float hx = 1 - lx;\n        float w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n        uint8_t* v1 = const_value;\n        uint8_t* v2 = const_value;\n        uint8_t* v3 = const_value;\n        uint8_t* v4 = const_value;\n\n        if (y_low >= 0) {\n            if (x_low >= 0)\n                v1 = src + y_low * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v2 = src + y_low * src_line_size + x_high * 3;\n        }\n\n        if (y_high < src_height) {\n            if (x_low >= 0)\n                
v3 = src + y_high * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v4 = src + y_high * src_line_size + x_high * 3;\n        }\n\n        c0 = w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0];\n        c1 = w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1];\n        c2 = w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2];\n    }\n\n    // bgr to rgb\n    float t = c2;\n    c2 = c0;\n    c0 = t;\n\n    // normalization\n    c0 = c0 / 255.0f;\n    c1 = c1 / 255.0f;\n    c2 = c2 / 255.0f;\n\n    // rgbrgbrgb to rrrgggbbb\n    int area = dst_width * dst_height;\n    float* pdst_c0 = dst + dy * dst_width + dx;\n    float* pdst_c1 = pdst_c0 + area;\n    float* pdst_c2 = pdst_c1 + area;\n    *pdst_c0 = c0;\n    *pdst_c1 = c1;\n    *pdst_c2 = c2;\n}\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream) {\n    int img_size = src_width * src_height * 3;\n    // copy data to pinned memory\n    memcpy(img_buffer_host, src, img_size);\n    // copy data to device memory\n    CUDA_CHECK(cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size, cudaMemcpyHostToDevice, stream));\n\n    AffineMatrix s2d, d2s;\n    float scale = std::min(dst_height / (float)src_height, dst_width / (float)src_width);\n\n    s2d.value[0] = scale;\n    s2d.value[1] = 0;\n    s2d.value[2] = -scale * src_width * 0.5 + dst_width * 0.5;\n    s2d.value[3] = 0;\n    s2d.value[4] = scale;\n    s2d.value[5] = -scale * src_height * 0.5 + dst_height * 0.5;\n    cv::Mat m2x3_s2d(2, 3, CV_32F, s2d.value);\n    cv::Mat m2x3_d2s(2, 3, CV_32F, d2s.value);\n    cv::invertAffineTransform(m2x3_s2d, m2x3_d2s);\n\n    memcpy(d2s.value, m2x3_d2s.ptr<float>(0), sizeof(d2s.value));\n\n    int jobs = dst_height * dst_width;\n    int threads = 256;\n    int blocks = ceil(jobs / (float)threads);\n    warpaffine_kernel<<<blocks, threads, 0, stream>>>(img_buffer_device, src_width * 3, 
src_width, src_height, dst,\n                                                      dst_width, dst_height, 128, d2s, jobs);\n}\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream) {\n    int dst_size = dst_width * dst_height * 3;\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        cuda_preprocess(img_batch[i].ptr(), img_batch[i].cols, img_batch[i].rows, &dst[dst_size * i], dst_width,\n                        dst_height, stream);\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n    }\n}\n\nvoid cuda_preprocess_init(int max_image_size) {\n    // prepare input data in pinned memory\n    CUDA_CHECK(cudaMallocHost((void**)&img_buffer_host, max_image_size * 3));\n    // prepare input data in device memory\n    CUDA_CHECK(cudaMalloc((void**)&img_buffer_device, max_image_size * 3));\n}\n\nvoid cuda_preprocess_destroy() {\n    CUDA_CHECK(cudaFree(img_buffer_device));\n    CUDA_CHECK(cudaFreeHost(img_buffer_host));\n}\n"
  },
  {
    "path": "yolo11/yolo11_cls.cpp",
    "content": "#include \"calibrator.h\"\n#include \"config.h\"\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"utils.h\"\n\n#include <chrono>\n#include <cmath>\n#include <iostream>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\nconst static int kOutputSize = kClsNumClass;\n\nvoid batch_preprocess(std::vector<cv::Mat>& imgs, float* output, int dst_width = 224, int dst_height = 224) {\n    for (size_t b = 0; b < imgs.size(); b++) {\n        int h = imgs[b].rows;\n        int w = imgs[b].cols;\n        int m = std::min(h, w);\n        int top = (h - m) / 2;\n        int left = (w - m) / 2;\n        cv::Mat img = imgs[b](cv::Rect(left, top, m, m));\n        cv::resize(img, img, cv::Size(dst_width, dst_height), 0, 0, cv::INTER_LINEAR);\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n        img.convertTo(img, CV_32F, 1 / 255.0);\n\n        std::vector<cv::Mat> channels(3);\n        cv::split(img, channels);\n\n        // CHW format\n        for (int c = 0; c < 3; ++c) {\n            int i = 0;\n            for (int row = 0; row < dst_height; ++row) {\n                for (int col = 0; col < dst_width; ++col) {\n                    output[b * 3 * dst_height * dst_width + c * dst_height * dst_width + i] =\n                            channels[c].at<float>(row, col);\n                    ++i;\n                }\n            }\n        }\n    }\n}\n\nstd::vector<float> softmax(float* prob, int n) {\n    std::vector<float> res;\n    float sum = 0.0f;\n    float t;\n    for (int i = 0; i < n; i++) {\n        t = expf(prob[i]);\n        res.push_back(t);\n        sum += t;\n    }\n    for (int i = 0; i < n; i++) {\n        res[i] /= sum;\n    }\n    return res;\n}\n\nstd::vector<int> topk(const std::vector<float>& vec, int k) {\n    std::vector<int> topk_index;\n    std::vector<size_t> vec_index(vec.size());\n    std::iota(vec_index.begin(), vec_index.end(), 0);\n\n    
std::sort(vec_index.begin(), vec_index.end(),\n              [&vec](size_t index_1, size_t index_2) { return vec[index_1] > vec[index_2]; });\n\n    int k_num = std::min<int>(vec.size(), k);\n\n    for (int i = 0; i < k_num; ++i) {\n        topk_index.push_back(vec_index[i]);\n    }\n\n    return topk_index;\n}\n\nstd::vector<std::string> read_classes(std::string file_name) {\n    std::vector<std::string> classes;\n    std::ifstream ifs(file_name, std::ios::in);\n    if (!ifs.is_open()) {\n        std::cerr << file_name << \" is not found, pls refer to README and download it.\" << std::endl;\n        assert(0);\n    }\n    std::string s;\n    while (std::getline(ifs, s)) {\n        classes.push_back(s);\n    }\n    ifs.close();\n    return classes;\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, float& gd, float& gw,\n                std::string& img_dir, std::string& type, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        auto net = std::string(argv[4]);\n        if (net[0] == 'n') {\n            gd = 0.50;\n            gw = 0.25;\n            max_channels = 1024;\n            type = \"n\";\n        } else if (net[0] == 's') {\n            gd = 0.50;\n            gw = 0.50;\n            max_channels = 1024;\n            type = \"s\";\n        } else if (net[0] == 'm') {\n            gd = 0.50;\n            gw = 1.00;\n            max_channels = 512;\n            type = \"m\";\n        } else if (net[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n            type = \"l\";\n        } else if (net[0] == 'x') {\n            gd = 1.0;\n            gw = 1.50;\n            max_channels = 512;\n            type = \"x\";\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 4) 
{\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nvoid prepare_buffers(ICudaEngine* engine, float** gpu_input_buffer, float** gpu_output_buffer, float** cpu_input_buffer,\n                     float** output_buffer_host) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)gpu_input_buffer, kBatchSize * 3 * kClsInputH * kClsInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)gpu_output_buffer, kBatchSize * kOutputSize * sizeof(float)));\n\n    *cpu_input_buffer = new float[kBatchSize * 3 * kClsInputH * kClsInputW];\n    *output_buffer_host = new float[kBatchSize * kOutputSize];\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* input, float* output,\n           int batchSize) {\n    CUDA_CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 * kClsInputH * kClsInputW * sizeof(float),\n                               cudaMemcpyHostToDevice, stream));\n    context.enqueueV2(buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                               stream));\n    cudaStreamSynchronize(stream);\n}\n\nvoid serialize_engine(float& gd, float& gw, std::string& wts_name, std::string& engine_name, std::string& type,\n                      int max_channels) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = 
builder->createBuilderConfig();\n    // Create model to populate the network, then set the outputs and create an engine\n    IHostMemory* serialized_engine = nullptr;\n    //engine = buildEngineYolo11Cls(max_batchsize, builder, config, DataType::kFLOAT, gd, gw, wts_name);\n    serialized_engine = buildEngineYolo11Cls(builder, config, DataType::kFLOAT, wts_name, gd, gw, type, max_channels);\n    assert(serialized_engine);\n    // Save engine to file\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cerr << \"Could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    // Close everything down\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nint main(int argc, char** argv) {\n    // yolo11_cls -s ../models/yolo11n-cls.wts ../models/yolo11n-cls.fp32.trt n\n    // yolo11_cls -d ../models/yolo11n-cls.fp32.trt ../images\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    float gd = 0.0f, gw = 0.0f;\n    
std::string img_dir;\n    std::string type;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, gd, gw, img_dir, type, max_channels)) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./yolo11_cls -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./yolo11_cls -d [.engine] ../images  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(gd, gw, wts_name, engine_name, type, max_channels);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* cpu_input_buffer = nullptr;\n    float* output_buffer_host = nullptr;\n    prepare_buffers(engine, &device_buffers[0], &device_buffers[1], &cpu_input_buffer, &output_buffer_host);\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // Read imagenet labels\n    auto classes = read_classes(\"imagenet_classes.txt\");\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n         
   img_name_batch.push_back(file_names[j]);\n        }\n\n        // Preprocess\n        batch_preprocess(img_batch, cpu_input_buffer);\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        infer(*context, stream, (void**)device_buffers, cpu_input_buffer, output_buffer_host, kBatchSize);\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n\n        // Postprocess and get top-k result\n        for (size_t b = 0; b < img_name_batch.size(); b++) {\n            float* p = &output_buffer_host[b * kOutputSize];\n            auto res = softmax(p, kOutputSize);\n            auto topk_idx = topk(res, 3);\n            std::cout << img_name_batch[b] << std::endl;\n            for (auto idx : topk_idx) {\n                std::cout << \"  \" << classes[idx] << \" \" << res[idx] << std::endl;\n            }\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    delete[] cpu_input_buffer;\n    delete[] output_buffer_host;\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n    return 0;\n}\n"
  },
  {
    "path": "yolo11/yolo11_cls_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport os\nimport shutil\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport torch\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\nwith open(\"imagenet_classes.txt\") as f:\n    classes = [line.strip() for line in f.readlines()]\n\n\nclass YoLo11TRT(object):\n    \"\"\"\n    description: A YOLO11 class that warps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n        self.mean = (0.485, 0.456, 0.406)\n        self.std = (0.229, 0.224, 0.225)\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            self.batch_size = engine.get_binding_shape(binding)[0]\n            size = trt.volume(engine.get_binding_shape(\n                binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # 
Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_input_image = np.empty(\n            shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            batch_image_raw.append(image_raw)\n            input_image = self.preprocess_cls_image(image_raw)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], 
batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n            classes_ls, predicted_conf_ls, category_id_ls = self.postprocess_cls(\n                output)\n            cv2.putText(batch_image_raw[i], str(\n                classes_ls), (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 1, cv2.LINE_AA)\n            print(classes_ls, predicted_conf_ls)\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_cls_image(self, raw_bgr_image, dst_width=224, dst_height=224):\n\n        \"\"\"\n            description: Convert BGR image to RGB,\n                         crop the center square frame,\n                         resize it to target size, 
normalize to [0,1],\n                         transform to NCHW format.\n            param:\n                raw_bgr_image: numpy array, raw BGR image\n                dst_width: int, target image width\n                dst_height: int, target image height\n            return:\n                batch_data: the processed float32 image with two leading\n                            singleton axes, shape (1, 1, 3, dst_height, dst_width)\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        # Crop the center square frame\n        m = min(h, w)\n        top = (h - m) // 2\n        left = (w - m) // 2\n        image = raw_bgr_image[top:top + m, left:left + m]\n\n        # Resize the cropped square to the target size\n        image = cv2.resize(image, (dst_width, dst_height), interpolation=cv2.INTER_LINEAR)\n\n        # Convert BGR to RGB\n        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)\n\n        # Normalize to [0,1]\n        image = image.astype(np.float32) / 255.0\n\n        # HWC to CHW format\n        image = image.transpose(2, 0, 1)\n\n        # CHW to NCHW format (add batch dimension)\n        image = np.expand_dims(image, axis=0)\n\n        # Convert the image to row-major order, also known as \"C order\"\n        image = np.ascontiguousarray(image)\n\n        # Add one more leading singleton axis; the caller copies the result into a (3, H, W) slot\n        batch_data = np.expand_dims(image, axis=0)\n\n        return batch_data\n\n    def postprocess_cls(self, output_data):\n        classes_ls = []\n        predicted_conf_ls = []\n        category_id_ls = []\n        output_data = output_data.reshape(self.batch_size, -1)\n        output_data = torch.Tensor(output_data)\n        p = torch.nn.functional.softmax(output_data, dim=1)\n        score, index = torch.topk(p, 3)\n        for ind in range(index.shape[0]):\n            input_category_id = index[ind][0].item()  # 716\n            category_id_ls.append(input_category_id)\n            predicted_confidence = score[ind][0].item()\n            
predicted_conf_ls.append(predicted_confidence)\n            classes_ls.append(classes[input_category_id])\n        return classes_ls, predicted_conf_ls, category_id_ls\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolo11_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolo11_wrapper = yolo11_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(\n            self.yolo11_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(\n            self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolo11_wrapper):\n        threading.Thread.__init__(self)\n        self.yolo11_wrapper = yolo11_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(\n            self.yolo11_wrapper.get_raw_image_zeros())\n        print(\n            'warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    engine_file_path = \"./yolo11x-cls-fp32.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLo11TRT instance\n    yolo11_wrapper = YoLo11TRT(engine_file_path)\n    try:\n        print('batch size is', yolo11_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(\n            yolo11_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # 
create a new thread to do warm_up\n            thread1 = warmUpThread(yolo11_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolo11_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolo11_wrapper.destroy()\n"
  },
  {
    "path": "yolo11/yolo11_det.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, float& gd, float& gw, int& max_channels,\n                      std::string& type) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    serialized_engine = buildEngineYolo11Det(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = 
(*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host, float** decode_ptr_host, float** decode_ptr_device,\n                    std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"Do not yet support GPU post processing for multiple batches\" << std::endl;\n            exit(0);\n        }\n        // Allocate memory for decode_ptr_host and copy to device\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           float* decode_ptr_host, float* decode_ptr_device, int model_bboxes, std::string cuda_post_process) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueueV2(buffers, 
stream, nullptr);\n    if (cuda_post_process == \"c\") {\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir, std::string& type,\n                std::string& cuda_post_process, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        auto sub_type = std::string(argv[4]);\n\n        if (sub_type[0] == 'n') {\n            gd = 0.50;\n            gw = 0.25;\n            max_channels = 1024;\n            type = \"n\";\n        } else if (sub_type[0] == 's') {\n       
     gd = 0.50;\n            gw = 0.50;\n            max_channels = 1024;\n            type = \"s\";\n        } else if (sub_type[0] == 'm') {\n            gd = 0.50;\n            gw = 1.00;\n            max_channels = 512;\n            type = \"m\";\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n            type = \"l\";\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.50;\n            max_channels = 512;\n            type = \"x\";\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    // yolo11_det -s ../models/yolo11n.wts ../models/yolo11n.fp32.trt n\n    // yolo11_det -d ../models/yolo11n.fp32.trt ../images c\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    std::string img_dir;\n    std::string cuda_post_process;\n    std::string type;\n    int model_bboxes;\n    float gd = 0, gw = 0;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, cuda_post_process, gd, gw, max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolo11_det -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr << \"./yolo11_det -d [.engine] ../images  [c/g]// deserialize plan file and run inference\"\n                  << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, gd, gw, max_channels, type);\n        return 0;\n  
  }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host, &decode_ptr_host,\n                   &decode_ptr_device, cuda_post_process);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize, decode_ptr_host,\n              decode_ptr_device, model_bboxes, cuda_post_process);\n        // Save the first 100 values of output_buffer_host, one per line\n        //        std::ofstream out(\"../models/output.txt\");\n        //        for (int j = 0; j < 100; j++) {\n        //            out << 
output_buffer_host[j] << std::endl;\n        //        }\n        //        out.close();\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n        } else if (cuda_post_process == \"g\") {\n            //Process gpu decode and nms results\n            batch_process(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\n        }\n        // Draw bounding boxes\n        draw_bbox(img_batch, res_batch);\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolo11/yolo11_det_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\nPOSE_NUM = 17 * 3\nDET_NUM = 6\nSEG_NUM = 32\nOBB_NUM = 1\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLo11 project.\n    param:\n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 
255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n        )\n\n\nclass YoLo11TRT(object):\n    \"\"\"\n    description: A YOLO11 class that wraps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a context on this device\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            self.batch_size = engine.get_binding_shape(binding)[0]\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        
self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.det_output_length = host_outputs[0].shape[0]\n\n    def infer(self, raw_image_generator):\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, 
deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid = self.post_process(\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\n                batch_origin_w[i]\n            )\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            input_image_path: str, image path\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n  
      image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate widht and height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image with long side while maintaining ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0]\n 
           y[:, 2] = x[:, 2]\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1]\n            y[:, 3] = x[:, 3]\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     a numpy array like [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a numpy array in which each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a numpy array in which each element is the score corresponding to a box\n            result_classid: final class ids, a numpy array in which each element is the class id corresponding to a box\n        \"\"\"\n        num_values_per_detection = DET_NUM + SEG_NUM + POSE_NUM + OBB_NUM\n        # Get the number of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        # pred = np.reshape(output[1:], (-1, 38))[:num, :]\n        pred = np.reshape(output[1:], (-1, num_values_per_detection))[:num, :]\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n     
       box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = (np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None)\n                      * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None))\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (x1, y1, x2, y2, conf, 
cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: a iou threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Get the boxes that score > CONF_THRESH\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Trandform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # clip the coordinates\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, -1] == boxes[:, -1]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolo11_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolo11_wrapper = yolo11_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in 
enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolo11_wrapper):\n        threading.Thread.__init__(self)\n        self.yolo11_wrapper = yolo11_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\n    engine_file_path = \"yolo11s.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\n                  \"traffic light\",\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\n                  \"frisbee\",\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\n                  \"surfboard\",\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", 
\"couch\",\n                  \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\n                  \"cell phone\",\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\n                  \"teddy bear\",\n                  \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLo11TRT instance\n    yolo11_wrapper = YoLo11TRT(engine_file_path)\n    try:\n        print('batch size is', yolo11_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolo11_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolo11_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolo11_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolo11_wrapper.destroy()\n"
  },
  {
    "path": "yolo11/yolo11_obb.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, std::string& type, float& gd, float& gw,\n                      int& max_channels) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    serialized_engine = buildEngineYolo11Obb(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = 
(*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host, float** decode_ptr_host, float** decode_ptr_device,\n                    std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than ICudaEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kObbInputH * kObbInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"GPU post-processing does not yet support batch sizes greater than 1\" << std::endl;\n            exit(1);\n        }\n        // Allocate host and device buffers for the decoded boxes\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           float* decode_ptr_host, float* decode_ptr_device, int model_bboxes, std::string cuda_post_process) {\n    // Infer on the batch asynchronously and DMA the output back to the host\n    auto start = std::chrono::system_clock::now();\n    
context.enqueueV2(buffers, stream, nullptr);\n    if (cuda_post_process == \"c\") {\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode_obb((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms_obb(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir, std::string& type,\n                std::string& cuda_post_process, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && argc == 5) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        std::string sub_type = std::string(argv[4]);\n        if (sub_type[0] == 'n') {\n            gd = 0.50;\n            gw = 0.25;\n            max_channels = 1024;\n            type = \"n\";\n        } 
else if (sub_type[0] == 's') {\n            gd = 0.50;\n            gw = 0.50;\n            max_channels = 1024;\n            type = \"s\";\n        } else if (sub_type[0] == 'm') {\n            gd = 0.50;\n            gw = 1.00;\n            max_channels = 512;\n            type = \"m\";\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n            type = \"l\";\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.50;\n            max_channels = 512;\n            type = \"x\";\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    // yolo11_obb -s ../models/yolo11n-obb.wts ../models/yolo11n-obb.fp32.trt n\n    // yolo11_obb -d ../models/yolo11n-obb.fp32.trt ../images c\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    std::string img_dir;\n    std::string type;\n    std::string cuda_post_process;\n    int model_bboxes;\n    float gd = 0.0f, gw = 0.0f;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, cuda_post_process, gd, gw, max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolo11_obb -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./yolo11_obb -d [.engine] ../images  [c/g]// deserialize plan file and run inference\"\n                  << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, type, gd, gw, max_channels);\n        
return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host, &decode_ptr_host,\n                   &decode_ptr_device, cuda_post_process);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kObbInputW, kObbInputH, stream);\n        // Run inference\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize, decode_ptr_host,\n              decode_ptr_device, model_bboxes, cuda_post_process);\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms_obb(res_batch, 
output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n        } else if (cuda_post_process == \"g\") {\n            // Process GPU decode and NMS results\n            //            batch_process_obb(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\n            // TODO: OBB postprocess on GPU\n            std::cerr << \"OBB postprocessing is not supported on the GPU right now\" << std::endl;\n        }\n        // Draw bounding boxes\n        draw_bbox_obb(img_batch, res_batch);\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    // std::cout << \"\\nOutput:\\n\\n\";\n    // for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    // std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolo11/yolo11_obb_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport sys\nimport threading\nimport time\nimport cv2\nimport math\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\nPOSE_NUM = 17 * 3\nDET_NUM = 6\nSEG_NUM = 32\nOBB_NUM = 1\n\nINPUT_W = 640\nINPUT_H = 640\n\n\nclass Detection:\n    def __init__(self, bbox, score, class_id, angle):\n        self.bbox = bbox\n        self.score = score\n        self.class_id = class_id\n        self.angle = angle\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef get_corner(img, box: Detection):\n    \"\"\"\n    description: Get the four corner points of the rotated bounding box\n    param:\n        img:    an opencv image object (numpy array)\n        box:    a Detection object containing bbox [cx,cy,w,h] and angle (radians)\n    return:\n        corners: four corner points of the rotated bounding box as numpy array [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]\n    \"\"\"\n    # Extract box parameters\n    cx, cy, w, h = box.bbox\n    angle = box.angle * 180.0 / math.pi  # Convert radians to degrees\n\n    # Swap width and height if height >= width\n    if h >= w:\n        w, h = h, w\n        angle = (angle + 90.0) % 180.0  # Adjust angle\n\n    # Ensure angle is between 0 and 180 degrees\n    if angle < 0:\n        angle += 360.0\n    if angle > 180.0:\n        angle -= 180.0\n\n    # Convert to normalized angle (0-180)\n    normal_angle = angle % 180.0\n    if normal_angle < 0:\n        normal_angle += 180.0\n\n    
# Convert back to radians for calculation\n    angle_rad = angle * math.pi / 180.0\n    cos_val = math.cos(angle_rad)\n    sin_val = math.sin(angle_rad)\n\n    # Calculate boundaries\n    l_x = cx - w / 2\n    r_x = cx + w / 2\n    t_y = cy - h / 2\n    b_y = cy + h / 2\n\n    # Scale coordinates using get_rect_obb (matching C++ version)\n    bbox = [l_x, t_y, r_x, b_y]\n    rect = get_rect_obb(img, bbox)\n\n    # Calculate center and dimensions of scaled box\n    x_ = (rect[0] + rect[0] + rect[2]) / 2  # rect.x + rect.width/2\n    y_ = (rect[1] + rect[1] + rect[3]) / 2  # rect.y + rect.height/2\n    width = rect[2]\n    height = rect[3]\n\n    # Calculate vectors\n    vec1x = width / 2 * cos_val\n    vec1y = width / 2 * sin_val\n    vec2x = -height / 2 * sin_val\n    vec2y = height / 2 * cos_val\n\n    # Calculate four corners\n    corners = np.array([\n        [int(round(x_ + vec1x + vec2x)), int(round(y_ + vec1y + vec2y))],  # Top-left\n        [int(round(x_ + vec1x - vec2x)), int(round(y_ + vec1y - vec2y))],  # Top-right\n        [int(round(x_ - vec1x - vec2x)), int(round(y_ - vec1y - vec2y))],  # Bottom-right\n        [int(round(x_ - vec1x + vec2x)), int(round(y_ - vec1y + vec2y))]   # Bottom-left\n    ], dtype=np.int32)\n\n    # Clip to image boundaries\n    h, w = img.shape[:2]\n    corners[:, 0] = np.clip(corners[:, 0], 0, w - 1)\n    corners[:, 1] = np.clip(corners[:, 1], 0, h - 1)\n\n    return corners\n\n\ndef get_rect_obb(img, bbox):\n    \"\"\"\n    Scale coordinates according to image resize ratio (matching C++ version)\n    param:\n        img: OpenCV image (numpy array)\n        bbox: [left, top, right, bottom]\n    return:\n        [x, y, width, height]\n    \"\"\"\n    l_x, t_y, r_x, b_y = bbox\n    r_w = INPUT_W / img.shape[1]  # INPUT_W should be your model input width\n    r_h = INPUT_H / img.shape[0]  # INPUT_H should be your model input height\n\n    if r_h > r_w:\n        l_x = l_x\n        r_x = r_x\n        t_y = t_y - (INPUT_H - r_w * 
img.shape[0]) / 2\n        b_y = b_y - (INPUT_H - r_w * img.shape[0]) / 2\n        l_x = l_x / r_w\n        r_x = r_x / r_w\n        t_y = t_y / r_w\n        b_y = b_y / r_w\n    else:\n        l_x = l_x - (INPUT_W - r_h * img.shape[1]) / 2\n        r_x = r_x - (INPUT_W - r_h * img.shape[1]) / 2\n        t_y = t_y\n        b_y = b_y\n        l_x = l_x / r_h\n        r_x = r_x / r_h\n        t_y = t_y / r_h\n        b_y = b_y / r_h\n\n    l_x = max(0.0, l_x)\n    t_y = max(0.0, t_y)\n    width = max(0, min(int(round(r_x - l_x)), img.shape[1] - int(round(l_x))))\n    height = max(0, min(int(round(b_y - t_y)), img.shape[0] - int(round(t_y))))\n\n    return [int(round(l_x)), int(round(t_y)), width, height]\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one rotated bounding box on image img\n    param:\n        x:      a box in [cx, cy, w, h, angle] format\n        img:    an opencv image object\n        color:  color to draw rectangle\n        label:  str\n        line_thickness: int\n    \"\"\"\n    tl = line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n\n    # Get four corner points\n    corners = get_corner(img, x)\n    corners = corners.astype(int)\n\n    # Draw the rotated rectangle\n    cv2.polylines(img, [corners], isClosed=True, color=color, thickness=tl, lineType=cv2.LINE_AA)\n\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        # Use first corner point for label placement\n        p1 = tuple(corners[0])\n        w, h = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n\n        outside = p1[1] - h >= 3\n        p2 = (p1[0] + w, p1[1] - h - 3 if outside else p1[1] + h + 3)\n\n        cv2.rectangle(img, p1, p2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (p1[0], p1[1] - 2 if outside else p1[1] + h + 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            
thickness=tf,\n            lineType=cv2.LINE_AA\n        )\n\n\nclass YoLo11TRT(object):\n    \"\"\"\n    description: A YOLO11 class that wraps TensorRT ops, preprocessing and postprocessing ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a CUDA context on this device\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            self.batch_size = engine.get_binding_shape(binding)[0]\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                global INPUT_W, INPUT_H\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                INPUT_W = self.input_w\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                INPUT_H = self.input_h\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n   
     self.stream = stream\n        self.context = context\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.det_output_length = host_outputs[0].shape[0]\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        
# Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n            keep = self.post_process(\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\n                batch_origin_w[i]\n            )\n            # Draw rectangles and labels on the original image\n            for j in range(len(keep)):\n                box = keep[j]  # type: Detection\n                np.random.seed(int(keep[j].class_id))\n                color = [np.random.randint(0, 255) for _ in range(3)]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(keep[j].class_id)], keep[j].score\n                    ),\n                    color=color,\n                    line_thickness=1\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            
raw_bgr_image: numpy array, the raw BGR image (e.g. as returned by cv2.imread)\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate width, height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image along its long side while keeping the aspect ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes 
numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0]\n            y[:, 2] = x[:, 2]\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1]\n            y[:, 3] = x[:, 3]\n            y /= r_h\n\n        return y\n\n    def covariance_matrix(self, res: Detection):\n        \"\"\"\n        description: Generating covariance matrix from obbs.\n        param:\n            box (np.ndarray): A numpy array representing rotated bounding box, with xywhr format.\n\n        return:\n            tuple: (a, b, c) values of covariance matrix\n        \"\"\"\n        w = res.bbox[2]\n        h = res.bbox[3]\n        angle = res.angle\n\n        a = w * w / 12.0\n        b = h * h / 12.0\n        c = angle\n\n        cos_r = math.cos(c)\n        sin_r = math.sin(c)\n\n        cos_r2 = cos_r * cos_r\n        sin_r2 = sin_r * sin_r\n\n        a_val = a * cos_r2 + b * sin_r2\n        b_val = a * sin_r2 + b * cos_r2\n        c_val = (a - b) * cos_r * sin_r\n\n        return a_val, b_val, c_val\n\n    def probiou(self, box1: Detection, box2: Detection, eps=1e-7):\n        \"\"\"\n        description: Calculate the prob IoU between oriented bounding boxes.\n        param:\n            box1 (np.ndarray): First box in xywhr format\n            box2 (np.ndarray): Second box in xywhr format\n            eps (float): Small value to avoid division by zero\n        return:\n            float: 1 - hd where hd is the Bhattacharyya distance\n        \"\"\"\n        a1, b1, c1 = self.covariance_matrix(box1)\n        a2, b2, c2 = 
self.covariance_matrix(box2)\n\n        x1, y1 = box1.bbox[0], box1.bbox[1]\n        x2, y2 = box2.bbox[0], box2.bbox[1]\n\n        t1 = ((a1 + a2) * (y1 - y2) ** 2 + (b1 + b2) * (x1 - x2) ** 2) / \\\n             ((a1 + a2) * (b1 + b2) - (c1 + c2) ** 2 + eps)\n        t1 *= 0.25\n\n        t2 = ((c1 + c2) * (x2 - x1) * (y1 - y2)) / \\\n             ((a1 + a2) * (b1 + b2) - (c1 + c2) ** 2 + eps)\n        t2 *= 0.5\n\n        t3 = ((a1 + a2) * (b1 + b2) - (c1 + c2) ** 2) / \\\n             (4 * math.sqrt(max(a1 * b1 - c1 * c1, 0.0)) *\n              math.sqrt(max(a2 * b2 - c2 * c2, 0.0)) + eps)\n        t3 = math.log(t3 + eps) * 0.5\n\n        bd = max(min(t1 + t2 + t3, 100.0), eps)\n        hd = math.sqrt(1.0 - math.exp(-bd) + eps)\n\n        return 1 - hd\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A numpy array like [num_boxes,cx,cy,w,h,conf,cls_id,angle cx,cy,w,h,conf,cls_id,angle ...]\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            keep: a list of Detection objects (bbox, score, class_id, angle) that survive\n                  confidence filtering and per-class ProbIoU NMS; an empty list if nothing passes\n        \"\"\"\n        num_values_per_detection = DET_NUM + SEG_NUM + POSE_NUM + OBB_NUM\n        # Get the num of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, num_values_per_detection))[:num, :]\n\n        # Filter by confidence threshold\n        mask = pred[:, 4] >= CONF_THRESH\n        pred = pred[mask]\n\n        if len(pred) == 0:\n            return []\n\n        m_map = {}\n        for i in range(len(pred)):\n            class_id = int(pred[i][5])\n            
if class_id not in m_map:\n                m_map[class_id] = []\n            m_map[class_id].append(Detection(pred[i][:4], pred[i][4], class_id, pred[i][89]))\n\n        res = []\n        for it in m_map:\n            dets = m_map[it]\n            dets = sorted(dets, key=lambda x: x.score, reverse=True)\n            for m in range(len(dets)):\n                if dets[m].score == 0.0:\n                    continue\n                item = dets[m]\n                res.append(item)\n                for n in range(m + 1, len(dets)):\n                    if dets[n].score == 0.0:\n                        continue\n                    if self.probiou(item, dets[n]) > IOU_THRESHOLD:\n                        dets[n].score = 0.0\n\n        keep = []\n        for i in range(len(res)):\n            if res[i].score > CONF_THRESH:\n                keep.append(res[i])\n\n        return keep\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolo11_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolo11_wrapper = yolo11_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolo11_wrapper):\n        threading.Thread.__init__(self)\n        self.yolo11_wrapper = yolo11_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, 
time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"./build/libmyplugins.so\"\n    engine_file_path = \"yolo11n-obb.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load DOTAV 1.5 labels\n\n    categories = [\"plane\", \"ship\", \"storage tank\", \"baseball diamond\", \"tennis court\",\n                  \"basketball court\", \"ground track field\", \"harbor\",\n                  \"bridge\", \"large vehicle\", \"small vehicle\", \"helicopter\",\n                  \"roundabout\", \"soccer ball field\", \"swimming pool\", \"container crane\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLo11TRT instance\n    yolo11_wrapper = YoLo11TRT(engine_file_path)\n    try:\n        print('batch size is', yolo11_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolo11_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolo11_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolo11_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolo11_wrapper.destroy()\n"
  },
  {
    "path": "yolo11/yolo11_pose.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, std::string& type, float& gd, float& gw,\n                      int& max_channels) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    serialized_engine = buildEngineYolo11Pose(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = 
(*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host, float** decode_ptr_host, float** decode_ptr_device,\n                    std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"Do not yet support GPU post processing for multiple batches\" << std::endl;\n            exit(0);\n        }\n        // Allocate memory for decode_ptr_host and copy to device\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           float* decode_ptr_host, float* decode_ptr_device, int model_bboxes, std::string cuda_post_process) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueueV2(buffers, 
stream, nullptr);\n    if (cuda_post_process == \"c\") {\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir, std::string& type,\n                std::string& cuda_post_process, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5 || argc == 7)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        auto sub_type = std::string(argv[4]);\n        if (sub_type[0] == 'n') {\n            gd = 0.50;\n            gw = 0.25;\n            max_channels = 1024;\n            type = \"n\";\n        } else if (sub_type[0] == 
's') {\n            gd = 0.50;\n            gw = 0.50;\n            max_channels = 1024;\n            type = \"s\";\n        } else if (sub_type[0] == 'm') {\n            gd = 0.50;\n            gw = 1.00;\n            max_channels = 512;\n            type = \"m\";\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n            type = \"l\";\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.50;\n            max_channels = 512;\n            type = \"x\";\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    // yolo11_pose -s ../models/yolo11n-pose.wts ../models/yolo11n-pose.fp32.trt n\n    // yolo11_pose -d ../models/yolo11n-pose.fp32.trt ../images c\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    std::string img_dir;\n    std::string type;\n    std::string cuda_post_process;\n    int model_bboxes;\n    float gd = 0.0f, gw = 0.0f;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, cuda_post_process, gd, gw, max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolo11_pose -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr << \"./yolo11_pose -d [.engine] ../images  [c/g]// deserialize plan file and run inference\"\n                  << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, type, gd, 
gw, max_channels);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host, &decode_ptr_host,\n                   &decode_ptr_device, cuda_post_process);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize, decode_ptr_host,\n              decode_ptr_device, model_bboxes, cuda_post_process);\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            
batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n        } else if (cuda_post_process == \"g\") {\n            // Process gpu decode and nms results\n            // TODO: pose postprocessing on GPU\n            std::cerr << \"pose postprocess is not supported on GPU right now\" << std::endl;\n        }\n        // Draw bounding boxes and keypoint lines\n        draw_bbox_keypoints_line(img_batch, res_batch);\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolo11/yolo11_pose_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\nPOSE_NUM = 17 * 3\nDET_NUM = 6\nSEG_NUM = 32\nOBB_NUM = 1\nkeypoint_pairs = [\n    (0, 1), (0, 2), (0, 5), (0, 6), (1, 2),\n    (1, 3), (2, 4), (5, 6), (5, 7), (5, 11),\n    (6, 8), (6, 12), (7, 9), (8, 10), (11, 12),\n    (11, 13), (12, 14), (13, 15), (14, 16)\n]\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLo11 project.\n    param:\n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        
cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n        )\n\n\nclass YoLo11TRT(object):\n    \"\"\"\n    description: A YOLO11 class that wraps TensorRT ops, preprocessing and postprocessing ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a CUDA context on this device\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            self.batch_size = engine.get_binding_shape(binding)[0]\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n  
              host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.det_output_size = host_outputs[0].shape[0]\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], 
cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n\n            result_boxes, result_scores, result_classid, keypoints = self.post_process(\n                output[i * (self.det_output_size): (i + 1) * (self.det_output_size)],\n                batch_origin_h[i], batch_origin_w[i]\n            )\n\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n\n                num_keypoints = len(keypoints[j]) // 3\n                points = []\n                for k in range(num_keypoints):\n                    x = keypoints[j][k * 3]\n                    y = keypoints[j][k * 3 + 1]\n                    confidence = keypoints[j][k * 3 + 2]\n                    if confidence > 0:\n                        points.append((int(x), int(y)))\n                    else:\n                        points.append(None)\n\n                # Draw lines between keypoints according to the keypoint index pairs\n                for pair in keypoint_pairs:\n                    partA, partB = pair\n                    if points[partA] and points[partB]:\n                        cv2.line(batch_image_raw[i], points[partA], points[partB], (0, 255, 0), 2)\n\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def 
get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read each image in the batch from its path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Generate all-zero images for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            raw_bgr_image: numpy array, the input image in BGR order\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate width, height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize so the long side fits the target while maintaining the aspect ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW 
format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy_with_keypoints(self, origin_h, origin_w, boxes, keypoints):\n\n        n = len(boxes)\n        box_array = np.zeros_like(boxes)\n        keypoint_array = np.zeros_like(keypoints)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        for i in range(n):\n            if r_h > r_w:\n                box = boxes[i]\n                lmk = keypoints[i]\n                box_array[i, 0] = box[0] / r_w\n                box_array[i, 2] = box[2] / r_w\n                box_array[i, 1] = (box[1] - (self.input_h - r_w * origin_h) / 2) / r_w\n                box_array[i, 3] = (box[3] - (self.input_h - r_w * origin_h) / 2) / r_w\n\n                for j in range(0, len(lmk), 3):\n                    keypoint_array[i, j] = lmk[j] / r_w\n                    keypoint_array[i, j + 1] = (lmk[j + 1] - (self.input_h - r_w * origin_h) / 2) / r_w\n                    keypoint_array[i, j + 2] = lmk[j + 2]\n            else:\n\n                box = boxes[i]\n                lmk = keypoints[i]\n\n                box_array[i, 0] = (box[0] - (self.input_w - r_h * origin_w) / 2) / r_h\n                box_array[i, 2] = (box[2] - (self.input_w - r_h * origin_w) / 2) / r_h\n                box_array[i, 1] = box[1] / r_h\n                box_array[i, 3] = box[3] / r_h\n\n                for j in range(0, len(lmk), 3):\n                    keypoint_array[i, j] = (lmk[j] - (self.input_w - r_h * origin_w) / 2) / r_h\n                    keypoint_array[i, j + 1] = lmk[j + 1] / r_h\n                    keypoint_array[i, j + 2] = lmk[j + 2]\n\n        return box_array, keypoint_array\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n       
 description: Post-process the prediction to include pose keypoints\n        param:\n            output:     A numpy array like [num_boxes, cx, cy, w, h, conf,\n            cls_id, px1, py1, pconf1,...px17, py17, pconf17] where p denotes pose keypoint\n            origin_h:   Height of original image\n            origin_w:   Width of original image\n        return:\n            result_boxes:    Final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\n            result_scores:   Final scores, a numpy array, each element is the score corresponding to box\n            result_classid:  Final classID, a numpy array, each element is the classid corresponding to box\n            result_keypoints: Final keypoints, a list of numpy arrays,\n            each element represents keypoints for a box, shaped as (#keypoints, 3)\n        \"\"\"\n        # Number of values per detection: 38 base values + 17 keypoints * 3 values each + angle\n        num_values_per_detection = DET_NUM + SEG_NUM + POSE_NUM + OBB_NUM\n        # Get the number of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray with the full detection shape\n        pred = np.reshape(output[1:], (-1, num_values_per_detection))[:num, :]\n\n        # Perform non-maximum suppression to filter the detections\n        boxes = self.non_max_suppression(\n            pred[:, :num_values_per_detection], origin_h, origin_w,\n            conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n\n        # Extract the bounding boxes, confidence scores, and class IDs\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        result_keypoints = boxes[:, -POSE_NUM - 1:-1] if len(boxes) else np.array([])\n\n        # Return the post-processed results including keypoints\n        return result_boxes, result_scores, result_classid, 
result_keypoints\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = np.clip(\n            inter_rect_x2 - inter_rect_x1 + 1, 0, None) * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' 
and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Keep the boxes with score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        res_array = np.copy(boxes)\n        box_pred_deep_copy = np.copy(boxes[:, :4])\n        keypoints_pred_deep_copy = np.copy(boxes[:, -POSE_NUM - 1:-1])\n        res_box, res_keypoints = self.xywh2xyxy_with_keypoints(\n            origin_h, origin_w, box_pred_deep_copy, keypoints_pred_deep_copy)\n        res_array[:, :4] = res_box\n        res_array[:, -POSE_NUM - 1:-1] = res_keypoints\n        # Clip the coordinates to the image bounds\n        res_array[:, 0] = np.clip(res_array[:, 0], 0, origin_w - 1)\n        res_array[:, 2] = np.clip(res_array[:, 2], 0, origin_w - 1)\n        res_array[:, 1] = np.clip(res_array[:, 1], 0, origin_h - 1)\n        res_array[:, 3] = np.clip(res_array[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = res_array[:, 4]\n        # Sort by confidence, highest first\n        res_array = res_array[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_res_array = []\n        while res_array.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(res_array[0, :4], 0), res_array[:, :4]) > nms_thres\n            label_match = res_array[0, 5] == res_array[:, 5]\n            invalid = large_overlap & label_match\n            keep_res_array.append(res_array[0])\n            res_array = res_array[~invalid]\n\n        res_array = 
np.stack(keep_res_array, 0) if len(keep_res_array) else np.array([])\n        return res_array\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolo11_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolo11_wrapper = yolo11_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolo11_wrapper):\n        threading.Thread.__init__(self)\n        self.yolo11_wrapper = yolo11_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"./build/libmyplugins.so\"\n    engine_file_path = \"yolo11n-pose.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLo11TRT instance\n    yolo11_wrapper = YoLo11TRT(engine_file_path)\n    try:\n        print('batch size is', yolo11_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolo11_wrapper.batch_size, 
image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolo11_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolo11_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolo11_wrapper.destroy()\n"
  },
  {
    "path": "yolo11/yolo11_seg.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\nconst static int kOutputSegSize = 32 * (kInputH / 4) * (kInputW / 4);\n\nstatic cv::Rect get_downscale_rect(float bbox[4], float scale) {\n\n    float left = bbox[0];\n    float top = bbox[1];\n    float right = bbox[0] + bbox[2];\n    float bottom = bbox[1] + bbox[3];\n\n    left = left < 0 ? 0 : left;\n    top = top < 0 ? 0 : top;\n    right = right > kInputW ? kInputW : right;\n    bottom = bottom > kInputH ? kInputH : bottom;\n\n    left /= scale;\n    top /= scale;\n    right /= scale;\n    bottom /= scale;\n    return cv::Rect(int(left), int(top), int(right - left), int(bottom - top));\n}\n\nstd::vector<cv::Mat> process_mask(const float* proto, int proto_size, std::vector<Detection>& dets) {\n\n    std::vector<cv::Mat> masks;\n    for (size_t i = 0; i < dets.size(); i++) {\n\n        cv::Mat mask_mat = cv::Mat::zeros(kInputH / 4, kInputW / 4, CV_32FC1);\n        auto r = get_downscale_rect(dets[i].bbox, 4);\n\n        for (int x = r.x; x < r.x + r.width; x++) {\n            for (int y = r.y; y < r.y + r.height; y++) {\n                float e = 0.0f;\n                for (int j = 0; j < 32; j++) {\n                    e += dets[i].mask[j] * proto[j * proto_size / 32 + y * mask_mat.cols + x];\n                }\n                e = 1.0f / (1.0f + expf(-e));\n                mask_mat.at<float>(y, x) = e;\n            }\n        }\n        cv::resize(mask_mat, mask_mat, cv::Size(kInputW, kInputH));\n        masks.push_back(mask_mat);\n    }\n    return masks;\n}\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, std::string& type, float& gd, float& gw,\n                      
int& max_channels) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    serialized_engine = buildEngineYolo11Seg(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_seg_buffer_device, float** output_buffer_host, float** output_seg_buffer_host,\n                    float** decode_ptr_host, float** decode_ptr_device, std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 3);\n    // In order to bind the buffers, we need to know the names of the input and 
output tensors.\n    // Note that indices are guaranteed to be less than ICudaEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    const int outputIndex_seg = engine->getBindingIndex(kProtoTensorName);\n\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    assert(outputIndex_seg == 2);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_seg_buffer_device, kBatchSize * kOutputSegSize * sizeof(float)));\n\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n        *output_seg_buffer_host = new float[kBatchSize * kOutputSegSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"GPU post-processing does not yet support batch sizes greater than 1\" << std::endl;\n            exit(1);\n        }\n        // Allocate host and device buffers for the decoded boxes\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, float* output_seg,\n           int batchsize, float* decode_ptr_host, float* decode_ptr_device, int model_bboxes,\n           std::string cuda_post_process) {\n    // Infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueueV2(buffers, stream, nullptr);\n    if (cuda_post_process == \"c\") {\n\n        std::cout << \"kOutputSize:\" << kOutputSize << std::endl;\n        
CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        std::cout << \"kOutputSegSize:\" << kOutputSegSize << std::endl;\n        CUDA_CHECK(cudaMemcpyAsync(output_seg, buffers[2], batchsize * kOutputSegSize * sizeof(float),\n                                   cudaMemcpyDeviceToHost, stream));\n\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir, std::string& type,\n                std::string& cuda_post_process, std::string& labels_filename, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && argc == 5) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        std::string sub_type 
= std::string(argv[4]);\n        if (sub_type[0] == 'n') {\n            gd = 0.50;\n            gw = 0.25;\n            max_channels = 1024;\n            type = \"n\";\n        } else if (sub_type[0] == 's') {\n            gd = 0.50;\n            gw = 0.50;\n            max_channels = 1024;\n            type = \"s\";\n        } else if (sub_type[0] == 'm') {\n            gd = 0.50;\n            gw = 1.00;\n            max_channels = 512;\n            type = \"m\";\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n            type = \"l\";\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.50;\n            max_channels = 512;\n            type = \"x\";\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 6) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n        labels_filename = std::string(argv[5]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    // yolo11_seg -s ../models/yolo11n-seg.wts ../models/yolo11n-seg.fp32.trt n\n    // yolo11_seg -d ../models/yolo11n-seg.fp32.trt ../images c coco.txt\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    std::string img_dir;\n    std::string type;\n    std::string cuda_post_process;\n    std::string labels_filename = \"coco.txt\";\n    int model_bboxes;\n    float gd = 0.0f, gw = 0.0f;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, cuda_post_process, labels_filename, gd, gw,\n                    max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolo11_seg -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./yolo11_seg -d 
[.engine] ../images  [c/g] coco_file// deserialize plan file and run inference\"\n                  << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, type, gd, gw, max_channels);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[3];\n    float* output_buffer_host = nullptr;\n    float* output_seg_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    std::unordered_map<int, std::string> labels_map;\n    read_labels(labels_filename, labels_map);\n    assert(kNumClass == labels_map.size());\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &device_buffers[2], &output_buffer_host,\n                   &output_seg_buffer_host, &decode_ptr_host, &decode_ptr_device, cuda_post_process);\n\n    // // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            
img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, output_seg_buffer_host, kBatchSize,\n              decode_ptr_host, decode_ptr_device, model_bboxes, cuda_post_process);\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n            for (size_t b = 0; b < img_batch.size(); b++) {\n                auto& res = res_batch[b];\n                cv::Mat img = img_batch[b];\n                auto masks = process_mask(&output_seg_buffer_host[b * kOutputSegSize], kOutputSegSize, res);\n                draw_mask_bbox(img, res, masks, labels_map);\n                cv::imwrite(\"_\" + img_name_batch[b], img);\n            }\n        } else if (cuda_post_process == \"g\") {\n            // Process gpu decode and nms results\n            // batch_process(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\n            // TODO: segmentation post-processing on GPU\n            std::cerr << \"seg_postprocess is not supported on GPU right now\" << std::endl;\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(device_buffers[2]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    delete[] output_seg_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    // std::cout << \"\\nOutput:\\n\\n\";\n    // for (unsigned int i 
= 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    // std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolo11/yolo11_seg_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\nPOSE_NUM = 17 * 3\nDET_NUM = 6\nSEG_NUM = 32\nOBB_NUM = 1\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLo11 project.\n    param:\n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 
255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n        )\n\n\nclass YoLo11TRT(object):\n    \"\"\"\n    description: A YOLO11 class that wraps TensorRT ops, preprocessing and postprocessing ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a context on this device.\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            self.batch_size = engine.get_binding_shape(binding)[0]\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        
self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n\n        # Data length\n        self.det_output_length = host_outputs[0].shape[0]\n        self.seg_output_length = host_outputs[1].shape[0]\n        self.seg_w = int(self.input_w / 4)\n        self.seg_h = int(self.input_h / 4)\n        # Prototype mask channels = total length / (mask height * mask width)\n        self.seg_c = int(self.seg_output_length / (self.seg_h * self.seg_w))\n        self.det_row_output_length = self.seg_c + DET_NUM + POSE_NUM + OBB_NUM\n\n        # Draw mask\n        self.colors_obj = Colors()\n\n    def infer(self, raw_image_generator):\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # 
Run inference.\n        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        cuda.memcpy_dtoh_async(host_outputs[1], cuda_outputs[1], stream)\n\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        output_proto_mask = host_outputs[1]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid, result_proto_coef = self.post_process(\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\n                batch_origin_w[i]\n            )\n\n            if result_proto_coef.shape[0] == 0:\n                continue\n            result_masks = self.process_mask(output_proto_mask, result_proto_coef, result_boxes, batch_origin_h[i],\n                                             batch_origin_w[i])\n\n            self.draw_mask(result_masks, colors_=[self.colors_obj(x, True) for x in result_classid],\n                           im_src=batch_image_raw[i])\n\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def 
get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read each image in the batch from its path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Yield zero-filled images for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            raw_bgr_image: a BGR image as a numpy array of shape (h, w, c)\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate width, height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize so the long side fits the target size while maintaining the aspect ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW 
format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Map nx4 boxes [x1, y1, x2, y2] from the padded network input back to the original image\n                        by removing the padding offset and undoing the resize (despite the name, columns 0/2\n                        are treated as x coordinates and columns 1/3 as y coordinates)\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A numpy array of boxes, each row is a box in network input coordinates\n        return:\n            y:          A numpy array of boxes, each row is a box [x1, y1, x2, y2] in original image coordinates\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0]\n            y[:, 2] = x[:, 2]\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1]\n            y[:, 3] = x[:, 3]\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A flat numpy array: the box count, then det_row_output_length values per box\n                        (4 box coordinates, conf, cls_id, then the mask coefficients, ...)\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a numpy array, each element is the score corresponding to a box\n            result_classid: final class ids, a numpy array, 
each element is the classid corresponding to box\n        \"\"\"\n        # Get the num of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, self.det_row_output_length))[:num, :]\n\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        result_proto_coef = boxes[:, DET_NUM:int(DET_NUM + SEG_NUM)] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid, result_proto_coef\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, 
b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = (np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None)\n                      * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None))\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: a iou threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Get the boxes that score > CONF_THRESH\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Trandform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # clip the coordinates\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n       
 keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, 5] == boxes[:, 5]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n    def sigmoid(self, x):\n        return 1 / (1 + np.exp(-x))\n\n    def scale_mask(self, mask, ih, iw):\n        mask = cv2.resize(mask, (self.input_w, self.input_h))\n        r_w = self.input_w / (iw * 1.0)\n        r_h = self.input_h / (ih * 1.0)\n        if r_h > r_w:\n            w = self.input_w\n            h = int(r_w * ih)\n            x = 0\n            y = int((self.input_h - h) / 2)\n        else:\n            w = int(r_h * iw)\n            h = self.input_h\n            x = int((self.input_w - w) / 2)\n            y = 0\n        crop = mask[y:y + h, x:x + w]\n        crop = cv2.resize(crop, (iw, ih))\n        return crop\n\n    def process_mask(self, output_proto_mask, result_proto_coef, result_boxes, ih, iw):\n        \"\"\"\n        description: Mask pred by yolo11 instance segmentation ,\n        param:\n            output_proto_mask: prototype mask e.g. 
(32, 160, 160) for 640x640 input\n            result_proto_coef: prototype mask coefficients (n, 32), n represents n results\n            result_boxes     :\n            ih: rows of original image\n            iw: cols of original image\n        return:\n            mask_result: (n, ih, iw)\n        \"\"\"\n        result_proto_masks = output_proto_mask.reshape(self.seg_c, self.seg_h, self.seg_w)\n        c, mh, mw = result_proto_masks.shape\n        print(result_proto_masks.shape)\n        print(result_proto_coef.shape)\n        masks = self.sigmoid((result_proto_coef @ result_proto_masks.astype(np.float32).reshape(c, -1))).reshape(-1, mh,\n                                                                                                                 mw)\n\n        mask_result = []\n        for mask, box in zip(masks, result_boxes):\n            mask_s = np.zeros((ih, iw))\n            crop_mask = self.scale_mask(mask, ih, iw)\n            x1 = int(box[0])\n            y1 = int(box[1])\n            x2 = int(box[2])\n            y2 = int(box[3])\n            crop = crop_mask[y1:y2, x1:x2]\n            crop = np.where(crop >= 0.5, 1, 0)\n            crop = crop.astype(np.uint8)\n            mask_s[y1:y2, x1:x2] = crop\n\n            mask_result.append(mask_s)\n        mask_result = np.array(mask_result)\n        return mask_result\n\n    def draw_mask(self, masks, colors_, im_src, alpha=0.5):\n        \"\"\"\n        description: Draw mask on image ,\n        param:\n            masks  : result_mask\n            colors_: color to draw mask\n            im_src : original image\n            alpha  : scale between original  image and mask\n        return:\n            no return\n        \"\"\"\n        if len(masks) == 0:\n            return\n        masks = np.asarray(masks, dtype=np.uint8)\n        masks = np.ascontiguousarray(masks.transpose(1, 2, 0))\n        masks = np.asarray(masks, dtype=np.float32)\n        colors_ = np.asarray(colors_, dtype=np.float32)\n     
   s = masks.sum(2, keepdims=True).clip(0, 1)\n        masks = (masks @ colors_).clip(0, 255)\n        im_src[:] = masks * alpha + im_src * (1 - s * alpha)\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolo11_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolo11_wrapper = yolo11_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolo11_wrapper):\n        threading.Thread.__init__(self)\n        self.yolo11_wrapper = yolo11_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nclass Colors:\n    def __init__(self):\n        hexs = ('FF3838', 'FF9D97', 'FF701F', 'FFB21D', 'CFD231', '48F90A',\n                '92CC17', '3DDB86', '1A9334', '00D4BB', '2C99A8', '00C2FF',\n                '344593', '6473FF', '0018EC', '8438FF', '520085', 'CB38FF',\n                'FF95C8', 'FF37C7')\n        self.palette = [self.hex2rgb(f'#{c}') for c in hexs]\n        self.n = len(self.palette)\n\n    def __call__(self, i, bgr=False):\n        c = self.palette[int(i) % self.n]\n        return (c[2], c[1], c[0]) if bgr else c\n\n    @staticmethod\n    def hex2rgb(h):  # rgb order (PIL)\n        return tuple(int(h[1 + i:1 + i + 2], 16) for i in (0, 2, 4))\n\n\nif __name__ == 
\"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\n    engine_file_path = \"yolo11n-seg.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\n                  \"traffic light\",\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\n                  \"frisbee\",\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\n                  \"surfboard\",\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n                  \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\n                  \"cell phone\",\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\n                  \"teddy bear\",\n                  \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLo11TRT instance\n    yolo11_wrapper = YoLo11TRT(engine_file_path)\n    try:\n        print('batch size is', yolo11_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolo11_wrapper.batch_size, 
image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolo11_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolo11_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolo11_wrapper.destroy()\n"
  },
  {
    "path": "yolo11_tripy/.gitignore",
    "content": "imagenet_classes.txt\n*.JPEG\n*.pt\n"
  },
  {
    "path": "yolo11_tripy/README.md",
    "content": "# YOLO11 Tripy\n\nThis example implements a YOLO11 classifier model using [Tripy](https://nvidia.github.io/TensorRT-Incubator/).\n\n## Running The Example\n\nRun the following commands from the [`yolo11_tripy`](./) directory:\n\n1. Install Dependencies:\n\n    ```bash\n    python3 -m pip install -r requirements.txt\n    ```\n\n2. Download ImageNet classes file:\n\n    ```bash\n    wget https://raw.githubusercontent.com/joannzhang00/ImageNet-dataset-classes-labels/main/imagenet_classes.txt\n    ```\n\n3. [*Optional*] Download some images:\n\n    ```bash\n    wget https://raw.githubusercontent.com/EliSchwartz/imagenet-sample-images/master/n01558993_robin.JPEG\n    wget https://raw.githubusercontent.com/EliSchwartz/imagenet-sample-images/master/n04389033_tank.JPEG\n    ```\n\n    You can skip this step if you already have images you'd like to classify.\n\n3. Build the model:\n\n    ```bash\n    python3 compile_classifier.py\n    ```\n\n    You can configure various aspects of the model when you compile.\n    Run `python3 compile_classifier.py -h` for details.\n\n4. Run inference:\n\n    ```bash\n    python3 classify.py n01558993_robin.JPEG n04389033_tank.JPEG\n    ```\n\n    The `classify.py` script allows you to pass one or more image file paths on the command line.\n    The images are batched and classified in a single forward pass.\n"
  },
  {
    "path": "yolo11_tripy/classify.py",
    "content": "import argparse\nimport os\n\nimport cv2\nimport numpy as np\nimport nvtripy as tp\nimport time\nfrom constants import IMAGE_H, IMAGE_W\n\nCURDIR = os.path.realpath(os.path.dirname(__file__))\n\n\ndef load_image(path):\n    return cv2.imread(path)\n\n\ndef preprocess(image):\n    h, w, _ = image.shape\n    # Crop the center square frame\n    m = min(h, w)\n    top = (h - m) // 2\n    left = (w - m) // 2\n    image = image[top:top + m, left:left + m]\n\n    # Resize the image with target size while maintaining ratio\n    image = cv2.resize(image, (IMAGE_H, IMAGE_W), interpolation=cv2.INTER_LINEAR)\n\n    # Convert BGR to RGB\n    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)\n\n    # Normalize to [0,1]\n    image = image.astype(np.float32) / 255.0\n\n    # HWC to CHW format\n    image = image.transpose(2, 0, 1)\n\n    # CHW to NCHW format (add batch dimension)\n    image = np.expand_dims(image, axis=0)\n\n    # Convert the image to row-major order, also known as \"C order\"\n    image = np.ascontiguousarray(image)\n\n    return image\n\n\ndef main():\n    parser = argparse.ArgumentParser(description=\"Classify an image using a YOLO11 classifier model.\")\n    parser.add_argument(\"images\", help=\"Images to classify\", nargs=\"+\")\n    parser.add_argument(\n        \"--model-path\",\n        help=\"Path to the compiled model\",\n        default=os.path.join(CURDIR, \"yolo11-cls.tpymodel\"),\n    )\n\n    parser.add_argument(\n        \"--imagenet-classes-file\",\n        help=\"Path to the ImageNet classes file (imagenet_classes.txt)\",\n        default=os.path.join(CURDIR, \"imagenet_classes.txt\"),\n    )\n\n    args, _ = parser.parse_known_args()\n\n    with open(args.imagenet_classes_file) as f:\n        CLASSES = [line.strip() for line in f.readlines()]\n\n    print(f\"Loading model: {args.model_path}...\")\n\n    model = tp.Executable.load(args.model_path)\n\n    input_info = model.input_infos[\"batch\"]\n    dtype = input_info.dtype\n\n    
if input_info.shape_bounds.max[0] < len(args.images):\n        raise ValueError(\n            f\"Model was compiled for a maximum of {input_info.shape_bounds.max[0]} image(s) \"\n            f\"per batch, but {len(args.images)} were provided.\"\n            f\"\\nPlease recompile the model with a larger maximum batch size using the \"\n            f\"`--max-images` argument in `compile_classifier.py`.\"\n        )\n\n    images = [preprocess(load_image(path)) for path in args.images]\n    batch = tp.Tensor(np.concatenate(images, axis=0))\n\n    # Warm up the model:\n    _, _ = model(tp.zeros_like(batch, dtype=dtype).eval())\n\n    # Cast the input based on the model type.\n    # Note that the result will be in GPU memory, so we don't need an explicit copy.\n    batch = tp.cast(batch, dtype).eval()\n\n    start = time.perf_counter()\n    batch_scores, batch_preds = model(batch)\n    end = time.perf_counter()\n\n    print(f\"Inference + Postprocessing took: {(end - start) * 1000:.3f} ms\")\n\n    # Copy the scores back to CPU memory and convert to numpy:\n    batch_scores = np.from_dlpack(tp.copy(batch_scores, device=tp.device(\"cpu\")))\n    batch_preds = np.from_dlpack(tp.copy(batch_preds, device=tp.device(\"cpu\")))\n\n    for path, scores, preds in zip(args.images, batch_scores, batch_preds):\n        print(f\"Top {len(preds)} predictions for:\", path)\n        for idx, (score, pred) in enumerate(zip(scores, preds)):\n            print(f\"    {idx + 1}. (confidence: {score:.3f}) {CLASSES[pred]}\")\n        print()\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "yolo11_tripy/compile_classifier.py",
    "content": "import argparse\nimport os\n\nimport nvtripy as tp\nimport requests\nimport torch\nfrom constants import IMAGE_C, IMAGE_H, IMAGE_W\nfrom model.model import Yolo11Cls\nfrom tqdm import tqdm\n\nCURDIR = os.path.realpath(os.path.dirname(__file__))\n\n\ndef get_model_config(model_variant):\n    config = {\n        \"model_variant\": model_variant,\n    }\n    if model_variant == \"n\":\n        config.update({\"gd\": 0.50, \"gw\": 0.25, \"max_channels\": 1024})\n    elif model_variant == \"s\":\n        config.update({\"gd\": 0.50, \"gw\": 0.50, \"max_channels\": 1024})\n    elif model_variant == \"m\":\n        config.update({\"gd\": 0.50, \"gw\": 1.00, \"max_channels\": 512})\n    elif model_variant == \"l\":\n        config.update({\"gd\": 1.0, \"gw\": 1.0, \"max_channels\": 512})\n    elif model_variant == \"x\":\n        config.update({\"gd\": 1.0, \"gw\": 1.50, \"max_channels\": 512})\n\n    return config\n\n\ndef download_weights(model_variant, directory):\n    out_path = os.path.join(directory, f\"yolo11{model_variant}-cls.pt\")\n\n    if os.path.exists(out_path):\n        print(f\"Checkpoint already exists at: {out_path}, skipping download.\")\n        return out_path\n\n    URL = f\"https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11{model_variant}-cls.pt\"\n\n    response = requests.get(URL, stream=True)\n    response.raise_for_status()\n    total_size = int(response.headers.get(\"content-length\", 0))\n\n    os.makedirs(directory, exist_ok=True)\n\n    with open(out_path, \"wb\") as f, tqdm(\n        desc=f\"Downloading checkpoint: yolo11{model_variant}-cls.pt\",\n        total=total_size,\n        unit=\"B\",\n        unit_scale=True,\n        unit_divisor=1024,\n    ) as progress_bar:\n        for chunk in response.iter_content(chunk_size=8192):\n            f.write(chunk)\n            progress_bar.update(len(chunk))\n\n    return out_path\n\n\ndef load_weights(weights_path, dtype):\n    checkpoint = 
torch.load(weights_path, weights_only=False)\n    torch_model = checkpoint[\"model\"].eval()\n    if dtype == tp.float16:\n        torch_model = torch_model.half()\n    else:\n        assert dtype == tp.float32, \"Unsupported dtype\"\n        torch_model = torch_model.float()\n\n    state_dict = torch_model.state_dict()\n\n    # Some weights from the training graph are not needed for inference:\n    def should_include(key):\n        return \"num_batches_tracked\" not in key\n\n    return {name: tp.Tensor(weight) for name, weight in state_dict.items() if should_include(name)}\n\n\ndef main():\n    parser = argparse.ArgumentParser(description=\"Compiles a YOLO11 classifier model with Tripy.\")\n    parser.add_argument(\n        \"--model-variant\",\n        help=\"Model variant (n, s, m, l, x)\",\n        default=\"n\",\n        choices=[\"n\", \"s\", \"m\", \"l\", \"x\"],\n    )\n    parser.add_argument(\n        \"-o\",\n        \"--output\",\n        help=\"Where to save the Tripy executable\",\n        default=\"yolo11-cls.tpymodel\",\n    )\n    parser.add_argument(\n        \"--checkpoints-dir\",\n        help=\"Where to save PyTorch checkpoints\",\n        default=os.path.join(CURDIR, \"checkpoints\"),\n    )\n    parser.add_argument(\n        \"--max-images\",\n        help=\"Maximum number of images the model will be able to classify at once, i.e. 
the maximum batch size.\",\n        default=10,\n        type=int,\n    )\n    parser.add_argument(\n        \"--dtype\",\n        help=\"Data type to use for inference\",\n        default=\"float16\",\n        choices=[\"float32\", \"float16\"],\n    )\n\n    args, _ = parser.parse_known_args()\n\n    config = get_model_config(args.model_variant)\n    dtype = getattr(tp, args.dtype)\n    model = Yolo11Cls(**config, dtype=dtype)\n\n    weights_path = download_weights(args.model_variant, args.checkpoints_dir)\n\n    model.load_state_dict(load_weights(weights_path, dtype))\n\n    # We compile not only the classifier itself, but also accelerate the postprocessing:\n    def infer(batch):\n        out = model(batch)\n        out = tp.softmax(out, dim=1)\n        batch_scores, batch_preds = tp.topk(out, 3, dim=-1)\n        return batch_scores, batch_preds\n\n    print(\"Compiling YOLO11 classifier + postprocessing. This may take a few moments...\")\n    executable = tp.compile(\n        infer,\n        args=[\n            tp.InputInfo(\n                [\n                    # Support a range of batch sizes from 1 to `max_images`, optimizing for the midpoint:\n                    (1, (args.max_images + 1) // 2, args.max_images),\n                    IMAGE_C,\n                    IMAGE_H,\n                    IMAGE_W,\n                ],\n                dtype=dtype,\n            ),\n        ],\n    )\n\n    print(f\"Saving compiled executable to: {args.output}\")\n    executable.save(args.output)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "yolo11_tripy/constants.py",
    "content": "IMAGE_C = 3\nIMAGE_H = 224\nIMAGE_W = 224\n"
  },
  {
    "path": "yolo11_tripy/model/block.py",
    "content": "import nvtripy as tp\n\n\nclass ConvBnSilu(tp.Module):\n    def __init__(self, in_channels, out_channels, kernel_dims, stride, dtype):\n        super().__init__()\n        self.conv = tp.Conv(\n            in_channels,\n            out_channels,\n            kernel_dims,\n            stride=stride,\n            padding=[(dim // 2, dim // 2) for dim in kernel_dims],\n            bias=False,\n            dtype=dtype,\n        )\n        self.bn = tp.BatchNorm(out_channels, eps=1e-3, dtype=dtype)\n\n    def forward(self, x):\n        x = self.conv(x)\n        x = self.bn(x)\n        x = tp.silu(x)\n        return x\n\n\nclass Bottleneck(tp.Module):\n    def __init__(\n        self,\n        in_channels,\n        out_channels,\n        shortcut,\n        kernel_dims1,\n        kernel_dims2,\n        expansion_ratio,\n        dtype,\n    ):\n        super().__init__()\n        expanded_out_channels = int(out_channels * expansion_ratio)\n        self.cv1 = ConvBnSilu(in_channels, expanded_out_channels, kernel_dims1, stride=(1, 1), dtype=dtype)\n        self.cv2 = ConvBnSilu(\n            expanded_out_channels,\n            out_channels,\n            kernel_dims2,\n            stride=(1, 1),\n            dtype=dtype,\n        )\n\n        self.shortcut = shortcut and in_channels == out_channels\n\n    def forward(self, x):\n        out = self.cv1(x)\n        out = self.cv2(out)\n        if self.shortcut:\n            out += x\n        return out\n\n\nclass C3k(tp.Module):\n    def __init__(\n        self,\n        in_channels,\n        out_channels,\n        num_layers,\n        shortcut,\n        kernel_dims1,\n        kernel_dims2,\n        expansion_ratio,\n        dtype,\n    ):\n        super().__init__()\n        expanded_out_channels = int(out_channels * expansion_ratio)\n\n        self.cv1 = ConvBnSilu(\n            in_channels,\n            expanded_out_channels,\n            kernel_dims=(1, 1),\n            stride=(1, 1),\n            
dtype=dtype,\n        )\n        self.cv2 = ConvBnSilu(\n            in_channels,\n            expanded_out_channels,\n            kernel_dims=(1, 1),\n            stride=(1, 1),\n            dtype=dtype,\n        )\n\n        self.m = tp.Sequential(\n            *[\n                Bottleneck(\n                    expanded_out_channels,\n                    expanded_out_channels,\n                    shortcut,\n                    kernel_dims1,\n                    kernel_dims2,\n                    1.0,\n                    dtype=dtype,\n                )\n                for _ in range(num_layers)\n            ]\n        )\n\n        self.cv3 = ConvBnSilu(\n            2 * expanded_out_channels,\n            out_channels,\n            kernel_dims=(1, 1),\n            stride=(1, 1),\n            dtype=dtype,\n        )\n\n    def forward(self, x):\n        out1 = self.cv1(x)\n        out2 = self.cv2(x)\n\n        out1 = self.m(out1)\n        out = tp.concatenate((out1, out2), dim=1)\n        out = self.cv3(out)\n        return out\n\n\nclass C3K2(tp.Module):\n    def __init__(\n        self,\n        in_channels,\n        out_channels,\n        num_layers,\n        use_c3k,\n        shortcut,\n        expansion_ratio,\n        dtype,\n    ):\n        super().__init__()\n\n        expanded_out_channels = int(out_channels * expansion_ratio)\n        self.cv1 = ConvBnSilu(\n            in_channels,\n            2 * expanded_out_channels,\n            kernel_dims=(1, 1),\n            stride=(1, 1),\n            dtype=dtype,\n        )\n\n        self.m = tp.Sequential(\n            *[\n                (\n                    C3k(\n                        expanded_out_channels,\n                        expanded_out_channels,\n                        2,\n                        shortcut,\n                        (3, 3),\n                        (3, 3),\n                        0.5,\n                        dtype=dtype,\n                    )\n                    if 
use_c3k\n                    else Bottleneck(\n                        expanded_out_channels,\n                        expanded_out_channels,\n                        shortcut,\n                        (3, 3),\n                        (3, 3),\n                        0.5,\n                        dtype=dtype,\n                    )\n                )\n                for _ in range(num_layers)\n            ]\n        )\n\n        # Number of input channels to CV2 is the output channels of CV1 plus all\n        # output channels from the layers in `m`.\n        cv2_in_channels = (2 * expanded_out_channels) + (expanded_out_channels * num_layers)\n        self.cv2 = ConvBnSilu(cv2_in_channels, out_channels, (1, 1), (1, 1), dtype=dtype)\n\n    def forward(self, x):\n        x = self.cv1(x)\n\n        _, m_inp = tp.split(x, 2, dim=1)\n\n        cat = x\n        # We manually iterate over the Sequential module here since we need to access the intermediate outputs.\n        for layer in self.m:\n            m_inp = layer(m_inp)\n            cat = tp.concatenate((cat, m_inp), dim=1)\n        out = self.cv2(cat)\n        return out\n\n\nclass ConvBn(tp.Module):\n    def __init__(self, in_channels, out_channels, kernel_dims, stride, dtype, num_groups=1):\n        super().__init__()\n        self.conv = tp.Conv(\n            in_channels,\n            out_channels,\n            kernel_dims,\n            stride=stride,\n            padding=[(dim // 2, dim // 2) for dim in kernel_dims],\n            bias=False,\n            groups=num_groups,\n            dtype=dtype,\n        )\n        self.bn = tp.BatchNorm(out_channels, eps=1e-3, dtype=dtype)\n\n    def forward(self, x):\n        x = self.conv(x)\n        x = self.bn(x)\n        return x\n\n\nclass Attention(tp.Module):\n    def __init__(self, dim, num_heads, attn_ratio, dtype):\n        super().__init__()\n\n        self.dim = dim\n        self.num_heads = num_heads\n        head_dim = self.dim // num_heads\n        
self.key_dim = int(head_dim * attn_ratio)\n        self.scale = self.key_dim**-0.5\n        nh_kd = self.key_dim * num_heads\n        h = self.dim + nh_kd * 2\n\n        self.qkv = ConvBn(self.dim, h, (1, 1), (1, 1), dtype=dtype)\n        self.pe = ConvBn(self.dim, self.dim, (3, 3), (1, 1), dtype=dtype, num_groups=self.dim)\n        self.proj = ConvBn(self.dim, self.dim, (1, 1), (1, 1), dtype=dtype)\n\n    def forward(self, x):\n        B, _, H, W = x.shape\n        N = H * W\n\n        x = self.qkv(x)\n\n        x = tp.reshape(x, (B, self.num_heads, -1, N))\n\n        q, k, v = tp.split(x, [self.key_dim, self.key_dim, self.key_dim * 2], dim=2)\n\n        q_t = tp.transpose(q, 2, 3)\n\n        softmax = tp.softmax((q_t @ k) * self.scale, dim=3)\n\n        attn_t = tp.transpose(softmax, 2, 3)\n\n        matmul2 = v @ attn_t\n        reshape = tp.reshape(matmul2, (B, -1, H, W))\n\n        v_reshape = tp.reshape(v, (B, self.dim, H, W))\n\n        pe = self.pe(v_reshape)\n\n        sum = reshape + pe\n        proj = self.proj(sum)\n        return proj\n\n\nclass PSABlock(tp.Module):\n    def __init__(self, dim, attn_ratio, num_heads, shortcut, dtype):\n        super().__init__()\n\n        self.attn = Attention(dim, num_heads, attn_ratio, dtype=dtype)\n        self.shortcut = shortcut\n\n        self.ffn = tp.Sequential(\n            ConvBnSilu(dim, dim * 2, (1, 1), (1, 1), dtype=dtype),\n            ConvBn(dim * 2, dim, (1, 1), (1, 1), dtype=dtype),\n        )\n\n    def forward(self, x):\n        attn_out = self.attn(x)\n        if self.shortcut:\n            x = x + attn_out\n        else:\n            x = attn_out\n\n        ffn_out = self.ffn(x)\n        if self.shortcut:\n            x = x + ffn_out\n        else:\n            x = ffn_out\n\n        return x\n\n\nclass C2PSA(tp.Module):\n    def __init__(self, input_channels, output_channels, num_layers, expansion_ratio, dtype):\n        super().__init__()\n\n        expanded_input_channels = int(input_channels * 
expansion_ratio)\n\n        self.cv1 = ConvBnSilu(input_channels, 2 * expanded_input_channels, (1, 1), (1, 1), dtype=dtype)\n        self.m = tp.Sequential(\n            *[\n                PSABlock(\n                    expanded_input_channels,\n                    0.5,\n                    expanded_input_channels // 64,\n                    True,\n                    dtype=dtype,\n                )\n                for _ in range(num_layers)\n            ]\n        )\n\n        self.cv2 = ConvBnSilu(2 * expanded_input_channels, output_channels, (1, 1), (1, 1), dtype=dtype)\n\n    def forward(self, x):\n        x = self.cv1(x)\n\n        split1, y = tp.split(x, 2, dim=1)\n\n        y = self.m(y)\n\n        cat = tp.concatenate((split1, y), dim=1)\n        out = self.cv2(cat)\n        return out\n"
  },
  {
    "path": "yolo11_tripy/model/model.py",
    "content": "import math\n\nimport nvtripy as tp\n\nfrom .block import C2PSA, C3K2, ConvBnSilu\n\nNUM_CLASSES = 1000\n\n\ndef get_width(w, gw, max_channels, divisor=8):\n    return int(math.ceil((min(w, max_channels) * gw) / divisor)) * divisor\n\n\ndef get_depth(d, gd):\n    if d == 1:\n        return d\n\n    r = round(d * gd)\n    # Round ties for even numbers down:\n    if d * gd - int(d * gd) == 0.5 and (int(d * gd) % 2) == 0:\n        r -= 1\n    return max(r, 1)\n\n\nclass Yolo11Head(tp.Module):\n    def __init__(self, input_channels, dtype):\n        super().__init__()\n        self.conv = ConvBnSilu(input_channels, 1280, (1, 1), (1, 1), dtype=dtype)\n        self.linear = tp.Linear(1280, NUM_CLASSES, dtype=dtype)\n\n    def forward(self, x):\n        x = self.conv(x)\n        # Global average pooling:\n        x = tp.reshape(tp.mean(x, dim=(2, 3), keepdim=True), (-1, 1280))\n        x = self.linear(x)\n        return x\n\n\nclass Yolo11Cls(tp.Module):\n    def __init__(self, model_variant, gd, gw, max_channels, dtype=tp.float32):\n        super().__init__()\n        use_c3k = model_variant in {\"m\", \"l\", \"x\"}\n\n        self.model = tp.Sequential(\n            ConvBnSilu(3, get_width(64, gw, max_channels), (3, 3), (2, 2), dtype=dtype),\n            ConvBnSilu(\n                get_width(64, gw, max_channels),\n                get_width(128, gw, max_channels),\n                (3, 3),\n                (2, 2),\n                dtype=dtype,\n            ),\n            C3K2(\n                get_width(128, gw, max_channels),\n                get_width(256, gw, max_channels),\n                get_depth(2, gd),\n                use_c3k,\n                True,\n                0.25,\n                dtype=dtype,\n            ),\n            ConvBnSilu(\n                get_width(256, gw, max_channels),\n                get_width(256, gw, max_channels),\n                (3, 3),\n                (2, 2),\n                dtype=dtype,\n            ),\n            C3K2(\n                
get_width(256, gw, max_channels),\n                get_width(512, gw, max_channels),\n                get_depth(2, gd),\n                use_c3k,\n                True,\n                0.25,\n                dtype=dtype,\n            ),\n            ConvBnSilu(\n                get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels),\n                (3, 3),\n                (2, 2),\n                dtype=dtype,\n            ),\n            C3K2(\n                get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels),\n                get_depth(2, gd),\n                True,\n                True,\n                0.5,\n                dtype=dtype,\n            ),\n            ConvBnSilu(\n                get_width(512, gw, max_channels),\n                get_width(1024, gw, max_channels),\n                (3, 3),\n                (2, 2),\n                dtype=dtype,\n            ),\n            C3K2(\n                get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels),\n                get_depth(2, gd),\n                True,\n                True,\n                0.5,\n                dtype=dtype,\n            ),\n            C2PSA(\n                get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels),\n                get_depth(2, gd),\n                0.5,\n                dtype=dtype,\n            ),\n            Yolo11Head(get_width(1024, gw, max_channels), dtype=dtype),\n        )\n\n    def forward(self, x):\n        x = self.model(x)\n        return x\n"
  },
  {
    "path": "yolo11_tripy/requirements.txt",
    "content": "-f https://nvidia.github.io/TensorRT-Incubator/packages.html\nnvtripy>=0.1.1\nopencv-python-headless\nnumpy\ntorch\n"
  },
  {
    "path": "yolo26/.clang-format",
    "content": "# Google C/C++ Code Style settings (with 4-space)\n# Refered to https://github.com/kehanXue/google-style-clang-format/blob/master/.clang-format\n\nLanguage: Cpp\nBasedOnStyle: Google\nAccessModifierOffset: -1\nAlignAfterOpenBracket: Align\nAlignConsecutiveAssignments: None\nAlignOperands: Align\nAllowAllArgumentsOnNextLine: true\nAllowAllConstructorInitializersOnNextLine: true\nAllowAllParametersOfDeclarationOnNextLine: false\nAllowShortBlocksOnASingleLine: Empty\nAllowShortCaseLabelsOnASingleLine: false\nAllowShortFunctionsOnASingleLine: Inline\nAllowShortIfStatementsOnASingleLine: Never  # To avoid conflict, set this \"Never\" and each \"if statement\" should include brace when coding\nAllowShortLambdasOnASingleLine: Inline\nAllowShortLoopsOnASingleLine: false\nAlwaysBreakAfterReturnType: None\nAlwaysBreakTemplateDeclarations: Yes\nBinPackArguments: true\nBreakBeforeBraces: Custom\nBraceWrapping:\n  AfterCaseLabel: false\n  AfterClass: false\n  AfterStruct: false\n  AfterControlStatement: Never\n  AfterEnum: false\n  AfterFunction: false\n  AfterNamespace: false\n  AfterUnion: false\n  AfterExternBlock: false\n  BeforeCatch: false\n  BeforeElse: false\n  BeforeLambdaBody: false\n  IndentBraces: false\n  SplitEmptyFunction: false\n  SplitEmptyRecord: false\n  SplitEmptyNamespace: false\nBreakBeforeBinaryOperators: None\nBreakBeforeTernaryOperators: true\nBreakConstructorInitializers: BeforeColon\nBreakInheritanceList: BeforeColon\nColumnLimit: 120\nCompactNamespaces: false\nContinuationIndentWidth: 8\nCpp11BracedListStyle: true\nDerivePointerAlignment: false  # Make sure the * or & align on the left\nEmptyLineBeforeAccessModifier: LogicalBlock\nFixNamespaceComments: true\nIncludeBlocks: Preserve\nIndentCaseLabels: true\nIndentPPDirectives: None\nIndentWidth: 4\nKeepEmptyLinesAtTheStartOfBlocks: true\nMaxEmptyLinesToKeep: 1\nNamespaceIndentation: None\nObjCSpaceAfterProperty: false\nObjCSpaceBeforeProtocolList: true\nPointerAlignment: 
Left\nReflowComments: false\n# SeparateDefinitionBlocks: Always   # Only support since clang-format 14\nSpaceAfterCStyleCast: false\nSpaceAfterLogicalNot: false\nSpaceAfterTemplateKeyword: true\nSpaceBeforeAssignmentOperators: true\nSpaceBeforeCpp11BracedList: false\nSpaceBeforeCtorInitializerColon: true\nSpaceBeforeInheritanceColon: true\nSpaceBeforeParens: ControlStatements\nSpaceBeforeRangeBasedForLoopColon: true\nSpaceBeforeSquareBrackets: false\nSpaceInEmptyParentheses: false\nSpacesBeforeTrailingComments: 2\nSpacesInAngles: false\nSpacesInCStyleCastParentheses: false\nSpacesInContainerLiterals: false\nSpacesInParentheses: false\nSpacesInSquareBrackets: false\nStandard: c++11\nTabWidth: 8\nUseTab: Never"
  },
  {
    "path": "yolo26/.gitignore",
    "content": "**/build/**\n**/models/**\n**/*.onnx\n**/*.engine\n**/*.pt\n"
  },
  {
    "path": "yolo26/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(yolo26)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nset(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)\nenable_language(CUDA)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\ninclude_directories(${PROJECT_SOURCE_DIR}/plugin)\n\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\nif(CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n  message(\"embed_platform on\")\n  include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n  link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n  message(\"embed_platform off\")\n\n  # cuda\n  include_directories(/usr/local/cuda/include)\n  link_directories(/usr/local/cuda/lib64)\n\n  # tensorrt\n  include_directories(/workspace/shared/TensorRT-8.6.3/include)\n  link_directories(/workspace/shared/TensorRT-8.6.3/lib)\nendif()\n\nadd_library(yololayerplugins SHARED ${PROJECT_SOURCE_DIR}/plugin/yololayer.cu)\ntarget_link_libraries(yololayerplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nfile(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/src/*.cpp ${PROJECT_SOURCE_DIR}/src/*.cu)\n\nadd_executable(yolo26_det ${PROJECT_SOURCE_DIR}/yolo26_det.cpp ${SRCS})\ntarget_link_libraries(yolo26_det nvinfer)\ntarget_link_libraries(yolo26_det cudart)\ntarget_link_libraries(yolo26_det yololayerplugins)\ntarget_link_libraries(yolo26_det ${OpenCV_LIBS})\n\nadd_executable(yolo26_obb ${PROJECT_SOURCE_DIR}/yolo26_obb.cpp ${SRCS})\ntarget_link_libraries(yolo26_obb nvinfer)\ntarget_link_libraries(yolo26_obb cudart)\ntarget_link_libraries(yolo26_obb yololayerplugins)\ntarget_link_libraries(yolo26_obb ${OpenCV_LIBS})\n\nadd_executable(yolo26_cls ${PROJECT_SOURCE_DIR}/yolo26_cls.cpp ${SRCS})\ntarget_link_libraries(yolo26_cls nvinfer)\ntarget_link_libraries(yolo26_cls cudart)\ntarget_link_libraries(yolo26_cls 
yololayerplugins)\ntarget_link_libraries(yolo26_cls ${OpenCV_LIBS})"
  },
  {
    "path": "yolo26/README.md",
    "content": "## Introduction\n\nYolo26 model supports TensorRT-8.\n\nTraining code [link](https://github.com/ultralytics/ultralytics/archive/refs/tags/v8.4.0.zip)\n\n## Environment\n\n* cuda 12.4\n* cudnn 9.1.0.70\n* tensorrt 8.6.3\n* opencv 4.8.0\n* ultralytics 8.4.0\n\n## Support\n\n* [✅] Yolo26n-det, Yolo26s-det, Yolo26m-det, Yolo26l-det, Yolo26x-det, support FP32/FP16 and C++ API\n* [✅] Yolo26n-obb, Yolo26s-obb, Yolo26m-obb, Yolo26l-obb, Yolo26x-obb, support FP32/FP16 and C++ API\n* [✅] Yolo26n-cls, Yolo26s-cls, Yolo26m-cls, Yolo26l-cls, Yolo26x-cls, support FP32/FP16 and C++ API\n\n## COMING FEATURES\n* [⏳] Windows OS Support\n* [⏳] Support Batched Inputs\n* [⏳] Support Quantization\n* [⏳] Yolo26-pose models\n* [⏳] Yolo26-seg models\n\n## Config\n\n* Choose the YOLO26 sub-model n/s/m/l/x from command line arguments.\n* For other configs, see [include/config.h](include/config.h)\n\n## Build and Run\n\n1. Generate .wts from the PyTorch .pt file, or download .wts from the model zoo\n\n```shell\n# Download ultralytics\nwget https://github.com/ultralytics/ultralytics/archive/refs/tags/v8.4.4.zip -O ultralytics-8.4.4.zip\n# Unzip ultralytics\nunzip ultralytics-8.4.4.zip\ncd ultralytics-8.4.4\n# Download models for Detection\nwget https://github.com/ultralytics/assets/releases/download/v8.4.0/yolo26n.pt -O yolo26n.pt # to download other models, replace 'yolo26n.pt' with 'yolo26s.pt', 'yolo26m.pt', 'yolo26l.pt' or 'yolo26x.pt'\n# Generate .wts\ncp [PATH-TO-MAIN-FOLDER]/gen_wts.py .\npython gen_wts.py -w yolo26n.pt -o yolo26n.wts -t detect\n# A file 'yolo26n.wts' will be generated.\n\n# Download models for Obb\nwget https://github.com/ultralytics/assets/releases/download/v8.4.0/yolo26n-obb.pt -O yolo26n-obb.pt # to download other models, replace 'yolo26n-obb.pt' with 'yolo26s-obb.pt', 'yolo26m-obb.pt', 'yolo26l-obb.pt' or 'yolo26x-obb.pt'\n# Generate .wts\ncp [PATH-TO-MAIN-FOLDER]/gen_wts.py .\npython gen_wts.py -w yolo26n-obb.pt -o 
yolo26n-obb.wts -t obb\n# A file 'yolo26n-obb.wts' will be generated.\n\n# Download models for Cls\nwget https://github.com/ultralytics/assets/releases/download/v8.4.0/yolo26n-cls.pt -O yolo26n-cls.pt # to download other models, replace 'yolo26n-cls.pt' with 'yolo26s-cls.pt', 'yolo26m-cls.pt', 'yolo26l-cls.pt' or 'yolo26x-cls.pt'\n# Generate .wts\ncp [PATH-TO-MAIN-FOLDER]/gen_wts.py .\npython gen_wts.py -w yolo26n-cls.pt -o yolo26n-cls.wts -t cls\n# A file 'yolo26n-cls.wts' will be generated.\n\n```\n\n2. Build and run\n```shell\ncd [PATH-TO-MAIN-FOLDER]\nmkdir build\ncd build\ncmake ..\nmake\n```\n\n### Detection\n```shell\ncp [PATH-TO-ultralytics]/yolo26n.wts .\n# Build and serialize TensorRT engine\n./yolo26_det -s yolo26n.wts yolo26n.engine [n/s/m/l/x]\n# Run inference\n./yolo26_det -d yolo26n.engine ../images\n# results saved in build directory\n```\n\n### Obb\n```shell\ncp [PATH-TO-ultralytics]/yolo26n-obb.wts .\n# Build and serialize TensorRT engine\n./yolo26_obb -s yolo26n-obb.wts yolo26n-obb.engine [n/s/m/l/x]\n# Run inference\n./yolo26_obb -d yolo26n-obb.engine ../images\n# results saved in build directory\n```\n\n### Cls\n```shell\n# Generate the classification labels text file in the build folder, or download it\n# wget https://github.com/joannzhang00/ImageNet-dataset-classes-labels/blob/main/imagenet_classes.txt\n\ncp [PATH-TO-ultralytics]/yolo26n-cls.wts .\n# Build and serialize TensorRT engine\n./yolo26_cls -s yolo26n-cls.wts yolo26n-cls.engine [n/s/m/l/x]\n# Run inference\n./yolo26_cls -d yolo26n-cls.engine ../images\n# results saved in build directory\n```\n\n## More Information\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx)."
  },
  {
    "path": "yolo26/gen_wts.py",
    "content": "import sys  # noqa: F401\nimport argparse\nimport os\nimport struct\nimport torch\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\n    parser.add_argument('-w', '--weights', required=True,\n                        help='Input weights (.pt) file path (required)')\n    parser.add_argument(\n        '-o', '--output', help='Output (.wts) file path (optional)')\n    parser.add_argument(\n        '-t', '--type', type=str, default='detect', choices=['detect', 'cls', 'seg', 'pose', 'obb'],\n        help='Model task type: detect, cls, seg, pose or obb (default: detect)')\n    args = parser.parse_args()\n    if not os.path.isfile(args.weights):\n        raise SystemExit('Invalid input file')\n    if not args.output:\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\n    elif os.path.isdir(args.output):\n        args.output = os.path.join(\n            args.output,\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\n    return args.weights, args.output, args.type\n\n\npt_file, wts_file, m_type = parse_args()\n\nprint(f'Generating .wts for {m_type} model')\n\n# Load model\nprint(f'Loading {pt_file}')\n\n# Initialize\ndevice = 'cpu'\n\n# Load model\nmodel = torch.load(pt_file, map_location=device, weights_only=False)['model'].float()  # load to FP32\n\nif m_type in ['detect', 'seg', 'pose', 'obb']:\n    anchor_grid = model.model[-1].anchors * model.model[-1].stride[..., None, None]\n\n    delattr(model.model[-1], 'anchors')\n\nmodel.to(device).eval()\n\nwith open(wts_file, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolo26/include/block.h",
    "content": "#pragma once\n\n#include <map>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n\nusing namespace std;\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file);\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps);\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, std::vector<int> k, int s, std::string lname, int g = 1);\n\nnvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int n, bool c3k, bool shortcut, bool atnn, float e, std::string lname);\n\nnvinfer1::IElementWiseLayer* SPPF(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int k, bool shortcut, std::string lname);\n\nnvinfer1::IElementWiseLayer* C2PSA(nvinfer1::INetworkDefinition* network,\n                                   std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input,\n                                   int c1, int c2, int n, float e, std::string lname);\n\nnvinfer1::ILayer* DWConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname);\n\nnvinfer1::ILayer* conv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> 
weightMap,\n                       nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname, int g = 1,\n                       bool act = true);\n\nnvinfer1::IPluginV2Layer* addYoloLayer(nvinfer1::INetworkDefinition* network, nvinfer1::ITensor& input,\n                                       const std::vector<int>& strides, const std::vector<int>& fm_sizes,\n                                       int stridesLength, bool is_detection, bool is_segmentation, bool is_pose,\n                                       bool is_obb, int anchorCount);"
  },
  {
    "path": "yolo26/include/config.h",
    "content": "#define USE_FP16\n// #define USE_FP32\n// #define USE_INT8\n\nconst static char* kInputTensorName = \"images\";\nconst static char* kOutputTensorName = \"output\";\nconst static char* kProtoTensorName = \"proto\";\nconst static int kNumClass = 80;\nconst static int kPoseNumClass = 1;\nconst static int kNumberOfPoints = 17;  // number of keypoints total\n// obb model's number of classes\nconstexpr static int kObbNumClass = 15;\nconst static int kObbNe = 1;  // number of extra parameters\nconst static int kBatchSize = 1;\nconst static int kGpuId = 0;\nconst static int kInputH = 640;\nconst static int kInputW = 640;\nconst static int kObbInputH = 1024;\nconst static int kObbInputW = 1024;\nconst static float kNmsThresh = 0.45f;\nconst static float kConfThresh = 0.3f;\nconst static float kConfThreshKeypoints = 0.5f;  // keypoints confidence\nconst static int kMaxInputImageSize = 3000 * 3000;\nconst static int kMaxNumOutputBbox = 300;\n// Quantization input image folder path\nconst static char* kInputQuantizationFolder = \"./coco_calib\";\n\n// Classification model's number of classes\nconstexpr static int kClsNumClass = 1000;\n// Classification model's input shape\nconstexpr static int kClsInputH = 224;\nconstexpr static int kClsInputW = 224;"
  },
  {
    "path": "yolo26/include/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n\n// CUDA_CHECK uses std::cerr and assert, so include their headers here.\n#include <cassert>\n#include <iostream>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_"
  },
  {
    "path": "yolo26/include/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n 
   virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer1::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H"
  },
  {
    "path": "yolo26/include/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include \"NvInfer.h\"\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H"
  },
  {
    "path": "yolo26/include/model.h",
    "content": "#pragma once\n\n#include <assert.h>\n#include <string>\n#include \"NvInfer.h\"\n\nnvinfer1::IHostMemory* buildEngineYolo26Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type);\n\nnvinfer1::IHostMemory* buildEngineYolo26Obb(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type);\n\nnvinfer1::IHostMemory* buildEngineYolo26Cls(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type);\n"
  },
  {
    "path": "yolo26/include/postprocess.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\n// Preprocessing functions\ncv::Rect get_rect(cv::Mat& img, float bbox[4]);\n\n// NMS functions\nvoid decode(std::vector<Detection>& res, float* output);\n\nvoid batch_decode(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size);\n\nvoid decode_obb(std::vector<Detection>& res, float* output);\n\nvoid batch_decode_obb(std::vector<std::vector<Detection>>& batch_res, float* output, int batch_size, int output_size);\n\n// Drawing functions\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_bbox_obb(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_bbox_keypoints_line(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks,\n                    std::unordered_map<int, std::string>& labels_map);"
  },
  {
    "path": "yolo26/include/preprocess.h",
    "content": "#pragma once\n\n#include <map>\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\nvoid cuda_preprocess_init(int max_image_size);\n\nvoid cuda_preprocess_destroy();\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream);\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream);"
  },
  {
    "path": "yolo26/include/types.h",
    "content": "#pragma once\n#include \"config.h\"\n\nstruct alignas(float) Detection {\n    // center_x center_y w h\n    float bbox[4];\n    float conf;  // bbox_conf * cls_conf\n    float class_id;\n    float mask[32];\n    float keypoints[kNumberOfPoints * 3];  // 17*3 keypoints\n    float angle;                           // obb angle\n};\n\nstruct AffineMatrix {\n    float value[6];\n};\n\nconst int bbox_element =\n        sizeof(AffineMatrix) / sizeof(float) + 1;  // left, top, right, bottom, confidence, class, keepflag"
  },
  {
    "path": "yolo26/include/utils.h",
    "content": "#pragma once\n#include <dirent.h>\n#include <fstream>\n#include <opencv2/opencv.hpp>\n\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols * 1.0);\n    float r_h = input_h / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR* p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            // std::string cur_file_name(p_dir_name);\n            // cur_file_name += \"/\";\n            // cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            //            std::cout << \"Found file: \" << cur_file_name << std::endl;\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\ninline std::vector<std::string> read_classes(std::string file_name) {\n    std::vector<std::string> classes;\n    std::ifstream ifs(file_name, std::ios::in);\n    if (!ifs.is_open()) {\n        std::cerr << file_name << \" is not found, pls refer to README and download it.\" << std::endl;\n        assert(0);\n    }\n    std::string s;\n    while (std::getline(ifs, s)) {\n        // std::cout << \"Read class: \" << s << std::endl;\n        
classes.push_back(s);\n    }\n    ifs.close();\n    return classes;\n}\n\n// Function to trim leading and trailing whitespace from a string\nstatic inline std::string trim_leading_whitespace(const std::string& str) {\n    size_t first = str.find_first_not_of(' ');\n    if (std::string::npos == first) {\n        return str;\n    }\n    size_t last = str.find_last_not_of(' ');\n    return str.substr(first, (last - first + 1));\n}\n\n// Src: https://stackoverflow.com/questions/16605967\nstatic inline std::string to_string_with_precision(const float a_value, const int n = 2) {\n    std::ostringstream out;\n    out.precision(n);\n    out << std::fixed << a_value;\n    return out.str();\n}\n\nstatic inline int read_labels(const std::string labels_filename, std::unordered_map<int, std::string>& labels_map) {\n    std::ifstream file(labels_filename);\n    // Read each line of the file\n    std::string line;\n    int index = 0;\n    while (std::getline(file, line)) {\n        // Strip the line of any leading or trailing whitespace\n        line = trim_leading_whitespace(line);\n\n        // Add the stripped line to the labels_map, using the loop index as the key\n        labels_map[index] = line;\n        index++;\n    }\n    // Close the file\n    file.close();\n\n    return 0;\n}\n\nstatic inline bool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir,\n                              std::string& type, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        auto sub_type = std::string(argv[4]);\n\n        if (sub_type[0] == 'n') {\n            gd = 0.50;\n            gw = 0.25;\n            max_channels = 1024;\n            type = \"n\";\n        } else if (sub_type[0] == 's') {\n            gd = 0.50;\n            gw = 0.50;\n            max_channels = 1024;\n 
           type = \"s\";\n        } else if (sub_type[0] == 'm') {\n            gd = 0.50;\n            gw = 1.00;\n            max_channels = 512;\n            type = \"m\";\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n            type = \"l\";\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.50;\n            max_channels = 512;\n            type = \"x\";\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 4) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n    } else {\n        return false;\n    }\n    return true;\n}"
  },
  {
    "path": "yolo26/plugin/yololayer.cu",
    "content": "#include <assert.h>\n#include <math.h>\n#include <iostream>\n#include <vector>\n#include \"cuda_utils.h\"\n#include \"types.h\"\n#include \"yololayer.h\"\n\n__device__ float d_confThreshold = 0.4f;\n\nnamespace Tn {\ntemplate <typename T>\nvoid write(char*& buffer, const T& val) {\n    *reinterpret_cast<T*>(buffer) = val;\n    buffer += sizeof(T);\n}\n\ntemplate <typename T>\nvoid read(const char*& buffer, T& val) {\n    val = *reinterpret_cast<const T*>(buffer);\n    buffer += sizeof(T);\n}\n}  // namespace Tn\n\n__device__ float sigmoid(float x) {\n    return 1.0f / (1.0f + exp(-x));\n}\n\nnamespace nvinfer1 {\n\nvoid setPluginDeviceParams(float confThreshold) {\n    cudaMemcpyToSymbol(d_confThreshold, &confThreshold, sizeof(float));\n}\n\nYoloLayerPlugin::YoloLayerPlugin(int classCount, int numberOfPoints, int maxDetections, bool isDetection,\n                                 bool isSegmentation, bool isPose, bool isObb, int anchor_count) {\n\n    mClassCount = classCount;\n    mNumberOfPoints = numberOfPoints;\n    mThreadCount = 256;\n    mMaxDetections = maxDetections;\n    mIsDetection = isDetection;\n    mIsSegmentation = isSegmentation;\n    mIsPose = isPose;\n    mIsObb = isObb;\n    mAnchorCount = anchor_count;\n\n    /*\n    std::cout << \"YoloLayerPlugin created with the following parameters:\" << std::endl;\n    std::cout << \"  Class Count: \" << mClassCount << std::endl;\n    std::cout << \"  Number of Points: \" << mNumberOfPoints << std::endl;\n    std::cout << \"  Confidence Threshold Keypoints: \" << mConfThreshold << std::endl;\n    std::cout << \"  Max Detections: \" << mMaxDetections << std::endl;\n    std::cout << \"  Is Detection: \" << mIsDetection << std::endl;\n    std::cout << \"  Is Segmentation: \" << mIsSegmentation << std::endl;\n    std::cout << \"  Is Pose: \" << mIsPose << std::endl;\n    std::cout << \"  Is OBB: \" << mIsObb << std::endl;\n    std::cout << \"  Anchor Count: \" << mAnchorCount << std::endl;\n    
std::cout << \"  Strides: \";\n    for (int i = 0; i < mStridesLength; ++i) {\n        std::cout << mStrides[i] << \" \";\n    }\n    std::cout << std::endl;\n    */\n}\n\nYoloLayerPlugin::~YoloLayerPlugin() {}\n\nYoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length) {\n    using namespace Tn;\n    const char *d = reinterpret_cast<const char*>(data), *a = d;\n    read(d, mClassCount);\n    read(d, mNumberOfPoints);\n    read(d, mThreadCount);\n    read(d, mMaxDetections);\n    read(d, mIsDetection);\n    read(d, mIsSegmentation);\n    read(d, mIsPose);\n    read(d, mIsObb);\n    read(d, mAnchorCount);\n\n    assert(d == a + length);\n}\n\nvoid YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT {\n\n    using namespace Tn;\n    char *d = static_cast<char*>(buffer), *a = d;\n    write(d, mClassCount);\n    write(d, mNumberOfPoints);\n    write(d, mThreadCount);\n    write(d, mMaxDetections);\n    write(d, mIsDetection);\n    write(d, mIsSegmentation);\n    write(d, mIsPose);\n    write(d, mIsObb);\n    write(d, mAnchorCount);\n\n    assert(d == a + getSerializationSize());\n}\n\nsize_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT {\n    return sizeof(mClassCount) + sizeof(mNumberOfPoints) + sizeof(mThreadCount) + sizeof(mMaxDetections) +\n           sizeof(mIsDetection) + sizeof(mIsSegmentation) + sizeof(mIsPose) + sizeof(mIsObb) + sizeof(mAnchorCount);\n}\n\nint YoloLayerPlugin::initialize() TRT_NOEXCEPT {\n    return 0;\n}\n\nnvinfer1::Dims YoloLayerPlugin::getOutputDimensions(int index, const nvinfer1::Dims* inputs,\n                                                    int nbInputDims) TRT_NOEXCEPT {\n    int total_size = mMaxDetections * sizeof(Detection) / sizeof(float);\n    return nvinfer1::Dims3(total_size + 1, 1, 1);\n}\n\nvoid YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT {\n    mPluginNamespace = pluginNamespace;\n}\n\nconst char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT {\n  
  return mPluginNamespace;\n}\n\nnvinfer1::DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes,\n                                                      int nbInputs) const TRT_NOEXCEPT {\n    return nvinfer1::DataType::kFLOAT;\n}\n\nbool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                                   int nbInputs) const TRT_NOEXCEPT {\n    return false;\n}\n\nbool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT {\n    return false;\n}\n\nvoid YoloLayerPlugin::configurePlugin(nvinfer1::PluginTensorDesc const* in, int32_t nbInput,\n                                      nvinfer1::PluginTensorDesc const* out, int32_t nbOutput) TRT_NOEXCEPT {}\n\nvoid YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                                      IGpuAllocator* gpuAllocator) TRT_NOEXCEPT {}\n\nvoid YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\nconst char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT {\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nvoid YoloLayerPlugin::destroy() TRT_NOEXCEPT {\n    delete this;\n}\n\nnvinfer1::IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT {\n    YoloLayerPlugin* p = new YoloLayerPlugin(mClassCount, mNumberOfPoints, mMaxDetections, mIsDetection,\n                                             mIsSegmentation, mIsPose, mIsObb, mAnchorCount);\n    p->setPluginNamespace(mPluginNamespace);\n    return p;\n}\n\nint YoloLayerPlugin::enqueue(int batchSize, const void* const* inputs, void* const* outputs, void* workspace,\n                             cudaStream_t stream) TRT_NOEXCEPT {\n    gatherKernelLauncher(reinterpret_cast<const float* const*>(inputs), reinterpret_cast<float*>(outputs[0]), stream,\n                         batchSize);\n\n    
return 0;\n}\n\n__device__ float Logist(float data) {\n    return 1.f / (1.f + expf(-data));\n}\n\n__global__ void gatherKernel(const float* input, float* output, int num_elements, int max_out_object, int class_count,\n                             int nk, int output_elem, bool is_detection, bool is_segmentation, bool is_pose,\n                             bool is_obb) {\n    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx >= num_elements)\n        return;\n\n    int outputIdx = 0 * output_elem;  // TODO: ADD BATCH SUPPORT HERE\n    int anchor_size = -1;\n    float angle = 0.0f;\n\n    if (is_detection) {\n        anchor_size = 4 + class_count;\n    } else if (is_obb) {\n        anchor_size = 5 + class_count;\n        angle = input[idx * (anchor_size) + 4 + class_count];\n    } else {\n        // Segmentation and pose layouts are not implemented yet; return instead of\n        // indexing the input with the placeholder anchor_size of -1.\n        return;\n    }\n\n    float xmin = input[idx * (anchor_size) + 0];\n    float ymin = input[idx * (anchor_size) + 1];\n    float xmax = input[idx * (anchor_size) + 2];\n    float ymax = input[idx * (anchor_size) + 3];\n\n    float score = 0.0f;\n    int class_id = -1;\n    for (int c = 0; c < class_count; c++) {\n        float conf = input[idx * (anchor_size) + 4 + c];\n        if (conf > score) {\n            score = conf;\n            class_id = c;\n        }\n    }\n\n    if (score < d_confThreshold) {\n        return;\n    }\n\n    // output[outputIdx] holds the running detection count as a float.\n    int count = (int)atomicAdd(output + outputIdx, 1.0f);\n    if (count >= max_out_object) {\n        return;\n    }\n\n    int det_size = sizeof(Detection) / sizeof(float);\n    Detection* det = (Detection*)(output + outputIdx + 1 + count * det_size);\n\n    /*\n    float scale = fminf(640.0f / 1080.0f, 640.0f / 608.0f);    // TODO: GET FROM PARAMETERS WITH SCALE!\n    float offset_x = -scale * 1080.0f / 2.0f + 640.0f / 2.0f;  // TODO: GET FROM PARAMETERS WITH OFFSET!\n    float offset_y = -scale * 608.0f / 2.0f + 640.0f / 2.0f;   // TODO: GET FROM PARAMETERS WITH OFFSET!\n    \n\n    det->conf = score;\n    det->class_id = 1;  // TODO: ADD CLASS ID HERE\n    
det->bbox[0] = (xmin - offset_x) / scale;\n    det->bbox[1] = (ymin - offset_y) / scale;\n    det->bbox[2] = (xmax - offset_x) / scale;\n    det->bbox[3] = (ymax - offset_y) / scale;\n    */\n\n    det->conf = score;\n    det->class_id = class_id;\n    det->bbox[0] = xmin;\n    det->bbox[1] = ymin;\n    det->bbox[2] = xmax;\n    det->bbox[3] = ymax;\n\n    if (is_obb) {\n        det->angle = angle;\n    }\n\n    // TODO: ADD KEYPOINTS, SEGMENTATION, OBB HERE\n}\n\nvoid YoloLayerPlugin::gatherKernelLauncher(const float* const* inputs, float* outputs, cudaStream_t stream,\n                                           int batchSize) {\n    // TODO: ADD BATCH SUPPORT, CURRENTLY ONLY BATCH=1 IS SUPPORTED\n    // TODO: ADD SEGMENTATION, POSE, OBB SUPPORT\n    // TODO: num_elem = batch_size * anchor_num\n    const float* input = inputs[0];\n\n    int outputElem = mMaxDetections * sizeof(Detection) / sizeof(float) + 1;\n    int num_elem = mAnchorCount;  // Use anchor count from model configuration\n\n    dim3 blockSize(mThreadCount);\n    dim3 gridSize((num_elem + mThreadCount - 1) / mThreadCount);\n\n    cudaMemsetAsync(outputs, 0, batchSize * outputElem * sizeof(float), stream);  // TODO: adjust for batch size\n\n    gatherKernel<<<gridSize, blockSize, 0, stream>>>(input, outputs, num_elem, mMaxDetections, mClassCount,\n                                                     mNumberOfPoints, outputElem, mIsDetection, mIsSegmentation,\n                                                     mIsPose, mIsObb);\n}\n\nPluginFieldCollection YoloLayerPluginCreator::mFC{};\nstd::vector<PluginField> YoloLayerPluginCreator::mPluginAttributes;\n\nYoloLayerPluginCreator::YoloLayerPluginCreator() {\n    mPluginAttributes.clear();\n    mFC.nbFields = mPluginAttributes.size();\n    mFC.fields = mPluginAttributes.data();\n}\n\nconst char* YoloLayerPluginCreator::getPluginName() const TRT_NOEXCEPT {\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPluginCreator::getPluginVersion() const 
TRT_NOEXCEPT {\n    return \"1\";\n}\n\nconst PluginFieldCollection* YoloLayerPluginCreator::getFieldNames() TRT_NOEXCEPT {\n    return &mFC;\n}\n\nIPluginV2IOExt* YoloLayerPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT {\n\n    assert(fc->nbFields == 1);\n    assert(strcmp(fc->fields[0].name, \"combinedInfo\") == 0);\n    const int* combinedInfo = static_cast<const int*>(fc->fields[0].data);\n    int net_info_count = fc->fields[0].length;\n    assert(net_info_count >= 8);  // combinedInfo[0..7] are read below\n    int class_count = combinedInfo[0];\n    int number_of_points = combinedInfo[1];\n    int max_detections = combinedInfo[2];\n    bool is_detection = combinedInfo[3];\n    bool is_segmentation = combinedInfo[4];\n    bool is_pose = combinedInfo[5];\n    bool is_obb = combinedInfo[6];\n    int anchor_count = combinedInfo[7];\n\n    YoloLayerPlugin* plugin = new YoloLayerPlugin(class_count, number_of_points, max_detections, is_detection,\n                                                  is_segmentation, is_pose, is_obb, anchor_count);\n    plugin->setPluginNamespace(mNamespace.c_str());\n    return plugin;\n}\n\nIPluginV2IOExt* YoloLayerPluginCreator::deserializePlugin(const char* name, const void* serialData,\n                                                          size_t serialLength) TRT_NOEXCEPT {\n    YoloLayerPlugin* plugin = new YoloLayerPlugin(serialData, serialLength);\n    plugin->setPluginNamespace(mNamespace.c_str());\n    return plugin;\n}\n\n}  // namespace nvinfer1"
  },
  {
    "path": "yolo26/plugin/yololayer.h",
    "content": "#pragma once\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\nnamespace nvinfer1 {\n\nvoid setPluginDeviceParams(float confThreshold);\n\nclass API YoloLayerPlugin : public IPluginV2IOExt {\n   public:\n    YoloLayerPlugin(int classCount, int numberOfPoints, int maxDetections, bool isDetection, bool isSegmentation,\n                    bool isPose, bool isObb, int anchor_count);\n    YoloLayerPlugin(const void* data, size_t length);\n\n    ~YoloLayerPlugin();\n\n    int getNbOutputs() const TRT_NOEXCEPT override { return 1; }\n\n    nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n    int initialize() TRT_NOEXCEPT override;\n\n    virtual void terminate() TRT_NOEXCEPT override {}\n\n    virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n    virtual int enqueue(int batchSize, const void* const* inputs, void* const* outputs, void* workspace,\n                        cudaStream_t stream) TRT_NOEXCEPT override;\n\n    virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n    virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n    bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs,\n                                   int nbOutputs) const TRT_NOEXCEPT override {\n        return inOut[pos].type == nvinfer1::DataType::kFLOAT && inOut[pos].format == nvinfer1::TensorFormat::kLINEAR;\n    }\n\n    const char* getPluginType() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    void destroy() TRT_NOEXCEPT override;\n\n    IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n    nvinfer1::DataType getOutputDataType(int32_t index, 
nvinfer1::DataType const* inputTypes,\n                                         int32_t nbInputs) const TRT_NOEXCEPT override;\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                      int nbInputs) const TRT_NOEXCEPT override;\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n    void attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                         IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n    void configurePlugin(PluginTensorDesc const* in, int32_t nbInput, PluginTensorDesc const* out,\n                         int32_t nbOutput) TRT_NOEXCEPT override;\n\n    void detachFromContext() TRT_NOEXCEPT override;\n\n   private:\n    void gatherKernelLauncher(const float* const* inputs, float* outputs, cudaStream_t stream, int batchSize);\n    int mThreadCount = 256;\n    const char* mPluginNamespace = \"\";\n    int mClassCount;\n    int mNumberOfPoints;\n    int mMaxDetections;\n    bool mIsDetection;\n    bool mIsSegmentation;\n    bool mIsPose;\n    bool mIsObb;\n    int mAnchorCount;\n};\n\nclass API YoloLayerPluginCreator : public IPluginCreator {\n   public:\n    YoloLayerPluginCreator();\n\n    const char* getPluginName() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    const PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n    IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n    IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData,\n                                      size_t serialLength) TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override { mNamespace = pluginNamespace; }\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override { return mNamespace.c_str(); }\n\n   private:\n    std::string 
mNamespace;\n    static PluginFieldCollection mFC;\n    static std::vector<PluginField> mPluginAttributes;\n};\n\nREGISTER_TENSORRT_PLUGIN(YoloLayerPluginCreator);\n}  // namespace nvinfer1"
  },
  {
    "path": "yolo26/src/block.cpp",
    "content": "#include \"block.h\"\n#include <assert.h>\n#include <math.h>\n#include <fstream>\n#include <iostream>\n#include \"config.h\"\n#include \"model.h\"\n#include \"yololayer.h\"\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, nvinfer1::Weights> WeightMap;\n\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = nvinfer1::DataType::kFLOAT;\n\n        //uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n\n        for (uint32_t x = 0, y = size; x < y; x++) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        wt.count = size;\n        WeightMap[name] = wt;\n    }\n    return WeightMap;\n}\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        
scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights scale{nvinfer1::DataType::kFLOAT, scval, len};\n\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, shval, len};\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, pval, len};\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    nvinfer1::IScaleLayer* output = network->addScale(input, nvinfer1::ScaleMode::kCHANNEL, shift, scale, power);\n    assert(output);\n    output->setName(lname.c_str());\n    return output;\n}\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, std::vector<int> k, int s, std::string lname, int g) {\n\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    // auto pad\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n    conv->setNbGroups(g);\n    conv->setName((lname + \"/conv/Conv\").c_str());\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), 
nvinfer1::ActivationType::kSIGMOID);\n    sigmoid->setName((lname + \"/act/Sigmoid\").c_str());\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    ew->setName((lname + \"/act/Mul\").c_str());\n    return ew;\n}\n\nnvinfer1::ILayer* conv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                       nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname, int g,\n                       bool act) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    // auto pad\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n    conv->setNbGroups(g);\n    conv->setName((lname + \"/conv/Conv\").c_str());\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    if (!act)\n        return bn;\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    sigmoid->setName((lname + \"/act/Sigmoid\").c_str());\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    ew->setName((lname + \"/act/Mul\").c_str());\n    return ew;\n}\n\nstatic nvinfer1::ILayer* bottleneck(nvinfer1::INetworkDefinition* network,\n                                    std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int c1, int c2, bool 
shortcut, std::vector<int> k1, std::vector<int> k2, float e,\n                                    std::string lname, int g = 1) {\n    int c_ = (int)((float)c2 * e);\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c_, k1, 1, lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *conv1->getOutput(0), c2, k2, 1, lname + \".cv2\", g);\n\n    if (shortcut && c1 == c2) {\n        nvinfer1::IElementWiseLayer* ew =\n                network->addElementWise(input, *conv2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        ew->setName((lname + \".add\").c_str());\n        return ew;\n    }\n    return conv2;\n}\n\nstatic nvinfer1::ILayer* convBn(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int ch,\n                                int k, int s, std::string lname, int g = 1) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    int p = k / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n    conv->setNbGroups(g);\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n    return bn;\n}\n\nstatic nvinfer1::ILayer* Attention(nvinfer1::INetworkDefinition* network,\n                                   std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                   int dim, int num_heads, float attn_ratio, std::string lname) {\n    int head_dim = dim / num_heads;\n    int key_dim = head_dim * attn_ratio;\n    float scale = pow(key_dim, -0.5);\n    int nh_kd = key_dim * num_heads;\n    int h = dim + nh_kd * 2;\n\n    auto 
d = input.getDimensions();\n    int B = d.d[0];\n    int H = d.d[2];\n    int W = d.d[3];\n    int N = H * W;\n    auto* qkv = convBn(network, weightMap, input, h, 1, 1, lname + \".qkv\");\n    // qkv.view(B, self.num_heads, -1, N)\n    auto shuffle = network->addShuffle(*qkv->getOutput(0));\n    shuffle->setReshapeDimensions(nvinfer1::Dims4{B, num_heads, -1, N});\n    // q, k, v = .split([self.key_dim, self.key_dim, self.head_dim], dim=2)\n    auto d1 = shuffle->getOutput(0)->getDimensions();\n    auto q = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], key_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    auto k = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, key_dim, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], key_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    auto v = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, key_dim * 2, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], head_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    // attn = ((q.transpose(-2, -1) @ k) * self.scale)\n    auto qT = network->addShuffle(*q->getOutput(0));\n    qT->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n    auto matmul = network->addMatrixMultiply(*qT->getOutput(0), nvinfer1::MatrixOperation::kNONE, *k->getOutput(0),\n                                             nvinfer1::MatrixOperation::kNONE);\n    // NOTE: the scale/shift/power buffers below are never freed (a small, known leak). TensorRT reads them\n    // while building the engine, so they may only be released after the build. TODO: free them then.\n    float* scale_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    scale_val[0] = scale;\n    nvinfer1::Weights s_w{nvinfer1::DataType::kFLOAT, scale_val, 1};\n    float* shift_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    shift_val[0] = 0;\n    nvinfer1::Weights sh_w{nvinfer1::DataType::kFLOAT, shift_val, 1};\n    float* power_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    
power_val[0] = 1;\n    nvinfer1::Weights p_w{nvinfer1::DataType::kFLOAT, power_val, 1};\n    nvinfer1::IScaleLayer* scaleLayer =\n            network->addScale(*matmul->getOutput(0), nvinfer1::ScaleMode::kUNIFORM, sh_w, s_w, p_w);\n    // attn = attn.softmax(dim=-1)\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*scaleLayer->getOutput(0));\n    softmax->setAxes(1 << 3);\n    // x = (v @ attn.transpose(-2, -1)).view(B, -1, H, W) + self.pe(v.reshape(B, -1, H, W))\n    auto attnT = network->addShuffle(*softmax->getOutput(0));\n    attnT->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n    auto matmul2 = network->addMatrixMultiply(*v->getOutput(0), nvinfer1::MatrixOperation::kNONE, *attnT->getOutput(0),\n                                              nvinfer1::MatrixOperation::kNONE);\n    auto reshape = network->addShuffle(*matmul2->getOutput(0));\n    reshape->setReshapeDimensions(nvinfer1::Dims4{B, -1, H, W});\n    auto v_reshape = network->addShuffle(*v->getOutput(0));\n    v_reshape->setReshapeDimensions(nvinfer1::Dims4{B, -1, H, W});\n    // self.pe = Conv(dim, dim, 3, 1, g=dim, act=False)\n    auto pe = convBn(network, weightMap, *v_reshape->getOutput(0), dim, 3, 1, lname + \".pe\", dim);\n    auto sum = network->addElementWise(*reshape->getOutput(0), *pe->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    // x = self.proj(x)\n    // self.proj = Conv(dim, dim, 1, act=False)\n    auto proj = convBn(network, weightMap, *sum->getOutput(0), dim, 1, 1, lname + \".proj\");\n    return proj;\n}\n\nstatic nvinfer1::ILayer* PSABlock(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int dim,\n                                  float attn_ratio, int num_heads, bool shortcut, std::string lname) {\n\n    auto attn = Attention(network, weightMap, input, dim, num_heads, attn_ratio, lname + \".attn\");\n    nvinfer1::ILayer* shortcut_layer = nullptr;\n  
  if (shortcut) {\n        shortcut_layer = network->addElementWise(input, *attn->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    } else {\n        shortcut_layer = attn;\n    }\n    // self.ffn = nn.Sequential(Conv(c, c * 2, 1), Conv(c * 2, c, 1, act=False))\n    // x = x + self.ffn(x) if self.add else self.ffn(x)\n    auto ffn0 = convBnSiLU(network, weightMap, *shortcut_layer->getOutput(0), dim * 2, {1, 1}, 1, lname + \".ffn.0\");\n    auto ffn1 = convBn(network, weightMap, *ffn0->getOutput(0), dim, 1, 1, lname + \".ffn.1\");\n    if (shortcut) {\n        return network->addElementWise(*shortcut_layer->getOutput(0), *ffn1->getOutput(0),\n                                       nvinfer1::ElementWiseOperation::kSUM);\n    } else {\n        return ffn1;\n    }\n}\n\nstatic nvinfer1::ILayer* C3k(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int c1, int c2, int n, bool shortcut, std::vector<int> k1,\n                             std::vector<int> k2, float e, std::string lname) {\n    int c_ = (int)((float)c2 * e);\n    auto cv1 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv1\");\n    auto cv2 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv2\");\n    nvinfer1::ITensor* y1 = cv1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        auto b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, k1, k2, 1.0, lname + \".m.\" + std::to_string(i));\n        y1 = b->getOutput(0);\n    }\n\n    nvinfer1::ITensor* inputTensors[] = {y1, cv2->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 2);\n    cat->setName((lname + \".cat\").c_str());\n\n    auto cv3 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv3\");\n    return cv3;\n}\n\nnvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, 
nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int n, bool c3k, bool shortcut, bool attn, float e, std::string lname) {\n    int c_ = (int)((float)c2 * e);\n\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, 2 * c_, {1, 1}, 1, lname + \".cv1\");\n    nvinfer1::Dims d = conv1->getOutput(0)->getDimensions();\n\n    nvinfer1::ISliceLayer* split1 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n                              nvinfer1::Dims4{d.d[0], d.d[1] / 2, d.d[2], d.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    split1->setName((lname + \".split1\").c_str());\n    nvinfer1::ISliceLayer* split2 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, d.d[1] / 2, 0, 0},\n                              nvinfer1::Dims4{d.d[0], d.d[1] / 2, d.d[2], d.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    split2->setName((lname + \".split2\").c_str());\n    nvinfer1::ITensor* inputTensor0[] = {split1->getOutput(0), split2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor0, 2);\n    cat->setName((lname + \".cat0\").c_str());\n    nvinfer1::ITensor* y1 = split2->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        nvinfer1::ILayer* b = nullptr;\n        if (attn) {\n            b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, {3, 3}, {3, 3}, 0.5,\n                           lname + \".m.\" + std::to_string(i) + \".0\");\n\n            b = PSABlock(network, weightMap, *b->getOutput(0), c_, 0.5, max(1, c_ / 64), shortcut,\n                         lname + \".m.\" + std::to_string(i) + \".1\");\n\n        } else if (c3k) {\n            b = C3k(network, weightMap, *y1, c_, c_, 2, shortcut, {3, 3}, {3, 3}, 0.5,\n                    lname + \".m.\" + std::to_string(i));\n        } else {\n            b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, {3, 3}, {3, 3}, 0.5,\n                     
      lname + \".m.\" + std::to_string(i));\n        }\n        y1 = b->getOutput(0);\n\n        nvinfer1::ITensor* inputTensors[] = {cat->getOutput(0), b->getOutput(0)};\n        cat = network->addConcatenation(inputTensors, 2);\n        cat->setName((lname + \".cat\" + std::to_string(i + 1)).c_str());\n    }\n\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n\n    return conv2;\n}\n\nnvinfer1::IElementWiseLayer* SPPF(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int k, bool shortcut, std::string lname) {\n    int c_ = c1 / 2;\n    nvinfer1::ILayer* conv1 = conv(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv1\", 1, false);\n    nvinfer1::IPoolingLayer* pool1 =\n            network->addPoolingNd(*conv1->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool1->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool1->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::IPoolingLayer* pool2 =\n            network->addPoolingNd(*pool1->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool2->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::IPoolingLayer* pool3 =\n            network->addPoolingNd(*pool2->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool3->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool3->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::ITensor* inputTensors[] = {conv1->getOutput(0), pool1->getOutput(0), pool2->getOutput(0),\n                                         pool3->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensors, 4);\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, 
*cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n\n    if (shortcut && (c1 == c2)) {\n        nvinfer1::IElementWiseLayer* sum =\n                network->addElementWise(input, *conv2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        return sum;\n    } else {\n        return conv2;\n    }\n}\n\nnvinfer1::IElementWiseLayer* C2PSA(nvinfer1::INetworkDefinition* network,\n                                   std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input,\n                                   int c1, int c2, int n, float e, std::string lname) {\n    int c = c2 * e;\n\n    // cv1 branch\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, 2 * c, {1, 1}, 1, lname + \".cv1\");\n    nvinfer1::ITensor* cv1_out = conv1->getOutput(0);\n\n    // Split the output of cv1 into two tensors\n    nvinfer1::Dims dims = cv1_out->getDimensions();\n    nvinfer1::ISliceLayer* split1 = network->addSlice(*cv1_out, nvinfer1::Dims4{0, 0, 0, 0},\n                                                      nvinfer1::Dims4{dims.d[0], dims.d[1] / 2, dims.d[2], dims.d[3]},\n                                                      nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* split2 = network->addSlice(*cv1_out, nvinfer1::Dims4{0, dims.d[1] / 2, 0, 0},\n                                                      nvinfer1::Dims4{dims.d[0], dims.d[1] / 2, dims.d[2], dims.d[3]},\n                                                      nvinfer1::Dims4{1, 1, 1, 1});\n\n    // Run the second half of cv1 through n PSABlocks\n    nvinfer1::ITensor* y = split2->getOutput(0);\n    for (int i = 0; i < n; ++i) {\n        auto* bottleneck_layer =\n                PSABlock(network, weightMap, *y, c, 0.5, c / 64, true, lname + \".m.\" + std::to_string(i));\n        y = bottleneck_layer->getOutput(0);  // feed this block's output into the next one\n    }\n\n    // Concatenate the untouched first half of cv1 with the PSABlock output\n    nvinfer1::ITensor* concatInputs[2] = 
{split1->getOutput(0), y};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(concatInputs, 2);\n\n    // cv2 to produce the final output\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c1, {1, 1}, 1, lname + \".cv2\");\n\n    return conv2;\n}\n\nnvinfer1::ILayer* DWConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setNbGroups(ch);\n    // auto pad\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nnvinfer1::IPluginV2Layer* addYoloLayer(nvinfer1::INetworkDefinition* network, nvinfer1::ITensor& input,\n                                       const std::vector<int>& strides, const std::vector<int>& fm_sizes,\n                                       int stridesLength, bool is_detection, bool is_segmentation, bool is_pose,\n                                       bool is_obb, int anchorCount) {\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    const int 
netinfo_count = 8;\n    const int total_count = netinfo_count + stridesLength;\n    int class_num = kNumClass;\n    if (is_pose) {\n        class_num = kPoseNumClass;\n    }\n\n    if (is_obb) {\n        class_num = kObbNumClass;\n    }\n\n    std::vector<int> combinedInfo(total_count);\n    combinedInfo[0] = class_num;\n    combinedInfo[1] = kNumberOfPoints;\n    combinedInfo[2] = kMaxNumOutputBbox;\n    combinedInfo[3] = is_detection;\n    combinedInfo[4] = is_segmentation;\n    combinedInfo[5] = is_pose;\n    combinedInfo[6] = is_obb;\n    combinedInfo[7] = anchorCount;\n\n    nvinfer1::PluginField pluginField;\n    pluginField.name = \"combinedInfo\";\n    pluginField.data = combinedInfo.data();\n    pluginField.type = nvinfer1::PluginFieldType::kINT32;\n    pluginField.length = combinedInfo.size();\n\n    nvinfer1::PluginFieldCollection pluginFieldCollection;\n    pluginFieldCollection.nbFields = 1;\n    pluginFieldCollection.fields = &pluginField;\n\n    nvinfer1::IPluginV2* pluginObject = creator->createPlugin(\"yololayer\", &pluginFieldCollection);\n\n    // Use the single input tensor instead of multiple detection heads\n    nvinfer1::ITensor* inputTensors[] = {&input};\n    nvinfer1::IPluginV2Layer* yololayer = network->addPluginV2(inputTensors, 1, *pluginObject);\n    return yololayer;\n}"
  },
  {
    "path": "yolo26/src/model.cpp",
    "content": "#include <math.h>\n#include <iostream>\n\n#include \"block.h\"\n// #include \"calibrator.h\"\n#include \"config.h\"\n#include \"model.h\"\n\nstatic int get_width(int x, float gw, int max_channels, int divisor = 8) {\n    auto channel = std::min(x, max_channels);\n    channel = int(ceil((channel * gw) / divisor)) * divisor;\n    return channel;\n}\n\nstatic int get_depth(int x, float gd) {\n    if (x == 1)\n        return 1;\n    int r = round(x * gd);\n    if (x * gd - int(x * gd) == 0.5 && (int(x * gd) % 2) == 0)\n        --r;\n    return std::max<int>(r, 1);\n}\n\nvoid calculateStrides(nvinfer1::IElementWiseLayer* conv_layers[], int size, int reference_size, int strides[]) {\n    for (int i = 0; i < size; ++i) {\n        nvinfer1::ILayer* layer = conv_layers[i];\n        nvinfer1::Dims dims = layer->getOutput(0)->getDimensions();\n        int feature_map_size = dims.d[2];\n        strides[i] = reference_size / feature_map_size;\n    }\n}\n\nnvinfer1::IHostMemory* buildEngineYolo26Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type)\n\n{\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n     ******************************************  YOLO26 INPUT  **********************************************\n     *******************************************************************************************************/\n\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, 
kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLO26 BACKBONE  ********************************************\n    *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* block0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n\n    nvinfer1::IElementWiseLayer* block1 = convBnSiLU(network, weightMap, *block0->getOutput(0),\n                                                     get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\");\n\n    bool c3k = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k = true;\n    }\n\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *block1->getOutput(0), get_width(128, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, false, 0.25, \"model.2\");\n\n    nvinfer1::IElementWiseLayer* block3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                     get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\");\n\n    nvinfer1::IElementWiseLayer* block4 =\n            C3K2(network, weightMap, *block3->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, false, 0.25, \"model.4\");\n\n    nvinfer1::IElementWiseLayer* block5 = convBnSiLU(network, weightMap, *block4->getOutput(0),\n                                                     get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n\n    nvinfer1::IElementWiseLayer* block6 =\n            C3K2(network, weightMap, *block5->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, 
\"model.6\");\n\n    nvinfer1::IElementWiseLayer* block7 = convBnSiLU(network, weightMap, *block6->getOutput(0),\n                                                     get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n\n    nvinfer1::IElementWiseLayer* block8 =\n            C3K2(network, weightMap, *block7->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.8\");\n\n    nvinfer1::IElementWiseLayer* block9 =\n            SPPF(network, weightMap, *block8->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, true, \"model.9\");\n\n    nvinfer1::IElementWiseLayer* block10 =\n            C2PSA(network, weightMap, *block9->getOutput(0), get_width(1024, gw, max_channels),\n                  get_width(1024, gw, max_channels), get_depth(2, gd), 0.5, \"model.10\");\n    /*******************************************************************************************************\n    *********************************************  YOLO26 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*block10->getOutput(0));\n    assert(upsample11);\n\n    upsample11->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n    nvinfer1::ITensor* inputTensors12[] = {upsample11->getOutput(0), block6->getOutput(0)};\n\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensors12, 2);\n\n    nvinfer1::IElementWiseLayer* block13 =\n            C3K2(network, weightMap, *cat12->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 = 
network->addResize(*block13->getOutput(0));\n    assert(upsample14);\n\n    upsample14->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensors15[] = {upsample14->getOutput(0), block4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensors15, 2);\n\n    nvinfer1::IElementWiseLayer* block16 =\n            C3K2(network, weightMap, *cat15->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.16\");\n\n    nvinfer1::IElementWiseLayer* block17 = convBnSiLU(network, weightMap, *block16->getOutput(0),\n                                                      get_width(256, gw, max_channels), {3, 3}, 2, \"model.17\");\n\n    nvinfer1::ITensor* inputTensors18[] = {block17->getOutput(0), block13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensors18, 2);\n\n    nvinfer1::IElementWiseLayer* block19 =\n            C3K2(network, weightMap, *cat18->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.19\");\n\n    nvinfer1::IElementWiseLayer* block20 = convBnSiLU(network, weightMap, *block19->getOutput(0),\n                                                      get_width(512, gw, max_channels), {3, 3}, 2, \"model.20\");\n\n    nvinfer1::ITensor* inputTensors21[] = {block20->getOutput(0), block10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensors21, 2);\n\n    nvinfer1::IElementWiseLayer* block22 =\n            C3K2(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 1, true, true, true, 0.5,\n                 \"model.22\");  // WARN: get_depth(2, gd) changed to 1.\n\n    
/*******************************************************************************************************\n    *********************************************  YOLO26 OUTPUT  ********************************************\n    *******************************************************************************************************/\n\n    int c2 = std::max(std::max(16, get_width(256, gw, max_channels)), 16 * 4);\n    int c3 = std::max(get_width(256, gw, max_channels), std::min(kNumClass, 100));\n\n    /////////////////////////////////////////////////////\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_0_0_0 =\n            convBnSiLU(network, weightMap, *block16->getOutput(0), c2, {3, 3}, 1, \"model.23.one2one_cv3.0.0.0\", c2);\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_0_0_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_0_0_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.0.0.1\", 1);\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_0_1_0 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_0_0_1->getOutput(0), c3, {3, 3}, 1,\n                       \"model.23.one2one_cv3.0.1.0\", c3);\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_0_1_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_0_1_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.0.1.1\", 1);\n\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv3_0_2 = network->addConvolutionNd(\n            *conv23_one2one_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv3.0.2.weight\"], weightMap[\"model.23.one2one_cv3.0.2.bias\"]);\n    conv23_one2one_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv3_0_2->setNbGroups(1);\n\n    nvinfer1::IShuffleLayer* reshape23_3 = network->addShuffle(*conv23_one2one_cv3_0_2->getOutput(0));\n    
reshape23_3->setReshapeDimensions(nvinfer1::Dims3{1, kNumClass, -1});\n\n    /////////////////////////////////////////////////////\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_1_0_0 = convBnSiLU(\n            network, weightMap, *block19->getOutput(0), c2 * 2, {3, 3}, 1, \"model.23.one2one_cv3.1.0.0\", c2 * 2);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_1_0_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_1_0_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.1.0.1\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_1_1_0 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_1_0_1->getOutput(0), c3, {3, 3}, 1,\n                       \"model.23.one2one_cv3.1.1.0\", c3);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_1_1_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_1_1_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.1.1.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv3_1_2 = network->addConvolutionNd(\n            *conv23_one2one_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv3.1.2.weight\"], weightMap[\"model.23.one2one_cv3.1.2.bias\"]);\n    conv23_one2one_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv3_1_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_4 = network->addShuffle(*conv23_one2one_cv3_1_2->getOutput(0));\n    reshape23_4->setReshapeDimensions(nvinfer1::Dims3{1, kNumClass, -1});\n\n    /////////////////////////////////////////////////////\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_2_0_0;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        conv23_one2one_cv3_2_0_0 = convBnSiLU(network, weightMap, *block22->getOutput(0), c2 * 2, {3, 3}, 1,\n                                              \"model.23.one2one_cv3.2.0.0\", c2 
* 2);\n    } else {\n        conv23_one2one_cv3_2_0_0 = convBnSiLU(network, weightMap, *block22->getOutput(0), c2 * 4, {3, 3}, 1,\n                                              \"model.23.one2one_cv3.2.0.0\", c2 * 4);\n    }\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_2_0_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.2.0.1\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_2_1_0 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_2_0_1->getOutput(0), c3, {3, 3}, 1,\n                       \"model.23.one2one_cv3.2.1.0\", c3);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_2_1_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.2.1.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv3_2_2 = network->addConvolutionNd(\n            *conv23_one2one_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv3.2.2.weight\"], weightMap[\"model.23.one2one_cv3.2.2.bias\"]);\n    conv23_one2one_cv3_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv3_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv3_2_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_5 = network->addShuffle(*conv23_one2one_cv3_2_2->getOutput(0));\n    reshape23_5->setReshapeDimensions(nvinfer1::Dims3{1, kNumClass, -1});\n\n    /////////////////////////////////////////////////////\n\n    nvinfer1::ITensor* inputTensors23_1[] = {reshape23_3->getOutput(0), reshape23_4->getOutput(0),\n                                             reshape23_5->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(inputTensors23_1, 3);\n    cat23_1->setAxis(2);\n    nvinfer1::IActivationLayer* sigmoid23 = network->addActivation(\n            *cat23_1->getOutput(0),\n     
       nvinfer1::ActivationType::kSIGMOID);  // TODO: THIS IS UNNECESSARY, REMOVE AFTER PLUGIN IS READY\n\n    /////////////////////////////////////////////////////\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_0_0 =\n            convBnSiLU(network, weightMap, *block16->getOutput(0), c2 / 4, {3, 3}, 1, \"model.23.one2one_cv2.0.0\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv2_0_0->getOutput(0), c2 / 4, {3, 3}, 1,\n                       \"model.23.one2one_cv2.0.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv2_0_2 = network->addConvolutionNd(\n            *conv23_one2one_cv2_0_1->getOutput(0), 4, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv2.0.2.weight\"], weightMap[\"model.23.one2one_cv2.0.2.bias\"]);\n    conv23_one2one_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv2_0_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23 = network->addShuffle(*conv23_one2one_cv2_0_2->getOutput(0));\n    reshape23->setReshapeDimensions(nvinfer1::Dims3{1, 4, -1});\n\n    /////////////////////////////////////////////////////\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_1_0 =\n            convBnSiLU(network, weightMap, *block19->getOutput(0), c2 / 4, {3, 3}, 1, \"model.23.one2one_cv2.1.0\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv2_1_0->getOutput(0), c2 / 4, {3, 3}, 1,\n                       \"model.23.one2one_cv2.1.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv2_1_2 = network->addConvolutionNd(\n            *conv23_one2one_cv2_1_1->getOutput(0), 4, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv2.1.2.weight\"], weightMap[\"model.23.one2one_cv2.1.2.bias\"]);\n    conv23_one2one_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    
conv23_one2one_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv2_1_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_1 = network->addShuffle(*conv23_one2one_cv2_1_2->getOutput(0));\n    reshape23_1->setReshapeDimensions(nvinfer1::Dims3{1, 4, -1});\n\n    /////////////////////////////////////////////////////\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_2_0 =\n            convBnSiLU(network, weightMap, *block22->getOutput(0), c2 / 4, {3, 3}, 1, \"model.23.one2one_cv2.2.0\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv2_2_0->getOutput(0), c2 / 4, {3, 3}, 1,\n                       \"model.23.one2one_cv2.2.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv2_2_2 = network->addConvolutionNd(\n            *conv23_one2one_cv2_2_1->getOutput(0), 4, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv2.2.2.weight\"], weightMap[\"model.23.one2one_cv2.2.2.bias\"]);\n    conv23_one2one_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv2_2_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_2 = network->addShuffle(*conv23_one2one_cv2_2_2->getOutput(0));\n    reshape23_2->setReshapeDimensions(nvinfer1::Dims3{1, 4, -1});\n\n    /////////////////////////////////////////////////////\n\n    nvinfer1::ITensor* inputTensors23[] = {reshape23->getOutput(0), reshape23_1->getOutput(0),\n                                           reshape23_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23 = network->addConcatenation(inputTensors23, 3);\n    cat23->setAxis(2);\n\n    /////////////////////////////////////////////////////\n\n    nvinfer1::ISliceLayer* slice23_1 = network->addSlice(\n            *cat23->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{cat23->getOutput(0)->getDimensions().d[0], 
cat23->getOutput(0)->getDimensions().d[1] / 2,\n                            cat23->getOutput(0)->getDimensions().d[2]},\n            nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* slice23 = network->addSlice(\n            *cat23->getOutput(0), nvinfer1::Dims3{0, cat23->getOutput(0)->getDimensions().d[1] / 2, 0},\n            nvinfer1::Dims3{cat23->getOutput(0)->getDimensions().d[0], cat23->getOutput(0)->getDimensions().d[1] / 2,\n                            cat23->getOutput(0)->getDimensions().d[2]},\n            nvinfer1::Dims3{1, 1, 1});\n\n    // TODO: MAKE HARDCODED TO AUTOMATIC\n    const int anchor_num = cat23->getOutput(0)->getDimensions().d[2];\n\n    std::vector<int> fm_sizes;\n    int fm_h_0 = block16->getOutput(0)->getDimensions().d[2];  // P3\n    int fm_h_1 = block19->getOutput(0)->getDimensions().d[2];  // P4\n    int fm_h_2 = block22->getOutput(0)->getDimensions().d[2];  // P5\n\n    fm_sizes.push_back(fm_h_0);\n    fm_sizes.push_back(fm_h_1);\n    fm_sizes.push_back(fm_h_2);\n\n    std::vector<int> strides = {kInputH / fm_h_0, kInputH / fm_h_1, kInputH / fm_h_2};\n    std::vector<float> grid(anchor_num * 2);\n    std::vector<float> stride_vec(anchor_num);\n    std::fill(stride_vec.begin(), stride_vec.begin() + fm_sizes[0] * fm_sizes[0], strides[0]);\n    std::fill(stride_vec.begin() + fm_sizes[0] * fm_sizes[0],\n              stride_vec.begin() + fm_sizes[0] * fm_sizes[0] + fm_sizes[1] * fm_sizes[1], strides[1]);\n    std::fill(stride_vec.begin() + fm_sizes[0] * fm_sizes[0] + fm_sizes[1] * fm_sizes[1], stride_vec.end(), strides[2]);\n\n    int idx = 0;\n    for (int s = 0; s < fm_sizes.size(); ++s) {\n        int h = fm_sizes[s];\n        int w = fm_sizes[s];\n\n        for (int y = 0; y < h; ++y) {\n            for (int x = 0; x < w; ++x) {\n                grid[idx] = x + 0.5f;\n                grid[idx + anchor_num] = y + 0.5f;\n\n                idx++;\n            }\n        }\n    }\n\n    nvinfer1::Dims gridDims;\n    gridDims.nbDims 
= 3;\n    gridDims.d[0] = 1;\n    gridDims.d[1] = 2;\n    gridDims.d[2] = anchor_num;\n\n    nvinfer1::IConstantLayer* constant_grid = network->addConstant(\n            gridDims, nvinfer1::Weights{nvinfer1::DataType::kFLOAT, grid.data(), (int64_t)grid.size()});\n\n    nvinfer1::IElementWiseLayer* conv23_add_1 = network->addElementWise(\n            *constant_grid->getOutput(0), *slice23->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n\n    nvinfer1::IElementWiseLayer* conv23_sub_1 = network->addElementWise(\n            *constant_grid->getOutput(0), *slice23_1->getOutput(0), nvinfer1::ElementWiseOperation::kSUB);\n\n    nvinfer1::ITensor* tensor23[] = {conv23_sub_1->getOutput(0), conv23_add_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(tensor23, 2);\n    cat23_2->setAxis(1);\n\n    nvinfer1::IConstantLayer* constant_stride = network->addConstant(\n            nvinfer1::Dims3{1, 1, anchor_num},\n            nvinfer1::Weights{nvinfer1::DataType::kFLOAT, stride_vec.data(), (int64_t)stride_vec.size()});\n\n    nvinfer1::IElementWiseLayer* mul23_2 = network->addElementWise(\n            *cat23_2->getOutput(0), *constant_stride->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n\n    ///////////////////////////////////////////////////////////\n\n    nvinfer1::IConcatenationLayer* cat23_3 = network->addConcatenation(\n            std::array<nvinfer1::ITensor*, 2>{mul23_2->getOutput(0), sigmoid23->getOutput(0)}.data(), 2);\n    cat23_3->setAxis(1);\n\n    nvinfer1::IShuffleLayer* transpose = network->addShuffle(*cat23_3->getOutput(0));\n    transpose->setFirstTranspose(nvinfer1::Permutation{0, 2, 1});\n    // transpose->setReshapeDimensions(nvinfer1::Dims3{1, anchor_num, kNumClass + 4});\n\n    ///////////////////////////////////////////////////////////\n\n    int stridesLength = strides.size();\n    nvinfer1::IPluginV2Layer* yolo = addYoloLayer(network, *transpose->getOutput(0), strides, fm_sizes, stridesLength,\n   
                                               true, false, false, false, anchor_num);\n    assert(yolo);\n\n    ///////////////////////////////////////////////////////////\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    // Use setMemoryPoolLimit instead of deprecated setMaxWorkspaceSize\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cerr << \"INT8 not supported for YOLO26 model yet.\" << std::endl;\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolo26Obb(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type)\n\n{\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n     ******************************************  YOLO26-Obb INPUT  **********************************************\n     *******************************************************************************************************/\n\n    nvinfer1::ITensor* data =\n            
network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kObbInputH, kObbInputW});\n    assert(data);\n\n    nvinfer1::IElementWiseLayer* block0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n\n    nvinfer1::IElementWiseLayer* block1 = convBnSiLU(network, weightMap, *block0->getOutput(0),\n                                                     get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\");\n\n    bool c3k = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k = true;\n    }\n\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *block1->getOutput(0), get_width(128, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, false, 0.25, \"model.2\");\n\n    nvinfer1::IElementWiseLayer* block3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                     get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\");\n\n    nvinfer1::IElementWiseLayer* block4 =\n            C3K2(network, weightMap, *block3->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, false, 0.25, \"model.4\");\n\n    nvinfer1::IElementWiseLayer* block5 = convBnSiLU(network, weightMap, *block4->getOutput(0),\n                                                     get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n\n    nvinfer1::IElementWiseLayer* block6 =\n            C3K2(network, weightMap, *block5->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.6\");\n\n    nvinfer1::IElementWiseLayer* block7 = convBnSiLU(network, weightMap, *block6->getOutput(0),\n                                                     get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n\n    
nvinfer1::IElementWiseLayer* block8 =\n            C3K2(network, weightMap, *block7->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.8\");\n\n    nvinfer1::IElementWiseLayer* block9 = SPPF(network, weightMap, *block8->getOutput(0),\n                                               get_width(1024, gw, max_channels), get_width(1024, gw, max_channels), 5,\n                                               true, \"model.9\");  // TODO: VERIFY THIS BLOCK FOR OTHER YOLO26 MODELS\n\n    nvinfer1::IElementWiseLayer* block10 =\n            C2PSA(network, weightMap, *block9->getOutput(0), get_width(1024, gw, max_channels),\n                  get_width(1024, gw, max_channels), get_depth(2, gd), 0.5, \"model.10\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLO26-Obb HEAD  ********************************************\n    *******************************************************************************************************/\n\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*block10->getOutput(0));\n    assert(upsample11);\n\n    upsample11->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n    nvinfer1::ITensor* inputTensors12[] = {upsample11->getOutput(0), block6->getOutput(0)};\n\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensors12, 2);\n\n    nvinfer1::IElementWiseLayer* block13 =\n            C3K2(network, weightMap, *cat12->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 = network->addResize(*block13->getOutput(0));\n    assert(upsample14);\n\n    
upsample14->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensors15[] = {upsample14->getOutput(0), block4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensors15, 2);\n\n    nvinfer1::IElementWiseLayer* block16 =\n            C3K2(network, weightMap, *cat15->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.16\");\n\n    nvinfer1::IElementWiseLayer* block17 = convBnSiLU(network, weightMap, *block16->getOutput(0),\n                                                      get_width(256, gw, max_channels), {3, 3}, 2, \"model.17\");\n\n    nvinfer1::ITensor* inputTensors18[] = {block17->getOutput(0), block13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensors18, 2);\n\n    nvinfer1::IElementWiseLayer* block19 =\n            C3K2(network, weightMap, *cat18->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.19\");\n\n    nvinfer1::IElementWiseLayer* block20 = convBnSiLU(network, weightMap, *block19->getOutput(0),\n                                                      get_width(512, gw, max_channels), {3, 3}, 2, \"model.20\");\n\n    nvinfer1::ITensor* inputTensors21[] = {block20->getOutput(0), block10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensors21, 2);\n\n    nvinfer1::IElementWiseLayer* block22 =\n            C3K2(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 1, true, true, true, 0.5,\n                 \"model.22\");  // WARN: get_depth(2, gd) changed to 1.\n\n    /*******************************************************************************************************\n    
*********************************************  YOLO26-Obb OUTPUT  ********************************************\n    *******************************************************************************************************/\n\n    int c2 = std::max(std::max(16, get_width(256, gw, max_channels)), 16 * 4);\n    int c3 = std::max(get_width(256, gw, max_channels), std::min(kObbNumClass, 100));\n\n    //cv.2.*.*\n    /////////////////////////////////////////////////////\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_0_0 =\n            convBnSiLU(network, weightMap, *block16->getOutput(0), c2 / 4, {3, 3}, 1, \"model.23.one2one_cv2.0.0\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv2_0_0->getOutput(0), c2 / 4, {3, 3}, 1,\n                       \"model.23.one2one_cv2.0.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv2_0_2 = network->addConvolutionNd(\n            *conv23_one2one_cv2_0_1->getOutput(0), 4, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv2.0.2.weight\"], weightMap[\"model.23.one2one_cv2.0.2.bias\"]);\n    conv23_one2one_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv2_0_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23 = network->addShuffle(*conv23_one2one_cv2_0_2->getOutput(0));\n    reshape23->setReshapeDimensions(nvinfer1::Dims3{1, 4, -1});\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_1_0 =\n            convBnSiLU(network, weightMap, *block19->getOutput(0), c2 / 4, {3, 3}, 1, \"model.23.one2one_cv2.1.0\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv2_1_0->getOutput(0), c2 / 4, {3, 3}, 1,\n                       \"model.23.one2one_cv2.1.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv2_1_2 = network->addConvolutionNd(\n            
*conv23_one2one_cv2_1_1->getOutput(0), 4, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv2.1.2.weight\"], weightMap[\"model.23.one2one_cv2.1.2.bias\"]);\n    conv23_one2one_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv2_1_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_1 = network->addShuffle(*conv23_one2one_cv2_1_2->getOutput(0));\n    reshape23_1->setReshapeDimensions(nvinfer1::Dims3{1, 4, -1});\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_2_0 =\n            convBnSiLU(network, weightMap, *block22->getOutput(0), c2 / 4, {3, 3}, 1, \"model.23.one2one_cv2.2.0\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv2_2_0->getOutput(0), c2 / 4, {3, 3}, 1,\n                       \"model.23.one2one_cv2.2.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv2_2_2 = network->addConvolutionNd(\n            *conv23_one2one_cv2_2_1->getOutput(0), 4, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv2.2.2.weight\"], weightMap[\"model.23.one2one_cv2.2.2.bias\"]);\n    conv23_one2one_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv2_2_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_2 = network->addShuffle(*conv23_one2one_cv2_2_2->getOutput(0));\n    reshape23_2->setReshapeDimensions(nvinfer1::Dims3{1, 4, -1});\n\n    nvinfer1::ITensor* inputTensors23[] = {reshape23->getOutput(0), reshape23_1->getOutput(0),\n                                           reshape23_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23 = network->addConcatenation(inputTensors23, 3);\n    cat23->setAxis(2);\n\n    //cv.4.*.*\n    /////////////////////////////////////////////////////\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv4_0_0 =\n            convBnSiLU(network, 
weightMap, *block16->getOutput(0), c2 / 4, {3, 3}, 1, \"model.23.one2one_cv4.0.0\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv4_0_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv4_0_0->getOutput(0), c2 / 4, {3, 3}, 1,\n                       \"model.23.one2one_cv4.0.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv4_0_2 = network->addConvolutionNd(\n            *conv23_one2one_cv4_0_1->getOutput(0), 1, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv4.0.2.weight\"], weightMap[\"model.23.one2one_cv4.0.2.bias\"]);\n    conv23_one2one_cv4_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv4_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv4_0_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_6 = network->addShuffle(*conv23_one2one_cv4_0_2->getOutput(0));\n    reshape23_6->setReshapeDimensions(nvinfer1::Dims3{1, 1, -1});\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv4_1_0 =\n            convBnSiLU(network, weightMap, *block19->getOutput(0), c2 / 4, {3, 3}, 1, \"model.23.one2one_cv4.1.0\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv4_1_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv4_1_0->getOutput(0), c2 / 4, {3, 3}, 1,\n                       \"model.23.one2one_cv4.1.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv4_1_2 = network->addConvolutionNd(\n            *conv23_one2one_cv4_1_1->getOutput(0), 1, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv4.1.2.weight\"], weightMap[\"model.23.one2one_cv4.1.2.bias\"]);\n    conv23_one2one_cv4_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv4_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv4_1_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_7 = network->addShuffle(*conv23_one2one_cv4_1_2->getOutput(0));\n    reshape23_7->setReshapeDimensions(nvinfer1::Dims3{1, 1, -1});\n\n    nvinfer1::IElementWiseLayer* 
conv23_one2one_cv4_2_0 =\n            convBnSiLU(network, weightMap, *block22->getOutput(0), c2 / 4, {3, 3}, 1, \"model.23.one2one_cv4.2.0\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv4_2_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv4_2_0->getOutput(0), c2 / 4, {3, 3}, 1,\n                       \"model.23.one2one_cv4.2.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv4_2_2 = network->addConvolutionNd(\n            *conv23_one2one_cv4_2_1->getOutput(0), 1, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv4.2.2.weight\"], weightMap[\"model.23.one2one_cv4.2.2.bias\"]);\n    conv23_one2one_cv4_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv4_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv4_2_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_8 = network->addShuffle(*conv23_one2one_cv4_2_2->getOutput(0));\n    reshape23_8->setReshapeDimensions(nvinfer1::Dims3{1, 1, -1});\n\n    nvinfer1::ITensor* inputTensors23_2[] = {reshape23_6->getOutput(0), reshape23_7->getOutput(0),\n                                             reshape23_8->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(inputTensors23_2, 3);\n    cat23_2->setAxis(2);\n\n    /////////////////////////////////////////////////////\n    nvinfer1::ISliceLayer* split23__0 = network->addSlice(\n            *cat23->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{cat23->getOutput(0)->getDimensions().d[0], cat23->getOutput(0)->getDimensions().d[1] / 2,\n                            cat23->getOutput(0)->getDimensions().d[2]},\n            nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23__1 = network->addSlice(\n            *cat23->getOutput(0), nvinfer1::Dims3{0, cat23->getOutput(0)->getDimensions().d[1] / 2, 0},\n            nvinfer1::Dims3{cat23->getOutput(0)->getDimensions().d[0], cat23->getOutput(0)->getDimensions().d[1] / 2,\n                  
          cat23->getOutput(0)->getDimensions().d[2]},\n            nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IElementWiseLayer* sub23 = network->addElementWise(*split23__1->getOutput(0), *split23__0->getOutput(0),\n                                                                 nvinfer1::ElementWiseOperation::kSUB);\n\n    // Divide by 2\n    static float two = 2.0f;\n    nvinfer1::Weights two_weights{nvinfer1::DataType::kFLOAT, &two, 1};\n    nvinfer1::IConstantLayer* const_two = network->addConstant(nvinfer1::Dims3{1, 1, 1}, two_weights);\n    nvinfer1::IElementWiseLayer* div23 = network->addElementWise(*sub23->getOutput(0), *const_two->getOutput(0),\n                                                                 nvinfer1::ElementWiseOperation::kDIV);\n\n    nvinfer1::ISliceLayer* split23_1__0 = network->addSlice(\n            *div23->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{div23->getOutput(0)->getDimensions().d[0], div23->getOutput(0)->getDimensions().d[1] / 2,\n                            div23->getOutput(0)->getDimensions().d[2]},\n            nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::ISliceLayer* split23_1__1 = network->addSlice(\n            *div23->getOutput(0), nvinfer1::Dims3{0, div23->getOutput(0)->getDimensions().d[1] / 2, 0},\n            nvinfer1::Dims3{div23->getOutput(0)->getDimensions().d[0], div23->getOutput(0)->getDimensions().d[1] / 2,\n                            div23->getOutput(0)->getDimensions().d[2]},\n            nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IUnaryLayer* cos23 = network->addUnary(*cat23_2->getOutput(0), nvinfer1::UnaryOperation::kCOS);\n    nvinfer1::IUnaryLayer* sin23 = network->addUnary(*cat23_2->getOutput(0), nvinfer1::UnaryOperation::kSIN);\n\n    nvinfer1::IElementWiseLayer* mul23 = network->addElementWise(*split23_1__0->getOutput(0), *cos23->getOutput(0),\n                                                                 nvinfer1::ElementWiseOperation::kPROD);\n    
nvinfer1::IElementWiseLayer* mul23_1 = network->addElementWise(*split23_1__1->getOutput(0), *sin23->getOutput(0),\n                                                                   nvinfer1::ElementWiseOperation::kPROD);\n    nvinfer1::IElementWiseLayer* sub23_1 =\n            network->addElementWise(*mul23->getOutput(0), *mul23_1->getOutput(0), nvinfer1::ElementWiseOperation::kSUB);\n\n    nvinfer1::IElementWiseLayer* mul23_2 = network->addElementWise(*split23_1__0->getOutput(0), *sin23->getOutput(0),\n                                                                   nvinfer1::ElementWiseOperation::kPROD);\n    nvinfer1::IElementWiseLayer* mul23_3 = network->addElementWise(*split23_1__1->getOutput(0), *cos23->getOutput(0),\n                                                                   nvinfer1::ElementWiseOperation::kPROD);\n    nvinfer1::IElementWiseLayer* add23 = network->addElementWise(*mul23_2->getOutput(0), *mul23_3->getOutput(0),\n                                                                 nvinfer1::ElementWiseOperation::kSUM);\n\n    nvinfer1::ITensor* tensor23[] = {sub23_1->getOutput(0), add23->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_3 = network->addConcatenation(tensor23, 2);\n    cat23_3->setAxis(1);\n\n    std::vector<int> fm_sizes;\n    int fm_h_0 = block16->getOutput(0)->getDimensions().d[2];  // P3\n    int fm_h_1 = block19->getOutput(0)->getDimensions().d[2];  // P4\n    int fm_h_2 = block22->getOutput(0)->getDimensions().d[2];  // P5\n\n    fm_sizes.push_back(fm_h_0);\n    fm_sizes.push_back(fm_h_1);\n    fm_sizes.push_back(fm_h_2);\n\n    int grid_num = fm_h_0 * fm_h_0 + fm_h_1 * fm_h_1 + fm_h_2 * fm_h_2;\n\n    assert((kObbInputH % fm_h_0) == 0 && (kObbInputH % fm_h_1) == 0 && (kObbInputH % fm_h_2) == 0);\n    assert((fm_h_0 == block16->getOutput(0)->getDimensions().d[3]) &&\n           (fm_h_1 == block19->getOutput(0)->getDimensions().d[3]) &&\n           (fm_h_2 == block22->getOutput(0)->getDimensions().d[3]));  // 
verify fm_w == fm_h\n\n    assert(cat23_3->getOutput(0)->getDimensions().d[2] == grid_num);\n\n    int idx = 0;\n    std::vector<float> grid(grid_num * 2);\n    auto fill_grid = [&](int fm_h) {\n        for (int y = 0; y < fm_h; ++y) {\n            for (int x = 0; x < fm_h; ++x) {\n                grid[idx] = x + 0.5f;\n                grid[idx + grid_num] = y + 0.5f;\n                idx++;\n            }\n        }\n    };\n    fill_grid(fm_h_0);\n    fill_grid(fm_h_1);\n    fill_grid(fm_h_2);\n\n    std::vector<float> stride_vec(grid_num);\n    idx = 0;\n    auto fill_stride = [&](int fm_h, int fm_w, int stride) {\n        for (int y = 0; y < fm_h; ++y) {\n            for (int x = 0; x < fm_w; ++x) {\n                stride_vec[idx] = static_cast<float>(stride);\n                idx++;\n            }\n        }\n    };\n\n    std::vector<int> strides = {kObbInputH / fm_h_0, kObbInputH / fm_h_1, kObbInputH / fm_h_2};\n    fill_stride(fm_h_0, fm_h_0, strides[0]);\n    fill_stride(fm_h_1, fm_h_1, strides[1]);\n    fill_stride(fm_h_2, fm_h_2, strides[2]);\n\n    nvinfer1::Dims gridDims{3, {1, 2, grid_num}};\n    nvinfer1::IConstantLayer* constant_grid = network->addConstant(\n            gridDims, nvinfer1::Weights{nvinfer1::DataType::kFLOAT, grid.data(), (int64_t)grid.size()});\n\n    nvinfer1::Dims strideDims{3, {1, 1, grid_num}};\n    nvinfer1::IConstantLayer* constant_stride = network->addConstant(\n            strideDims, nvinfer1::Weights{nvinfer1::DataType::kFLOAT, stride_vec.data(), (int64_t)stride_vec.size()});\n\n    nvinfer1::IElementWiseLayer* add23_1 = network->addElementWise(*cat23_3->getOutput(0), *constant_grid->getOutput(0),\n                                                                   nvinfer1::ElementWiseOperation::kSUM);\n\n    nvinfer1::IElementWiseLayer* add23_2 = network->addElementWise(*split23__0->getOutput(0), *split23__1->getOutput(0),\n                                                                   
nvinfer1::ElementWiseOperation::kSUM);\n\n    nvinfer1::ITensor* tensor23_4[] = {add23_1->getOutput(0), add23_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_4 = network->addConcatenation(tensor23_4, 2);\n    cat23_4->setAxis(1);\n\n    nvinfer1::IElementWiseLayer* mul23_4 = network->addElementWise(\n            *cat23_4->getOutput(0), *constant_stride->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n\n    /////////////////////////////////////////////////////\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_0_0_0 =\n            convBnSiLU(network, weightMap, *block16->getOutput(0), c2, {3, 3}, 1, \"model.23.one2one_cv3.0.0.0\", c2);\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_0_0_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_0_0_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.0.0.1\", 1);\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_0_1_0 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_0_0_1->getOutput(0), c3, {3, 3}, 1,\n                       \"model.23.one2one_cv3.0.1.0\", c3);\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_0_1_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_0_1_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.0.1.1\", 1);\n\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv3_0_2 = network->addConvolutionNd(\n            *conv23_one2one_cv3_0_1_1->getOutput(0), kObbNumClass, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv3.0.2.weight\"], weightMap[\"model.23.one2one_cv3.0.2.bias\"]);\n    conv23_one2one_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv3_0_2->setNbGroups(1);\n\n    nvinfer1::IShuffleLayer* reshape23_3 = network->addShuffle(*conv23_one2one_cv3_0_2->getOutput(0));\n    reshape23_3->setReshapeDimensions(nvinfer1::Dims3{1, kObbNumClass, -1});\n\n    
nvinfer1::IElementWiseLayer* conv23_one2one_cv3_1_0_0 = convBnSiLU(\n            network, weightMap, *block19->getOutput(0), c2 * 2, {3, 3}, 1, \"model.23.one2one_cv3.1.0.0\", c2 * 2);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_1_0_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_1_0_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.1.0.1\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_1_1_0 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_1_0_1->getOutput(0), c3, {3, 3}, 1,\n                       \"model.23.one2one_cv3.1.1.0\", c3);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_1_1_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_1_1_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.1.1.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv3_1_2 = network->addConvolutionNd(\n            *conv23_one2one_cv3_1_1_1->getOutput(0), kObbNumClass, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv3.1.2.weight\"], weightMap[\"model.23.one2one_cv3.1.2.bias\"]);\n    conv23_one2one_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv3_1_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_4 = network->addShuffle(*conv23_one2one_cv3_1_2->getOutput(0));\n    reshape23_4->setReshapeDimensions(nvinfer1::Dims3{1, kObbNumClass, -1});\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_2_0_0;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        conv23_one2one_cv3_2_0_0 = convBnSiLU(network, weightMap, *block22->getOutput(0), c2 * 2, {3, 3}, 1,\n                                              \"model.23.one2one_cv3.2.0.0\", c2 * 2);\n    } else {\n        conv23_one2one_cv3_2_0_0 = convBnSiLU(network, weightMap, *block22->getOutput(0), c2 * 4, {3, 3}, 1,\n                                              
\"model.23.one2one_cv3.2.0.0\", c2 * 4);\n    }\n\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_2_0_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_2_0_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.2.0.1\", 1);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_2_1_0 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_2_0_1->getOutput(0), c3, {3, 3}, 1,\n                       \"model.23.one2one_cv3.2.1.0\", c3);\n    nvinfer1::IElementWiseLayer* conv23_one2one_cv3_2_1_1 =\n            convBnSiLU(network, weightMap, *conv23_one2one_cv3_2_1_0->getOutput(0), c3, {1, 1}, 1,\n                       \"model.23.one2one_cv3.2.1.1\", 1);\n    nvinfer1::IConvolutionLayer* conv23_one2one_cv3_2_2 = network->addConvolutionNd(\n            *conv23_one2one_cv3_2_1_1->getOutput(0), kObbNumClass, nvinfer1::DimsHW{1, 1},\n            weightMap[\"model.23.one2one_cv3.2.2.weight\"], weightMap[\"model.23.one2one_cv3.2.2.bias\"]);\n    conv23_one2one_cv3_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_one2one_cv3_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv23_one2one_cv3_2_2->setNbGroups(1);\n    nvinfer1::IShuffleLayer* reshape23_5 = network->addShuffle(*conv23_one2one_cv3_2_2->getOutput(0));\n    reshape23_5->setReshapeDimensions(nvinfer1::Dims3{1, kObbNumClass, -1});\n\n    nvinfer1::ITensor* tensor23_1[] = {reshape23_3->getOutput(0), reshape23_4->getOutput(0), reshape23_5->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(tensor23_1, 3);\n    cat23_1->setAxis(2);\n    nvinfer1::IActivationLayer* sigmoid23 = network->addActivation(\n            *cat23_1->getOutput(0),\n            nvinfer1::ActivationType::kSIGMOID);  // TODO: THIS IS UNNECESSARY, REMOVE AFTER PLUGIN IS READY\n    /////////////////////////////////////////////////////\n\n    nvinfer1::ITensor* tensor23_5[] = {mul23_4->getOutput(0), sigmoid23->getOutput(0), cat23_2->getOutput(0)};\n    
nvinfer1::IConcatenationLayer* cat23_5 = network->addConcatenation(tensor23_5, 3);\n    cat23_5->setAxis(1);\n\n    nvinfer1::IShuffleLayer* transpose = network->addShuffle(*cat23_5->getOutput(0));\n    transpose->setFirstTranspose(nvinfer1::Permutation{0, 2, 1});\n\n    nvinfer1::IPluginV2Layer* yolo = addYoloLayer(network, *transpose->getOutput(0), strides, fm_sizes, strides.size(),\n                                                  false, false, false, true, grid_num);\n\n    /////////////////////////////////////////////////////\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n    // Use setMemoryPoolLimit instead of deprecated setMaxWorkspaceSize\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cerr << \"INT8 not supported for YOLO26 model yet.\" << std::endl;\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolo26Cls(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    
/*******************************************************************************************************\n     ******************************************  YOLO26 INPUT  **********************************************\n     *******************************************************************************************************/\n\n    nvinfer1::ITensor* data =\n            network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kClsInputH, kClsInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLO26 BACKBONE  ********************************************\n    *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* block0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n\n    nvinfer1::IElementWiseLayer* block1 = convBnSiLU(network, weightMap, *block0->getOutput(0),\n                                                     get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\");\n\n    bool c3k = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k = true;\n    }\n\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *block1->getOutput(0), get_width(128, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, false, 0.25, \"model.2\");\n\n    nvinfer1::IElementWiseLayer* block3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                     get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\");\n\n    nvinfer1::IElementWiseLayer* block4 =\n            C3K2(network, weightMap, *block3->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, false, 0.25, 
\"model.4\");\n\n    nvinfer1::IElementWiseLayer* block5 = convBnSiLU(network, weightMap, *block4->getOutput(0),\n                                                     get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n\n    nvinfer1::IElementWiseLayer* block6 =\n            C3K2(network, weightMap, *block5->getOutput(0), get_width(512, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.6\");\n\n    nvinfer1::IElementWiseLayer* block7 = convBnSiLU(network, weightMap, *block6->getOutput(0),\n                                                     get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n\n    nvinfer1::IElementWiseLayer* block8 =\n            C3K2(network, weightMap, *block7->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, false, 0.5, \"model.8\");\n\n    nvinfer1::IElementWiseLayer* block9 =\n            C2PSA(network, weightMap, *block8->getOutput(0), get_width(1024, gw, max_channels),\n                  get_width(1024, gw, max_channels), get_depth(2, gd), 0.5, \"model.9\");\n\n    /////////////////////////////////////////////////////\n\n    nvinfer1::IElementWiseLayer* block10_convbn =\n            convBnSiLU(network, weightMap, *block9->getOutput(0), 1280, {1, 1}, 1, \"model.10.conv\");\n    nvinfer1::Dims dims =\n            block10_convbn->getOutput(0)->getDimensions();  // Obtain the dimensions of the output of conv_class\n    assert(dims.nbDims == 4);\n    nvinfer1::IPoolingLayer* block10_pool = network->addPoolingNd(\n            *block10_convbn->getOutput(0), nvinfer1::PoolingType::kAVERAGE, nvinfer1::DimsHW{dims.d[2], dims.d[3]});\n    nvinfer1::IShuffleLayer* block10_reshape = network->addShuffle(*block10_pool->getOutput(0));\n    block10_reshape->setReshapeDimensions(nvinfer1::Dims2{kBatchSize, 1280});\n    nvinfer1::IConstantLayer* block10_linear_weight =\n            
network->addConstant(nvinfer1::Dims2{kClsNumClass, 1280}, weightMap[\"model.10.linear.weight\"]);\n    nvinfer1::IConstantLayer* block10_linear_bias =\n            network->addConstant(nvinfer1::Dims2{kClsNumClass, 1}, weightMap[\"model.10.linear.bias\"]);\n    nvinfer1::IMatrixMultiplyLayer* block10_linear_matrix_multiply =\n            network->addMatrixMultiply(*block10_reshape->getOutput(0), nvinfer1::MatrixOperation::kNONE,\n                                       *block10_linear_weight->getOutput(0), nvinfer1::MatrixOperation::kTRANSPOSE);\n    nvinfer1::IElementWiseLayer* block10_linear_add =\n            network->addElementWise(*block10_linear_matrix_multiply->getOutput(0), *block10_linear_bias->getOutput(0),\n                                    nvinfer1::ElementWiseOperation::kSUM);\n    nvinfer1::IActivationLayer* output =\n            network->addActivation(*block10_linear_add->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    assert(output);\n\n    output->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*output->getOutput(0));\n    // Use setMemoryPoolLimit instead of deprecated setMaxWorkspaceSize\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cerr << \"INT8 not supported for YOLO26 model yet.\" << std::endl;\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n"
  },
  {
    "path": "yolo26/src/postprocess.cpp",
    "content": "\n#include \"postprocess.h\"\n#include \"utils.h\"\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\ncv::Rect get_rect_obb(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kObbInputW / (img.cols * 1.0);\n    float r_h = kObbInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kObbInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kObbInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kObbInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kObbInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - 
int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\ncv::Rect get_rect_adapt_landmark(cv::Mat& img, float bbox[4], float lmk[kNumberOfPoints * 3]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] / r_w;\n        r = bbox[2] / r_w;\n        t = (bbox[1] - (kInputH - r_w * img.rows) / 2) / r_w;\n        b = (bbox[3] - (kInputH - r_w * img.rows) / 2) / r_w;\n        for (int i = 0; i < kNumberOfPoints * 3; i += 3) {\n            lmk[i] /= r_w;\n            lmk[i + 1] = (lmk[i + 1] - (kInputH - r_w * img.rows) / 2) / r_w;\n            // lmk[i + 2]\n        }\n    } else {\n        l = (bbox[0] - (kInputW - r_h * img.cols) / 2) / r_h;\n        r = (bbox[2] - (kInputW - r_h * img.cols) / 2) / r_h;\n        t = bbox[1] / r_h;\n        b = bbox[3] / r_h;\n        for (int i = 0; i < kNumberOfPoints * 3; i += 3) {\n            lmk[i] = (lmk[i] - (kInputW - r_h * img.cols) / 2) / r_h;\n            lmk[i + 1] /= r_h;\n            // lmk[i + 2]\n        }\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\nstatic float iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n            (std::max)(lbox[0], rbox[0]),\n            (std::min)(lbox[2], rbox[2]),\n            (std::max)(lbox[1], rbox[1]),\n            (std::min)(lbox[3], rbox[3]),\n    };\n\n    if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS = (interBox[1] - interBox[0]) * (interBox[3] - interBox[2]);\n    float unionBoxS = (lbox[2] - lbox[0]) * (lbox[3] - lbox[1]) + (rbox[2] - rbox[0]) * (rbox[3] - rbox[1]) - interBoxS;\n    return interBoxS / 
unionBoxS;\n}\n\nstatic bool cmp(const Detection& a, const Detection& b) {\n    if (a.conf == b.conf) {\n        return a.bbox[0] < b.bbox[0];\n    }\n    return a.conf > b.conf;\n}\n\nvoid decode(std::vector<Detection>& res, float* output) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        res.push_back(det);\n    }\n}\n\nvoid batch_decode(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        decode(res_batch[i], &output[i * output_size]);\n    }\n}\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect(img, res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n    }\n}\n\nvoid draw_bbox_keypoints_line(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    const std::vector<std::pair<int, int>> skeleton_pairs = {\n            {0, 1}, {0, 2},  {0, 5}, {0, 6},  {1, 2},   {1, 3},   {2, 4},   {5, 6},   {5, 7},  {5, 11},\n            {6, 8}, {6, 12}, {7, 9}, {8, 10}, {11, 12}, {11, 13}, {12, 14}, {13, 15}, {14, 16}};\n\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = 
get_rect_adapt_landmark(img, res[j].bbox, res[j].keypoints);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n\n            for (int k = 0; k < kNumberOfPoints * 3; k += 3) {\n                if (res[j].keypoints[k + 2] > 0.5) {\n                    cv::circle(img, cv::Point((int)res[j].keypoints[k], (int)res[j].keypoints[k + 1]), 3,\n                               cv::Scalar(0, 0x27, 0xC1), -1);\n                }\n            }\n\n            for (const auto& bone : skeleton_pairs) {\n                int kp1_idx = bone.first * 3;\n                int kp2_idx = bone.second * 3;\n                if (res[j].keypoints[kp1_idx + 2] > 0.5 && res[j].keypoints[kp2_idx + 2] > 0.5) {\n                    cv::Point p1((int)res[j].keypoints[kp1_idx], (int)res[j].keypoints[kp1_idx + 1]);\n                    cv::Point p2((int)res[j].keypoints[kp2_idx], (int)res[j].keypoints[kp2_idx + 1]);\n                    cv::line(img, p1, p2, cv::Scalar(0, 0x27, 0xC1), 2);\n                }\n            }\n        }\n    }\n}\n\ncv::Mat scale_mask(cv::Mat mask, cv::Mat img) {\n    int x, y, w, h;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = kInputW;\n        h = r_w * img.rows;\n        x = 0;\n        y = (kInputH - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = kInputH;\n        x = (kInputW - w) / 2;\n        y = 0;\n    }\n    cv::Rect r(x, y, w, h);\n    cv::Mat res;\n    cv::resize(mask(r), res, img.size());\n    return res;\n}\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks,\n                    std::unordered_map<int, std::string>& labels_map) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 
0x48F90A, 0x92CC17,\n                                           0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < dets.size(); i++) {\n        cv::Mat img_mask = scale_mask(masks[i], img);\n        auto color = colors[(int)dets[i].class_id % colors.size()];\n        auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n\n        cv::Rect r = get_rect(img, dets[i].bbox);\n        for (int x = r.x; x < r.x + r.width; x++) {\n            for (int y = r.y; y < r.y + r.height; y++) {\n                float val = img_mask.at<float>(y, x);\n                if (val <= 0.5)\n                    continue;\n                img.at<cv::Vec3b>(y, x)[0] = img.at<cv::Vec3b>(y, x)[0] / 2 + bgr[0] / 2;\n                img.at<cv::Vec3b>(y, x)[1] = img.at<cv::Vec3b>(y, x)[1] / 2 + bgr[1] / 2;\n                img.at<cv::Vec3b>(y, x)[2] = img.at<cv::Vec3b>(y, x)[2] / 2 + bgr[2] / 2;\n            }\n        }\n\n        cv::rectangle(img, r, bgr, 2);\n\n        // Get the size of the text\n        cv::Size textSize =\n                cv::getTextSize(labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                                cv::FONT_HERSHEY_PLAIN, 1.2, 2, NULL);\n        // Set the top left corner of the rectangle\n        cv::Point topLeft(r.x, r.y - textSize.height);\n\n        // Set the bottom right corner of the rectangle\n        cv::Point bottomRight(r.x + textSize.width, r.y + textSize.height);\n\n        // Set the thickness of the rectangle lines\n        int lineThickness = 2;\n\n        // Draw the rectangle on the image\n        cv::rectangle(img, topLeft, bottomRight, bgr, -1);\n\n        cv::putText(img, labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                    cv::Point(r.x, r.y + 4), cv::FONT_HERSHEY_PLAIN, 
1.2, cv::Scalar::all(0xFF), 2);\n    }\n}\n\n// Covariance terms (a, b, c) of the Gaussian that approximates an oriented box (w, h, angle).\nstd::tuple<float, float, float> convariance_matrix(Detection res) {\n    float w = res.bbox[2];\n    float h = res.bbox[3];\n\n    float a = w * w / 12.0;\n    float b = h * h / 12.0;\n    float c = res.angle;\n\n    float cos_r = std::cos(c);\n    float sin_r = std::sin(c);\n\n    float cos_r2 = cos_r * cos_r;\n    float sin_r2 = sin_r * sin_r;\n\n    float a_val = a * cos_r2 + b * sin_r2;\n    float b_val = a * sin_r2 + b * cos_r2;\n    float c_val = (a - b) * cos_r * sin_r;\n\n    return std::make_tuple(a_val, b_val, c_val);\n}\n\nstatic float probiou(const Detection& res1, const Detection& res2, float eps = 1e-7) {\n    // Calculate the prob iou between oriented bounding boxes, https://arxiv.org/pdf/2106.06072v1.pdf.\n    float a1, b1, c1, a2, b2, c2;\n    // Initialize the tuples from the covariance terms directly; do not read the uninitialized floats above.\n    std::tuple<float, float, float> matrix1 = convariance_matrix(res1);\n    std::tuple<float, float, float> matrix2 = convariance_matrix(res2);\n    a1 = std::get<0>(matrix1);\n    b1 = std::get<1>(matrix1);\n    c1 = std::get<2>(matrix1);\n    a2 = std::get<0>(matrix2);\n    b2 = std::get<1>(matrix2);\n    c2 = std::get<2>(matrix2);\n\n    float x1 = res1.bbox[0], y1 = res1.bbox[1];\n    float x2 = res2.bbox[0], y2 = res2.bbox[1];\n\n    float t1 = ((a1 + a2) * std::pow(y1 - y2, 2) + (b1 + b2) * std::pow(x1 - x2, 2)) /\n               ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2) + eps);\n    float t2 = ((c1 + c2) * (x2 - x1) * (y1 - y2)) / ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2) + eps);\n    float t3 = std::log(\n            ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2)) /\n                    (4 * std::sqrt(std::max(a1 * b1 - c1 * c1, 0.0f)) * std::sqrt(std::max(a2 * b2 - c2 * c2, 0.0f)) +\n                     eps) +\n            eps);\n\n    float bd = 0.25f * t1 + 0.5f * t2 + 0.5f * t3;\n    bd = std::max(std::min(bd, 100.0f), eps);\n    float hd = std::sqrt(1.0 - std::exp(-bd) + eps);\n\n    return 
1 - hd;\n}\n\nvoid decode_obb(std::vector<Detection>& res, float* output) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        res.push_back(det);\n    }\n}\n\nvoid batch_decode_obb(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        decode_obb(res_batch[i], &output[i * output_size]);\n    }\n}\n\nstatic std::vector<cv::Point> get_corner(cv::Mat& img, const Detection& box) {\n    float cos_value, sin_value;\n\n    // Calculate center point and width/height\n    float x1 = box.bbox[0];\n    float y1 = box.bbox[1];\n    float w = box.bbox[2];\n    float h = box.bbox[3];\n    float angle = box.angle * 180.0f / CV_PI;  // Convert radians to degrees\n\n    // Print original angle\n    std::cout << \"Original angle: \" << angle << std::endl;\n\n    // Swap width and height if height is greater than or equal to width\n    if (h >= w) {\n        std::swap(w, h);\n        angle = fmod(angle + 90.0f, 180.0f);  // Adjust angle to be within [0, 180)\n    }\n\n    // Ensure the angle is between 0 and 180 degrees\n    if (angle < 0) {\n        angle += 360.0f;  // Convert to positive value\n    }\n    if (angle > 180.0f) {\n        angle -= 180.0f;  // Subtract 180 from angles greater than 180\n    }\n\n    // Print adjusted angle\n    std::cout << \"Adjusted angle: \" << angle << std::endl;\n\n    // Convert to normal angle value\n    float normal_angle = fmod(angle, 180.0f);\n    if (normal_angle < 0) {\n        normal_angle += 180.0f;  // Ensure it's a positive value\n    }\n\n    // Print normal angle value\n    std::cout << \"Normal angle: \" << normal_angle << std::endl;\n\n    cos_value = std::cos(angle * CV_PI / 180.0f);  // Convert to 
radians\n    sin_value = std::sin(angle * CV_PI / 180.0f);\n\n    // Calculate the axis-aligned box boundaries from the center and size\n    float l = x1 - w / 2;  // Left boundary\n    float r = x1 + w / 2;  // Right boundary\n    float t = y1 - h / 2;  // Top boundary\n    float b = y1 + h / 2;  // Bottom boundary\n\n    // Use get_rect_obb to map the box back to original image coordinates\n    float bbox[4] = {l, t, r, b};\n    cv::Rect rect = get_rect_obb(img, bbox);\n\n    // Divide by 2.0f so the center is not truncated by integer division\n    float x_ = (rect.x + rect.x + rect.width) / 2.0f;   // Center x\n    float y_ = (rect.y + rect.y + rect.height) / 2.0f;  // Center y\n    float width = rect.width;                           // Width\n    float height = rect.height;                         // Height\n\n    // Calculate each corner point\n    std::vector<cv::Point> corner_points(4);\n    float vec1x = width / 2 * cos_value;\n    float vec1y = width / 2 * sin_value;\n    float vec2x = -height / 2 * sin_value;\n    float vec2y = height / 2 * cos_value;\n\n    corner_points[0] = cv::Point(int(round(x_ + vec1x + vec2x)), int(round(y_ + vec1y + vec2y)));  // Top-left corner\n    corner_points[1] = cv::Point(int(round(x_ + vec1x - vec2x)), int(round(y_ + vec1y - vec2y)));  // Top-right corner\n    corner_points[2] =\n            cv::Point(int(round(x_ - vec1x - vec2x)), int(round(y_ - vec1y - vec2y)));  // Bottom-right corner\n    corner_points[3] = cv::Point(int(round(x_ - vec1x + vec2x)), int(round(y_ - vec1y + vec2y)));  // Bottom-left corner\n\n    // Clamp corner points so they stay inside the image boundaries\n    for (auto& point : corner_points) {\n        point.x = std::max(0, std::min(point.x, img.cols - 1));\n        point.y = std::max(0, std::min(point.y, img.rows - 1));\n    }\n\n    return corner_points;\n}\n\nvoid draw_bbox_obb(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A, 0x92CC17,\n                                         
  0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        auto& img = img_batch[i];\n        for (auto& obj : res) {\n            auto color = colors[(int)obj.class_id % colors.size()];\n            auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n            auto corner_points = get_corner(img, obj);\n            cv::polylines(img, std::vector<std::vector<cv::Point>>{corner_points}, true, bgr, 1);\n\n            auto text = (std::to_string((int)(obj.class_id)) + \":\" + to_string_with_precision(obj.conf));\n            cv::Size textsize = cv::getTextSize(text, 0, 0.3, 1, nullptr);\n\n            int width = textsize.width;\n            int height = textsize.height;\n            bool outside = (corner_points[0].y - height >= 3) ? true : false;\n            cv::Point p1(corner_points[0].x, corner_points[0].y), p2;\n            p2.x = corner_points[0].x + width;\n            if (outside) {\n                p2.y = corner_points[0].y - height - 3;\n            } else {\n                p2.y = corner_points[0].y + height + 3;\n            }\n            cv::rectangle(img, p1, p2, bgr, -1, cv::LINE_AA);\n            cv::putText(\n                    img, text,\n                    cv::Point(corner_points[0].x, (outside ? corner_points[0].y - 2 : corner_points[0].y + height + 2)),\n                    0, 0.3, cv::Scalar::all(255), 1, cv::LINE_AA);\n        }\n    }\n}\n"
  },
  {
    "path": "yolo26/src/preprocess.cu",
    "content": "#include \"cuda_utils.h\"\n#include \"preprocess.h\"\n\nstatic uint8_t* img_buffer_host = nullptr;\nstatic uint8_t* img_buffer_device = nullptr;\n\n__global__ void warpaffine_kernel(uint8_t* src, int src_line_size, int src_width, int src_height, float* dst,\n                                  int dst_width, int dst_height, uint8_t const_value_st, AffineMatrix d2s, int edge) {\n    int position = blockDim.x * blockIdx.x + threadIdx.x;\n    if (position >= edge)\n        return;\n\n    float m_x1 = d2s.value[0];\n    float m_y1 = d2s.value[1];\n    float m_z1 = d2s.value[2];\n    float m_x2 = d2s.value[3];\n    float m_y2 = d2s.value[4];\n    float m_z2 = d2s.value[5];\n\n    int dx = position % dst_width;\n    int dy = position / dst_width;\n    float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;\n    float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;\n    float c0, c1, c2;\n\n    if (src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height) {\n        // out of range\n        c0 = const_value_st;\n        c1 = const_value_st;\n        c2 = const_value_st;\n    } else {\n        int y_low = floorf(src_y);\n        int x_low = floorf(src_x);\n        int y_high = y_low + 1;\n        int x_high = x_low + 1;\n\n        uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};\n        float ly = src_y - y_low;\n        float lx = src_x - x_low;\n        float hy = 1 - ly;\n        float hx = 1 - lx;\n        float w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n        uint8_t* v1 = const_value;\n        uint8_t* v2 = const_value;\n        uint8_t* v3 = const_value;\n        uint8_t* v4 = const_value;\n\n        if (y_low >= 0) {\n            if (x_low >= 0)\n                v1 = src + y_low * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v2 = src + y_low * src_line_size + x_high * 3;\n        }\n\n        if (y_high < src_height) {\n            if (x_low >= 0)\n                
v3 = src + y_high * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v4 = src + y_high * src_line_size + x_high * 3;\n        }\n\n        c0 = w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0];\n        c1 = w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1];\n        c2 = w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2];\n    }\n\n    // bgr to rgb\n    float t = c2;\n    c2 = c0;\n    c0 = t;\n\n    // normalization\n    c0 = c0 / 255.0f;\n    c1 = c1 / 255.0f;\n    c2 = c2 / 255.0f;\n\n    // rgbrgbrgb to rrrgggbbb\n    int area = dst_width * dst_height;\n    float* pdst_c0 = dst + dy * dst_width + dx;\n    float* pdst_c1 = pdst_c0 + area;\n    float* pdst_c2 = pdst_c1 + area;\n    *pdst_c0 = c0;\n    *pdst_c1 = c1;\n    *pdst_c2 = c2;\n}\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream) {\n    int img_size = src_width * src_height * 3;\n    // copy data to pinned memory\n    memcpy(img_buffer_host, src, img_size);\n    // copy data to device memory\n    CUDA_CHECK(cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size, cudaMemcpyHostToDevice, stream));\n\n    AffineMatrix s2d, d2s;\n    float scale = std::min(dst_height / (float)src_height, dst_width / (float)src_width);\n\n    s2d.value[0] = scale;\n    s2d.value[1] = 0;\n    s2d.value[2] = -scale * src_width * 0.5 + dst_width * 0.5;\n    s2d.value[3] = 0;\n    s2d.value[4] = scale;\n    s2d.value[5] = -scale * src_height * 0.5 + dst_height * 0.5;\n    cv::Mat m2x3_s2d(2, 3, CV_32F, s2d.value);\n    cv::Mat m2x3_d2s(2, 3, CV_32F, d2s.value);\n    cv::invertAffineTransform(m2x3_s2d, m2x3_d2s);\n\n    memcpy(d2s.value, m2x3_d2s.ptr<float>(0), sizeof(d2s.value));\n\n    int jobs = dst_height * dst_width;\n    int threads = 256;\n    int blocks = ceil(jobs / (float)threads);\n    warpaffine_kernel<<<blocks, threads, 0, stream>>>(img_buffer_device, src_width * 3, 
src_width, src_height, dst,\n                                                      dst_width, dst_height, 128, d2s, jobs);\n}\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream) {\n    int dst_size = dst_width * dst_height * 3;\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        cuda_preprocess(img_batch[i].ptr(), img_batch[i].cols, img_batch[i].rows, &dst[dst_size * i], dst_width,\n                        dst_height, stream);\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n    }\n}\n\nvoid cuda_preprocess_init(int max_image_size) {\n    // prepare input data in pinned memory\n    CUDA_CHECK(cudaMallocHost((void**)&img_buffer_host, max_image_size * 3));\n    // prepare input data in device memory\n    CUDA_CHECK(cudaMalloc((void**)&img_buffer_device, max_image_size * 3));\n}\n\nvoid cuda_preprocess_destroy() {\n    CUDA_CHECK(cudaFree(img_buffer_device));\n    CUDA_CHECK(cudaFreeHost(img_buffer_host));\n}"
  },
  {
    "path": "yolo26/yolo26_cls.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"types.h\"\n#include \"utils.h\"\n\n#include \"yololayer.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst static int kOutputSize = kClsNumClass;\n\nvoid batch_preprocess(std::vector<cv::Mat>& imgs, float* output, int dst_width = 224, int dst_height = 224) {\n    for (size_t b = 0; b < imgs.size(); b++) {\n        int h = imgs[b].rows;\n        int w = imgs[b].cols;\n        int m = std::min(h, w);\n        int top = (h - m) / 2;\n        int left = (w - m) / 2;\n        cv::Mat img = imgs[b](cv::Rect(left, top, m, m));\n        cv::resize(img, img, cv::Size(dst_width, dst_height), 0, 0, cv::INTER_LINEAR);\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n        img.convertTo(img, CV_32F, 1 / 255.0);\n\n        std::vector<cv::Mat> channels(3);\n        cv::split(img, channels);\n\n        // CHW format\n        for (int c = 0; c < 3; ++c) {\n            int i = 0;\n            for (int row = 0; row < dst_height; ++row) {\n                for (int col = 0; col < dst_width; ++col) {\n                    output[b * 3 * dst_height * dst_width + c * dst_height * dst_width + i] =\n                            channels[c].at<float>(row, col);\n                    ++i;\n                }\n            }\n        }\n    }\n}\n\nvoid serialize_engine(const std::string& wts_name, std::string& engine_name, float& gd, float& gw, int& max_channels,\n                      std::string& type) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine =\n            buildEngineYolo26Cls(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, 
std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** input_buffer_host, float** output_buffer_host) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kClsInputH * kClsInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, 
kBatchSize * kOutputSize * sizeof(float)));\n\n    *input_buffer_host = new float[kBatchSize * 3 * kClsInputH * kClsInputW];\n    *output_buffer_host = new float[kBatchSize * kOutputSize];\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* input, float* output,\n           int batchSize) {\n    CUDA_CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 * kClsInputH * kClsInputW * sizeof(float),\n                               cudaMemcpyHostToDevice, stream));\n    context.enqueueV2(buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                               stream));\n    cudaStreamSynchronize(stream);\n}\n\nstd::vector<int> topk(const std::vector<float>& vec, int k) {\n    std::vector<int> topk_index;\n    std::vector<size_t> vec_index(vec.size());\n    std::iota(vec_index.begin(), vec_index.end(), 0);\n\n    std::sort(vec_index.begin(), vec_index.end(),\n              [&vec](size_t index_1, size_t index_2) { return vec[index_1] > vec[index_2]; });\n\n    int k_num = std::min<int>(vec.size(), k);\n\n    for (int i = 0; i < k_num; ++i) {\n        topk_index.push_back(vec_index[i]);\n    }\n\n    return topk_index;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    std::string img_dir;\n    std::string type;\n    int model_bboxes = 0;\n    float gd = 0, gw = 0;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, gd, gw, max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolo26_cls -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr << \"./yolo26_cls -d [.engine] ../images  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n 
   // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, gd, gw, max_channels, type);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* input_buffer_host = nullptr;\n    float* output_buffer_host = nullptr;\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &input_buffer_host, &output_buffer_host);\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // Read imagenet labels\n    auto classes = read_classes(\"imagenet_classes.txt\");\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n\n        // Preprocess\n        batch_preprocess(img_batch, input_buffer_host, kClsInputW, kClsInputH);\n\n        std::ofstream p(\"engine_input.txt\");\n        if (!p) {\n            std::cout << \"could not open input file\" << std::endl;\n            assert(false);\n        }\n        for (int i = 0; i < kBatchSize * 3 * kClsInputH * kClsInputW; i++) {\n            p << input_buffer_host[i] << \"\\n\";\n        }\n        
p.close();\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        infer(*context, stream, (void**)device_buffers, input_buffer_host, output_buffer_host, kBatchSize);\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n\n        // Postprocess and get top-k result\n        for (size_t b = 0; b < img_name_batch.size(); b++) {\n            float* p = &output_buffer_host[b * kOutputSize];\n            std::vector<float> prob(p, p + kOutputSize);\n            auto topk_idx = topk(prob, 3);\n            std::cout << img_name_batch[b] << std::endl;\n            for (auto idx : topk_idx) {\n                std::cout << \"  \" << classes[idx] << \" \" << p[idx] << std::endl;\n            }\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    delete[] input_buffer_host;\n    delete[] output_buffer_host;\n    delete context;\n    delete engine;\n    delete runtime;\n    return 0;\n}"
  },
  {
    "path": "yolo26/yolo26_det.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"types.h\"\n#include \"utils.h\"\n\n#include \"yololayer.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(const std::string& wts_name, std::string& engine_name, float& gd, float& gw, int& max_channels,\n                      std::string& type) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine =\n            buildEngineYolo26Det(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = 
(*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n\n    *output_buffer_host = new float[kBatchSize * kOutputSize];\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           int model_bboxes) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueueV2(buffers, stream, nullptr);\n\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                               stream));\n    // Wait for the stream to finish before stopping the timer; enqueueV2 and cudaMemcpyAsync only queue work\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n\n    auto end = std::chrono::system_clock::now();\n    std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n              << \"ms\" << std::endl;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    std::string img_dir;\n    std::string type;\n    int model_bboxes = 0;\n    float gd = 0, gw = 0;\n    int max_channels = 0;\n\n    if 
(!parse_args(argc, argv, wts_name, engine_name, img_dir, type, gd, gw, max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolo26_det -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr << \"./yolo26_det -d [.engine] ../images  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, gd, gw, max_channels, type);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n\n    // WARN: If you change kMaxNumOutputBbox, it must be smaller than the value kMaxNumOutputBbox in config.h,\n    // otherwise there will be memory overflow!\n    // Or you should modify the config.h and recompile.\n    setPluginDeviceParams(kConfThresh);\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host);\n\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        
for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n\n        // Run inference\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize, model_bboxes);\n\n        std::vector<std::vector<Detection>> res_batch;\n        batch_decode(res_batch, output_buffer_host, kBatchSize, kOutputSize);\n\n        // Draw bounding boxes\n        draw_bbox(img_batch, res_batch);\n\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    return 0;\n}"
  },
  {
    "path": "yolo26/yolo26_obb.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"types.h\"\n#include \"utils.h\"\n#include \"yololayer.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(const std::string& wts_name, std::string& engine_name, float& gd, float& gw, int& max_channels,\n                      std::string& type) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine =\n            buildEngineYolo26Obb(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = 
(*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kObbInputH * kObbInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n\n    *output_buffer_host = new float[kBatchSize * kOutputSize];\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           int model_bboxes) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueueV2(buffers, stream, nullptr);\n\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                               stream));\n    // Wait for the stream to finish before stopping the timer; enqueueV2 and cudaMemcpyAsync only queue work\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n\n    auto end = std::chrono::system_clock::now();\n    std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n              << \"ms\" << std::endl;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    std::string img_dir;\n    std::string type;\n    int model_bboxes;\n    float gd = 0.0f, gw = 0.0f;\n    int max_channels = 0;\n\n    
if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, gd, gw, max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolo26_obb -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./yolo26_obb -d [.engine] ../images  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, gd, gw, max_channels, type);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n\n    setPluginDeviceParams(kConfThresh);\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n     
   }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kObbInputW, kObbInputH, stream);\n\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize, model_bboxes);\n\n        std::vector<std::vector<Detection>> res_batch;\n        batch_decode_obb(res_batch, output_buffer_host, img_batch.size(), kOutputSize);\n        draw_bbox_obb(img_batch, res_batch);\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    return 0;\n}"
  },
  {
    "path": "yolop/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(yolop)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Release)\n\nfind_package(CUDA  REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n\nfind_package(OpenCV REQUIRED)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\n# cuda\ninclude_directories(/usr/local/cuda-10.2/include)\nlink_directories(/usr/local/cuda-10.2/lib64)\n# tensorrt\ninclude_directories(/usr/include/aarch64-linux-gnu/)\nlink_directories(/usr/lib/aarch64-linux-gnu/)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\n# to generate plugins\ncuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/yololayer.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\n# to generate trt and test image dir\nadd_executable(yolop ${PROJECT_SOURCE_DIR}/yolop.cpp)\ntarget_link_libraries(yolop nvinfer cudart myplugins ${OpenCV_LIBS})\nadd_definitions(-O3 -pthread)\n\n"
  },
  {
    "path": "yolop/README.md",
    "content": "YoloP\n=====\n\nThe original PyTorch model is from [hustvl/YOLOP](https://github.com/hustvl/YOLOP).\n\n## Authors\n\n<a href=\"https://github.com/ausk\"><img src=\"https://avatars.githubusercontent.com/u/4545060?v=4?s=48\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/aliceint\"><img src=\"https://avatars.githubusercontent.com/u/15520773?v=4?s=48\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/mantuoluozk\"><img src=\"https://avatars.githubusercontent.com/u/43333969?v=4?s=48\" width=\"40px;\" alt=\"\"/></a>\n\n## 1. Prepare the build environment\n\nMake sure you have installed `c++` (with C++11 support), `cmake`, `opencv` (4.x), `cuda` (10.x), and `nvinfer` (7.x).\n\n\n## 2. Build yolop\n\nGo to `yolop`.\n\n```\nmkdir build\ncd build\n\ncmake ..\nmake\n```\n\nThis produces `yolop` and `libmyplugins.so`.\n\n\n## 3. Test in C++\n\nGo to `yolop/build`.\n\n### 3.1 Generate yolop.wts\nDownload or clone [YOLOP](https://github.com/hustvl/YOLOP).\n\nEdit `gen_wts.py` and change `YOLOP_BASE_DIR` to the real path of your `YOLOP` checkout.\n\n```\n# [WARN] Please download/clone YOLOP, then set YOLOP_BASE_DIR to the root of YOLOP\npython3 ../gen_wts.py\n```\n\n### 3.2 Generate yolop.trt\n```\n./yolop -s yolop.wts  yolop.trt\n```\n\nYou should now have these files: `libmyplugins.so`, `yolop`, `yolop.wts`, `yolop.trt`.\n\n\n### 3.3 Test yolop.trt\n```\nmkdir ../results\n\nYOLOP_BASE_DIR=/home/user/jetson/tmp/YOLOP\n./yolop -d yolop.trt  $YOLOP_BASE_DIR/inference/images/\n```\n\nIf successful, the output looks like the following (tested on `Jetson Xavier NX - Jetpack 4.4`):\n```\n1601ms # the first run is slow\n26ms   # then it is faster\n29ms\n27ms\n29ms\n29ms\n```\n\n![](https://user-images.githubusercontent.com/4545060/197756635-38348dc5-d8e7-4ae3-be56-6b231dd2f5db.jpg)\n\n\n## 4. 
Test in python3\nGo to `yolop`.\n\nMake sure you have installed `pycuda` and `tensorrt`, and set `image_dir` to your image directory.\n\n```\n# usage: xxx <engine file> <plugin file> <image dir>\n\npython3 yolop_trt.py  build/yolop.trt  build/libmyplugins.so /home/user/jetson/tmp/YOLOP/inference/images\n```\n\nIf successful, the output looks like the following (tested on `Jetson Xavier NX - Jetpack 4.4`):\n```\nusage: xxx <engine file> <plugin file> <image dir>\n[WARN] preaprea you image_dir, such as: samples, or /home/user/jetson/tmp/YOLOP/inference/images\nbingding:  data (3, 384, 640)\nbingding:  det (6001, 1, 1)\nbingding:  seg (1, 360, 640)\nbingding:  lane (1, 360, 640)\nbatch size is 1\nwarm_up->(384, 640, 3), time->1070.87ms\ninput->['/home/user/jetson/tmp/YOLOP/inference/images/3c0e7240-96e390d2.jpg'], time->25.94ms, saving into output/\ninput->['/home/user/jetson/tmp/YOLOP/inference/images/adb4871d-4d063244.jpg'], time->25.34ms, saving into output/\ninput->['/home/user/jetson/tmp/YOLOP/inference/images/8e1c1ab0-a8b92173.jpg'], time->25.03ms, saving into output/\ninput->['/home/user/jetson/tmp/YOLOP/inference/images/7dd9ef45-f197db95.jpg'], time->25.45ms, saving into output/\ninput->['/home/user/jetson/tmp/YOLOP/inference/images/9aa94005-ff1d4c9a.jpg'], time->24.93ms, saving into output/\ninput->['/home/user/jetson/tmp/YOLOP/inference/images/0ace96c3-48481887.jpg'], time->25.33ms, saving into output/\ndone!\n```\n\n![](https://user-images.githubusercontent.com/4545060/198003852-204f3bae-18ad-44fb-9ecd-4a2a07a726a3.jpg)\n\n\n**Notice**: The results of the C++ and Python versions are not yet aligned.\n\n----------------------------------------\n\n```BibTeX\n@misc{2108.11250,\nAuthor = {Dong Wu and Manwen Liao and Weitian Zhang and Xinggang Wang},\nTitle = {YOLOP: You Only Look Once for Panoptic Driving Perception},\nYear = {2021},\nEprint = {arXiv:2108.11250},\n}\n```\n\n"
  },
  {
    "path": "yolop/common.hpp",
    "content": "#pragma once\n\n#include <fstream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"yololayer.h\"\n\nusing namespace nvinfer1;\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    int l, r, t, b;\n    float r_w = Yolo::INPUT_W / (img.cols * 1.0);\n    float r_h = Yolo::INPUT_H / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] - bbox[2] / 2.f;\n        r = bbox[0] + bbox[2] / 2.f;\n        t = bbox[1] - bbox[3] / 2.f - (Yolo::INPUT_H - r_w * img.rows) / 2;\n        b = bbox[1] + bbox[3] / 2.f - (Yolo::INPUT_H - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - bbox[2] / 2.f - (Yolo::INPUT_W - r_h * img.cols) / 2;\n        r = bbox[0] + bbox[2] / 2.f - (Yolo::INPUT_W - r_h * img.cols) / 2;\n        t = bbox[1] - bbox[3] / 2.f;\n        b = bbox[1] + bbox[3] / 2.f;\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    return cv::Rect(l, t, r - l, b - t);\n}\n\nfloat iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n        (std::max)(lbox[0] - lbox[2] / 2.f , rbox[0] - rbox[2] / 2.f), //left\n        (std::min)(lbox[0] + lbox[2] / 2.f , rbox[0] + rbox[2] / 2.f), //right\n        (std::max)(lbox[1] - lbox[3] / 2.f , rbox[1] - rbox[3] / 2.f), //top\n        (std::min)(lbox[1] + lbox[3] / 2.f , rbox[1] + rbox[3] / 2.f), //bottom\n    };\n\n    if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS = (interBox[1] - interBox[0])*(interBox[3] - interBox[2]);\n    return interBoxS / (lbox[2] * lbox[3] + rbox[2] * rbox[3] - interBoxS);\n}\n\nbool cmp(const Yolo::Detection& a, const Yolo::Detection& b) {\n    return a.conf > b.conf;\n}\n\nvoid nms(std::vector<Yolo::Detection>& res, float *output, float conf_thresh, float nms_thresh = 0.5) {\n    int det_size = 
sizeof(Yolo::Detection) / sizeof(float);\n    std::map<float, std::vector<Yolo::Detection>> m;\n    for (int i = 0; i < output[0] && i < Yolo::MAX_OUTPUT_BBOX_COUNT; i++) {\n        if (output[1 + det_size * i + 4] <= conf_thresh) continue;\n        Yolo::Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        if (m.count(det.class_id) == 0) m.emplace(det.class_id, std::vector<Yolo::Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        //std::cout << it->second[0].class_id << \" --- \" << std::endl;\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. 
please check if the .wts file path is right!!!!!!\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{ DataType::kFLOAT, nullptr, 0 };\n        uint32_t size;\n\n        // Read name and element count of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob: one 32-bit hex word per weight (allocate by element size, not pointer size)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{ DataType::kFLOAT, scval, len };\n\n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{ DataType::kFLOAT, shval, len };\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{ DataType::kFLOAT, pval, len };\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname 
+ \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* convBlock(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int g, std::string lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n    int p = ksize / 2;\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ ksize, ksize }, weightMap[lname + \".conv.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{ s, s });\n    conv1->setPaddingNd(DimsHW{ p, p });\n    conv1->setNbGroups(g);\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn\", 1e-3);\n\n    // silu = x * sigmoid\n    // auto sig = network->addActivation(*bn1->getOutput(0), ActivationType::kSIGMOID);\n    // assert(sig);\n    // auto ew = network->addElementWise(*bn1->getOutput(0), *sig->getOutput(0), ElementWiseOperation::kPROD);\n    // assert(ew);\n\n    // hard_swish = x * hard_sigmoid\n    auto hsig = network->addActivation(*bn1->getOutput(0), ActivationType::kHARD_SIGMOID);\n    assert(hsig);\n    hsig->setAlpha(1.0 / 6.0);\n    hsig->setBeta(0.5);\n    auto ew = network->addElementWise(*bn1->getOutput(0), *hsig->getOutput(0), ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nILayer* focus(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int ksize, std::string lname) {\n    ISliceLayer *s1 = network->addSlice(input, Dims3{ 0, 0, 0 }, Dims3{ inch, Yolo::INPUT_H / 2, Yolo::INPUT_W / 2 }, Dims3{ 1, 2, 2 });\n    ISliceLayer *s2 = network->addSlice(input, Dims3{ 0, 1, 0 }, Dims3{ inch, Yolo::INPUT_H / 2, Yolo::INPUT_W / 2 }, Dims3{ 1, 2, 2 });\n    ISliceLayer *s3 = network->addSlice(input, Dims3{ 0, 0, 1 }, Dims3{ inch, Yolo::INPUT_H / 2, 
Yolo::INPUT_W / 2 }, Dims3{ 1, 2, 2 });\n    ISliceLayer *s4 = network->addSlice(input, Dims3{ 0, 1, 1 }, Dims3{ inch, Yolo::INPUT_H / 2, Yolo::INPUT_W / 2 }, Dims3{ 1, 2, 2 });\n    ITensor* inputTensors[] = { s1->getOutput(0), s2->getOutput(0), s3->getOutput(0), s4->getOutput(0) };\n    auto cat = network->addConcatenation(inputTensors, 4);\n    auto conv = convBlock(network, weightMap, *cat->getOutput(0), outch, ksize, 1, 1, lname + \".conv\");\n    return conv;\n}\n\nILayer* bottleneck(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, bool shortcut, int g, float e, std::string lname) {\n    auto cv1 = convBlock(network, weightMap, input, (int)((float)c2 * e), 1, 1, 1, lname + \".cv1\");\n    auto cv2 = convBlock(network, weightMap, *cv1->getOutput(0), c2, 3, 1, g, lname + \".cv2\");\n    if (shortcut && c1 == c2) {\n        auto ew = network->addElementWise(input, *cv2->getOutput(0), ElementWiseOperation::kSUM);\n        return ew;\n    }\n    return cv2;\n}\n\nILayer* bottleneckCSP(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, int n, bool shortcut, int g, float e, std::string lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n    int c_ = (int)((float)c2 * e);\n    auto cv1 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv1\");\n    auto cv2 = network->addConvolutionNd(input, c_, DimsHW{ 1, 1 }, weightMap[lname + \".cv2.weight\"], emptywts);\n    ITensor *y1 = cv1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        auto b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, g, 1.0, lname + \".m.\" + std::to_string(i));\n        y1 = b->getOutput(0);\n    }\n    auto cv3 = network->addConvolutionNd(*y1, c_, DimsHW{ 1, 1 }, weightMap[lname + \".cv3.weight\"], emptywts);\n\n    ITensor* inputTensors[] = { cv3->getOutput(0), cv2->getOutput(0) };\n    auto cat = network->addConcatenation(inputTensors, 2);\n\n    
IScaleLayer* bn = addBatchNorm2d(network, weightMap, *cat->getOutput(0), lname + \".bn\", 1e-4);\n    auto lr = network->addActivation(*bn->getOutput(0), ActivationType::kLEAKY_RELU);\n    lr->setAlpha(0.1);\n\n    auto cv4 = convBlock(network, weightMap, *lr->getOutput(0), c2, 1, 1, 1, lname + \".cv4\");\n    return cv4;\n}\n\nILayer* C3(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, int n, bool shortcut, int g, float e, std::string lname) {\n    int c_ = (int)((float)c2 * e);\n    auto cv1 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv1\");\n    auto cv2 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv2\");\n    ITensor *y1 = cv1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        auto b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, g, 1.0, lname + \".m.\" + std::to_string(i));\n        y1 = b->getOutput(0);\n    }\n\n    ITensor* inputTensors[] = { y1, cv2->getOutput(0) };\n    auto cat = network->addConcatenation(inputTensors, 2);\n\n    auto cv3 = convBlock(network, weightMap, *cat->getOutput(0), c2, 1, 1, 1, lname + \".cv3\");\n    return cv3;\n}\n\nILayer* SPP(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, int k1, int k2, int k3, std::string lname) {\n    int c_ = c1 / 2;\n    auto cv1 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv1\");\n\n    auto pool1 = network->addPoolingNd(*cv1->getOutput(0), PoolingType::kMAX, DimsHW{ k1, k1 });\n    pool1->setPaddingNd(DimsHW{ k1 / 2, k1 / 2 });\n    pool1->setStrideNd(DimsHW{ 1, 1 });\n    auto pool2 = network->addPoolingNd(*cv1->getOutput(0), PoolingType::kMAX, DimsHW{ k2, k2 });\n    pool2->setPaddingNd(DimsHW{ k2 / 2, k2 / 2 });\n    pool2->setStrideNd(DimsHW{ 1, 1 });\n    auto pool3 = network->addPoolingNd(*cv1->getOutput(0), PoolingType::kMAX, DimsHW{ k3, k3 });\n    pool3->setPaddingNd(DimsHW{ k3 / 2, k3 / 2 });\n    
pool3->setStrideNd(DimsHW{ 1, 1 });\n\n    ITensor* inputTensors[] = { cv1->getOutput(0), pool1->getOutput(0), pool2->getOutput(0), pool3->getOutput(0) };\n    auto cat = network->addConcatenation(inputTensors, 4);\n\n    auto cv2 = convBlock(network, weightMap, *cat->getOutput(0), c2, 1, 1, 1, lname + \".cv2\");\n    return cv2;\n}\n\nILayer* preprocess_layer(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input) {\n    // rescale\n    auto rescale = network->addResize(input);\n    rescale->setOutputDimensions(Dims3{ 3, Yolo::IMG_H, Yolo::IMG_W });\n    rescale->setResizeMode(ResizeMode::kLINEAR);\n    // normalize\n    // long len = 3 * Yolo::IMG_H * Yolo::IMG_W;\n    // float *normval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    // for (size_t i = 0; i < len; ++i) {\n    //     normval[i] = 255.0;\n    // }\n    // Weights norm{ DataType::kFLOAT, normval, len };\n    // weightMap[\"prep.norm\"] = norm;\n    // auto constant = network->addConstant(Dims3{ 3, Yolo::IMG_H, Yolo::IMG_W }, norm);\n    // auto normalize = network->addElementWise(*rescale->getOutput(0), *constant->getOutput(0), ElementWiseOperation::kDIV);\n\n    //paddng\n    auto padding = network->addPaddingNd(*rescale->getOutput(0),\n                                        DimsHW{ (Yolo::INPUT_H - Yolo::IMG_H) / 2, (Yolo::INPUT_W - Yolo::IMG_W) / 2 },\n                                        DimsHW{ (Yolo::INPUT_H - Yolo::IMG_H) / 2, (Yolo::INPUT_W - Yolo::IMG_W) / 2 });\n\n    assert(padding);\n    return padding;\n\n}\n\nstd::vector<float> getAnchors(std::map<std::string, Weights>& weightMap)\n{\n    std::vector<float> anchors_yolo;\n    Weights Yolo_Anchors = weightMap[\"model.24.anchor_grid\"];\n    assert(Yolo_Anchors.count == 18);\n    int each_yololayer_anchorsnum = Yolo_Anchors.count / 3;\n    const float* tempAnchors = (const float*)(Yolo_Anchors.values);\n    for (int i = 0; i < Yolo_Anchors.count; i++)\n    {\n        if (i < 
each_yololayer_anchorsnum)\n        {\n            anchors_yolo.push_back(const_cast<float*>(tempAnchors)[i]);\n        }\n        if ((i >= each_yololayer_anchorsnum) && (i < (2 * each_yololayer_anchorsnum)))\n        {\n            anchors_yolo.push_back(const_cast<float*>(tempAnchors)[i]);\n        }\n        if (i >= (2 * each_yololayer_anchorsnum))\n        {\n            anchors_yolo.push_back(const_cast<float*>(tempAnchors)[i]);\n        }\n    }\n\n    return anchors_yolo;\n}\n\nIPluginV2Layer* addYoLoLayer(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, IConvolutionLayer* det0, IConvolutionLayer* det1, IConvolutionLayer* det2)\n{\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    std::vector<float> anchors_yolo = getAnchors(weightMap);\n    PluginField pluginMultidata[4];\n    int NetData[4];\n    NetData[0] = Yolo::CLASS_NUM;\n    NetData[1] = Yolo::INPUT_W;\n    NetData[2] = Yolo::INPUT_H;\n    NetData[3] = Yolo::MAX_OUTPUT_BBOX_COUNT;\n    pluginMultidata[0].data = NetData;\n    pluginMultidata[0].length = 3;\n    pluginMultidata[0].name = \"netdata\";\n    pluginMultidata[0].type = PluginFieldType::kFLOAT32;\n    int scale[3] = { 8, 16, 32 };\n    int plugindata[3][8];\n    std::string names[3];\n    for (int k = 1; k < 4; k++)\n    {\n        plugindata[k - 1][0] = Yolo::INPUT_W / scale[k - 1];\n        plugindata[k - 1][1] = Yolo::INPUT_H / scale[k - 1];\n        for (int i = 2; i < 8; i++)\n        {\n            plugindata[k - 1][i] = int(anchors_yolo[(k - 1) * 6 + i - 2]);\n        }\n        pluginMultidata[k].data = plugindata[k - 1];\n        pluginMultidata[k].length = 8;\n        names[k - 1] = \"yolodata\" + std::to_string(k);\n        pluginMultidata[k].name = names[k - 1].c_str();\n        pluginMultidata[k].type = PluginFieldType::kFLOAT32;\n    }\n    PluginFieldCollection pluginData;\n    pluginData.nbFields = 4;\n    pluginData.fields = pluginMultidata;\n    IPluginV2 
*pluginObj = creator->createPlugin(\"yololayer\", &pluginData);\n    ITensor* inputTensors_yolo[] = { det2->getOutput(0), det1->getOutput(0), det0->getOutput(0) };\n    auto yolo = network->addPluginV2(inputTensors_yolo, 3, *pluginObj);\n    return yolo;\n}\n"
  },
  {
    "path": "yolop/cuda_utils.h",
    "content": "#pragma once\n#include <cuda_runtime_api.h>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)\\\n    {\\\n        cudaError_t error_code = callstr;\\\n        if (error_code != cudaSuccess) {\\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__;\\\n            assert(0);\\\n        }\\\n    }\n#endif  // CUDA_CHECK\n\n"
  },
  {
    "path": "yolop/gen_wts.py",
    "content": "import os, sys\nimport torch\nimport struct\n\n# TODO: YOLOP_BASE_DIR is the root of YOLOP\nprint(\"[WARN] Please download/clone YOLOP, then set YOLOP_BASE_DIR to the root of YOLOP\")\n\n#YOLOP_BASE_DIR = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\nYOLOP_BASE_DIR = \"/home/user/jetson/tmp/YOLOP\"\n\nsys.path.append(YOLOP_BASE_DIR)\nfrom lib.models import get_net\nfrom lib.config import cfg\n\n\n# Initialize\ndevice = torch.device('cpu')\n# Load model\nmodel = get_net(cfg)\ncheckpoint = torch.load(YOLOP_BASE_DIR + '/weights/End-to-end.pth', map_location=device)\nmodel.load_state_dict(checkpoint['state_dict'])\n# load to FP32\nmodel.float()\nmodel.to(device).eval()\n\nf = open('yolop.wts', 'w')\nf.write('{}\\n'.format(len(model.state_dict().keys())))\nfor k, v in model.state_dict().items():\n    vr = v.reshape(-1).cpu().numpy()\n    f.write('{} {} '.format(k, len(vr)))\n    for vv in vr:\n        f.write(' ')\n        f.write(struct.pack('>f',float(vv)).hex())\n    f.write('\\n')\n\nf.close()\n\nprint(\"save as yolop.wts\")"
  },
  {
    "path": "yolop/logging.h",
    "content": "// create by ausk(jinlj) 2022/10/25\n#pragma once\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#else\n#define TRT_NOEXCEPT\n#endif\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override\n    {\n        if (severity < Severity::kINFO) {\n            std::cout << msg << std::endl;\n        }\n    }\n};\n"
  },
  {
    "path": "yolop/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H"
  },
  {
    "path": "yolop/utils.h",
    "content": "#pragma once\n\n#include <dirent.h>\n#include <opencv2/opencv.hpp>\n\n#include <iostream>\n#include \"common.hpp\"\n\n#define SHOW_IMG\n\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols*1.0);\n    float r_h = input_h / (img.rows*1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(114, 114, 114));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    cv::Mat tensor;\n    out.convertTo(tensor, CV_32FC3, 1.f / 255.f);\n\n    cv::subtract(tensor, cv::Scalar(0.485, 0.456, 0.406), tensor, cv::noArray(), -1);\n    cv::divide(tensor, cv::Scalar(0.229, 0.224, 0.225), tensor, 1, -1);\n    return tensor;\n}\n\nstatic inline int read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n"
  },
  {
    "path": "yolop/yololayer.cu",
    "content": "#include <assert.h>\n#include <vector>\n#include <iostream>\n#include \"yololayer.h\"\n#include \"cuda_utils.h\"\n\nnamespace Tn\n{\n    template<typename T>\n    void write(char*& buffer, const T& val)\n    {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T>\n    void read(const char*& buffer, T& val)\n    {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n}\n\nusing namespace Yolo;\n\nnamespace nvinfer1\n{\n    YoloLayerPlugin::YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const std::vector<Yolo::YoloKernel>& vYoloKernel)\n    {\n        mClassCount = classCount;\n        mNetWidth = netWidth;\n        mNetHeight = netHeight;\n        mMaxOutObject = maxOut;\n        mYoloKernel = vYoloKernel;\n        mKernelCount = vYoloKernel.size();\n\n        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n        size_t AnchorLen = sizeof(float)* CHECK_COUNT * 2;\n        for (int ii = 0; ii < mKernelCount; ii++)\n        {\n            CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));\n            const auto& yolo = mYoloKernel[ii];\n            CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n        }\n    }\n    YoloLayerPlugin::~YoloLayerPlugin()\n    {\n        for (int ii = 0; ii < mKernelCount; ii++)\n        {\n            CUDA_CHECK(cudaFree(mAnchor[ii]));\n        }\n        CUDA_CHECK(cudaFreeHost(mAnchor));\n    }\n\n    // create the plugin at runtime from a byte stream\n    YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length)\n    {\n        using namespace Tn;\n        const char *d = reinterpret_cast<const char *>(data), *a = d;\n        read(d, mClassCount);\n        read(d, mThreadCount);\n        read(d, mKernelCount);\n        read(d, mNetWidth);\n        read(d, mNetHeight);\n        read(d, mMaxOutObject);\n        
mYoloKernel.resize(mKernelCount);\n        auto kernelSize = mKernelCount * sizeof(YoloKernel);\n        memcpy(mYoloKernel.data(), d, kernelSize);\n        d += kernelSize;\n        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n        size_t AnchorLen = sizeof(float)* CHECK_COUNT * 2;\n        for (int ii = 0; ii < mKernelCount; ii++)\n        {\n            CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));\n            const auto& yolo = mYoloKernel[ii];\n            CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n        }\n        assert(d == a + length);\n    }\n\n    void YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT\n    {\n        using namespace Tn;\n        char* d = static_cast<char*>(buffer), *a = d;\n        write(d, mClassCount);\n        write(d, mThreadCount);\n        write(d, mKernelCount);\n        write(d, mNetWidth);\n        write(d, mNetHeight);\n        write(d, mMaxOutObject);\n        auto kernelSize = mKernelCount * sizeof(YoloKernel);\n        memcpy(d, mYoloKernel.data(), kernelSize);\n        d += kernelSize;\n\n        assert(d == a + getSerializationSize());\n    }\n\n    size_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT\n    {\n        return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount) + sizeof(Yolo::YoloKernel) * mYoloKernel.size() + sizeof(mNetWidth) + sizeof(mNetHeight) + sizeof(mMaxOutObject);\n    }\n\n    int YoloLayerPlugin::initialize() TRT_NOEXCEPT\n    {\n        return 0;\n    }\n\n    Dims YoloLayerPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT\n    {\n        //output the result to channel\n        int totalsize = mMaxOutObject * sizeof(Detection) / sizeof(float);\n\n        return Dims3(totalsize + 1, 1, 1);\n    }\n\n    // Set plugin namespace\n    void YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT\n    {\n        mPluginNamespace = 
pluginNamespace;\n    }\n\n    const char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT\n    {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    void YoloLayerPlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT\n    {\n    }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\n    const char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    void YoloLayerPlugin::destroy() TRT_NOEXCEPT\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT\n    {\n        YoloLayerPlugin* p = new YoloLayerPlugin(mClassCount, mNetWidth, mNetHeight, mMaxOutObject, mYoloKernel);\n        
p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float Logist(float data) { return 1.0f / (1.0f + expf(-data)); };\n\n    __global__ void CalDetection(const float *input, float *output, int noElements,\n        const int netwidth, const int netheight, int maxoutobject, int yoloWidth, int yoloHeight, const float anchors[CHECK_COUNT * 2], int classes, int outputElem)\n    {\n\n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= noElements) return;\n\n        int total_grid = yoloWidth * yoloHeight;\n        int bnIdx = idx / total_grid;\n        idx = idx - total_grid * bnIdx;\n        int info_len_i = 5 + classes;\n        const float* curInput = input + bnIdx * (info_len_i * total_grid * CHECK_COUNT);\n\n        for (int k = 0; k < 3; ++k) {\n            float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);\n            if (box_prob < IGNORE_THRESH) continue;\n            int class_id = 0;\n            float max_cls_prob = 0.0;\n            for (int i = 5; i < info_len_i; ++i) {\n                float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);\n                if (p > max_cls_prob) {\n                    max_cls_prob = p;\n                    class_id = i - 5;\n                }\n            }\n            float *res_count = output + bnIdx * outputElem;\n            int count = (int)atomicAdd(res_count, 1);\n            if (count >= maxoutobject) return;\n            char* data = (char *)res_count + sizeof(float) + count * sizeof(Detection);\n            Detection* det = (Detection*)(data);\n\n            int row = idx / yoloWidth;\n            int col = idx % yoloWidth;\n\n            //Location\n            // pytorch:\n            //  y = x[i].sigmoid()\n            //  y[..., 0:2] = (y[..., 0:2] * 2. 
- 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy\n            //  y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n            //  X: (sigmoid(tx) + cx)/FeaturemapW *  netwidth\n            det->bbox[0] = (col - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 0 * total_grid])) * netwidth / yoloWidth;\n            det->bbox[1] = (row - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid])) * netheight / yoloHeight;\n\n            // W: (Pw * e^tw) / FeaturemapW * netwidth\n            // v5: https://github.com/ultralytics/yolov5/issues/471\n            det->bbox[2] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 2 * total_grid]);\n            det->bbox[2] = det->bbox[2] * det->bbox[2] * anchors[2 * k];\n            det->bbox[3] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 3 * total_grid]);\n            det->bbox[3] = det->bbox[3] * det->bbox[3] * anchors[2 * k + 1];\n            det->conf = box_prob * max_cls_prob;\n            det->class_id = class_id;\n        }\n    }\n\n    void YoloLayerPlugin::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize)\n    {\n        int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n        for (int idx = 0; idx < batchSize; ++idx) {\n            CUDA_CHECK(cudaMemset(output + idx * outputElem, 0, sizeof(float)));\n        }\n        int numElem = 0;\n        for (unsigned int i = 0; i < mYoloKernel.size(); ++i)\n        {\n            const auto& yolo = mYoloKernel[i];\n            numElem = yolo.width*yolo.height*batchSize;\n            if (numElem < mThreadCount)\n                mThreadCount = numElem;\n\n            //printf(\"Net: %d  %d \\n\", mNetWidth, mNetHeight);\n            CalDetection << < (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount >> >\n                (inputs[i], output, numElem, mNetWidth, mNetHeight, 
mMaxOutObject, yolo.width, yolo.height, (float *)mAnchor[i], mClassCount, outputElem);\n        }\n    }\n\n\n    int YoloLayerPlugin::enqueue(int batchSize, const void*const * inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT\n    {\n        forwardGpu((const float *const *)inputs, (float*)outputs[0], stream, batchSize);\n        return 0;\n    }\n\n    PluginFieldCollection YoloPluginCreator::mFC{};\n    std::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\n    YoloPluginCreator::YoloPluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    const PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT\n    {\n        return &mFC;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT\n    {\n        int class_count = -1;\n        int input_w = -1;\n        int input_h = -1;\n        int max_output_object_count = -1;\n        std::vector<Yolo::YoloKernel> yolo_kernels(3);\n\n        const PluginField* fields = fc->fields;\n        for (int i = 0; i < fc->nbFields; i++) {\n            if (strcmp(fields[i].name, \"netdata\") == 0) {\n                assert(fields[i].type == PluginFieldType::kFLOAT32);\n                int *tmp = (int*)(fields[i].data);\n                class_count = tmp[0];\n                input_w = tmp[1];\n                input_h = tmp[2];\n                max_output_object_count = tmp[3];\n            } else if (strstr(fields[i].name, \"yolodata\") != NULL) {\n                assert(fields[i].type == PluginFieldType::kFLOAT32);\n                int *tmp = 
(int*)(fields[i].data);\n                YoloKernel kernel;\n                kernel.width = tmp[0];\n                kernel.height = tmp[1];\n                for (int j = 0; j < fields[i].length - 2; j++) {\n                    kernel.anchors[j] = tmp[j + 2];\n                }\n                yolo_kernels[2 - (fields[i].name[8] - '1')] = kernel;\n            }\n        }\n        assert(class_count && input_w && input_h && max_output_object_count);\n        YoloLayerPlugin* obj = new YoloLayerPlugin(class_count, input_w, input_h, max_output_object_count, yolo_kernels);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call YoloLayerPlugin::destroy()\n        YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n}\n\n"
  },
  {
    "path": "yolop/yololayer.h",
    "content": "#ifndef _YOLO_LAYER_H\n#define _YOLO_LAYER_H\n\n#include <vector>\n#include <string>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\nnamespace Yolo\n{\n    static constexpr int CHECK_COUNT = 3;\n    static constexpr float IGNORE_THRESH = 0.1f;\n    struct YoloKernel\n    {\n        int width;\n        int height;\n        float anchors[CHECK_COUNT * 2];\n    };\n    static constexpr int MAX_OUTPUT_BBOX_COUNT = 1000;\n    static constexpr int CLASS_NUM = 1;\n    static constexpr int INPUT_H = 384;\n    static constexpr int INPUT_W = 640;\n    static constexpr int IMG_H = 360;\n    static constexpr int IMG_W = 640;\n\n    static constexpr int LOCATIONS = 4;\n    struct alignas(float) Detection {\n        //center_x center_y w h\n        float bbox[LOCATIONS];\n        float conf;  // bbox_conf * cls_conf\n        float class_id;\n    };\n}\n\nnamespace nvinfer1\n{\n    class YoloLayerPlugin : public IPluginV2IOExt\n    {\n    public:\n        YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const std::vector<Yolo::YoloKernel>& vYoloKernel);\n        YoloLayerPlugin(const void* data, size_t length);\n        ~YoloLayerPlugin();\n\n        int getNbOutputs() const TRT_NOEXCEPT override\n        {\n            return 1;\n        }\n\n        Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n        int initialize() TRT_NOEXCEPT override;\n\n        virtual void terminate()  TRT_NOEXCEPT override {};\n\n        virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n        virtual int enqueue(int batchSize, const void*const * inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;\n\n        virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n        virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n        bool supportsFormatCombination(int pos, const 
PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {\n            return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n        }\n\n        const char* getPluginType() const TRT_NOEXCEPT override;\n\n        const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n        void destroy() TRT_NOEXCEPT override;\n\n        IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n        void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n        const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n        DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override;\n\n        bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;\n\n        bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n        void attachToContext(\n            cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n        void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT override;\n\n        using IPluginV2Ext::configurePlugin;\n\n        void detachFromContext() TRT_NOEXCEPT override;\n\n    private:\n        void forwardGpu(const float *const * inputs, float * output, cudaStream_t stream, int batchSize = 1);\n        int mThreadCount = 256;\n        const char* mPluginNamespace;\n        int mKernelCount;\n        int mClassCount;\n        int mNetWidth;\n        int mNetHeight;\n        int mMaxOutObject;\n        std::vector<Yolo::YoloKernel> mYoloKernel;\n        void** mAnchor;\n    };\n\n    class YoloPluginCreator : public IPluginCreator\n    {\n    public:\n        YoloPluginCreator();\n\n        ~YoloPluginCreator() override = default;\n\n        const char* getPluginName() const TRT_NOEXCEPT 
override;\n\n        const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n        const PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n        IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n        IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT override;\n\n        void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override\n        {\n            mNamespace = libNamespace;\n        }\n\n        const char* getPluginNamespace() const TRT_NOEXCEPT override\n        {\n            return mNamespace.c_str();\n        }\n\n    private:\n        std::string mNamespace;\n        static PluginFieldCollection mFC;\n        static std::vector<PluginField> mPluginAttributes;\n    };\n    REGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n};\n\n#endif\n"
  },
  {
    "path": "yolop/yolop.cpp",
    "content": "#include \"yolop.hpp\"\r\n\r\n\r\nint main(int argc, char** argv) {\r\n    cudaSetDevice(DEVICE);\r\n\r\n    std::string wts_name = \"\";\r\n    std::string engine_name = \"\";\r\n    std::string img_dir;\r\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir)) {\r\n        std::cerr << \"arguments not right!\" << std::endl;\r\n        std::cerr << \"./yolop -s [.wts] [.engine] // serialize model to plan file\" << std::endl;\r\n        std::cerr << \"./yolop -d [.engine] ../samples  // deserialize plan file and run inference\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // create a model using the API directly and serialize it to a stream\r\n    if (!wts_name.empty()) {\r\n        IHostMemory* modelStream{ nullptr };\r\n        APIToModel(BATCH_SIZE, &modelStream, wts_name);\r\n        assert(modelStream != nullptr);\r\n        std::ofstream p(engine_name, std::ios::binary);\r\n        if (!p) {\r\n            std::cerr << \"could not open plan output file\" << std::endl;\r\n            return -1;\r\n        }\r\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\r\n        modelStream->destroy();\r\n        return 0;\r\n    }\r\n\r\n    // deserialize the .engine and run inference\r\n    std::ifstream file(engine_name, std::ios::binary);\r\n    if (!file.good()) {\r\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\r\n        return -1;\r\n    }\r\n    char *trtModelStream = nullptr;\r\n    size_t size = 0;\r\n    file.seekg(0, file.end);\r\n    size = file.tellg();\r\n    file.seekg(0, file.beg);\r\n    trtModelStream = new char[size];\r\n    assert(trtModelStream);\r\n    file.read(trtModelStream, size);\r\n    file.close();\r\n\r\n    std::vector<std::string> file_names;\r\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\r\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // prepare input 
data ---------------------------\r\n    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\r\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\r\n    //    data[i] = 1.0;\r\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\r\n    static int seg_out[BATCH_SIZE * IMG_H * IMG_W];\r\n    static int lane_out[BATCH_SIZE * IMG_H * IMG_W];\r\n    IRuntime* runtime = createInferRuntime(gLogger);\r\n    assert(runtime != nullptr);\r\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\r\n    assert(engine != nullptr);\r\n    IExecutionContext* context = engine->createExecutionContext();\r\n    assert(context != nullptr);\r\n    delete[] trtModelStream;\r\n    assert(engine->getNbBindings() == 4);\r\n    void* buffers[4];\r\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\r\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\r\n    const int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);\r\n    const int output_det_index = engine->getBindingIndex(OUTPUT_DET_NAME);\r\n    const int output_seg_index = engine->getBindingIndex(OUTPUT_SEG_NAME);\r\n    const int output_lane_index = engine->getBindingIndex(OUTPUT_LANE_NAME);\r\n    assert(inputIndex == 0);\r\n    assert(output_det_index == 1);\r\n    assert(output_seg_index == 2);\r\n    assert(output_lane_index == 3);\r\n    // Create GPU buffers on device\r\n    CUDA_CHECK(cudaMalloc(&buffers[inputIndex], BATCH_SIZE * 3 * INPUT_H * INPUT_W * sizeof(float)));\r\n    CUDA_CHECK(cudaMalloc(&buffers[output_det_index], BATCH_SIZE * OUTPUT_SIZE * sizeof(float)));\r\n    CUDA_CHECK(cudaMalloc(&buffers[output_seg_index], BATCH_SIZE * IMG_H * IMG_W * sizeof(int)));\r\n    CUDA_CHECK(cudaMalloc(&buffers[output_lane_index], BATCH_SIZE * IMG_H * IMG_W * sizeof(int)));\r\n    // Create stream\r\n    cudaStream_t stream;\r\n    CUDA_CHECK(cudaStreamCreate(&stream));\r\n\r\n    // store seg results\r\n    cv::Mat 
tmp_seg(IMG_H, IMG_W, CV_32S, seg_out);\r\n    // store lane results\r\n    cv::Mat tmp_lane(IMG_H, IMG_W, CV_32S, lane_out);\r\n    // PrintMat(tmp_seg);\r\n    std::vector<cv::Vec3b> segColor;\r\n    segColor.push_back(cv::Vec3b(0, 0, 0));\r\n    segColor.push_back(cv::Vec3b(0, 255, 0));\r\n    segColor.push_back(cv::Vec3b(255, 0, 0));\r\n\r\n    std::vector<cv::Vec3b> laneColor;\r\n    laneColor.push_back(cv::Vec3b(0, 0, 0));\r\n    laneColor.push_back(cv::Vec3b(0, 0, 255));\r\n    laneColor.push_back(cv::Vec3b(0, 0, 0));\r\n\r\n    int fcount = 0;  // set for batch-inference\r\n    for (int f = 0; f < (int)file_names.size(); f++) {\r\n        fcount++;\r\n        if (fcount < BATCH_SIZE && f + 1 != (int)file_names.size()) continue;\r\n\r\n        // preprocess ~3ms\r\n        for (int b = 0; b < fcount; b++) {\r\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[f - fcount + 1 + b]);  // load image takes ~17ms\r\n            if (img.empty()) continue;\r\n            //cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\r\n            cv::Mat pr_img = preprocess_img(img, INPUT_W, INPUT_H); // letterbox\r\n            int i = 0;\r\n            // BGR to RGB and normalize\r\n            for (int row = 0; row < INPUT_H; ++row) {\r\n                float* uc_pixel = pr_img.ptr<float>(row);\r\n                for (int col = 0; col < INPUT_W; ++col) {\r\n                    data[b * 3 * INPUT_H * INPUT_W + i] = uc_pixel[0];\r\n                    data[b * 3 * INPUT_H * INPUT_W + i + INPUT_H * INPUT_W] = uc_pixel[1];\r\n                    data[b * 3 * INPUT_H * INPUT_W + i + 2 * INPUT_H * INPUT_W] = uc_pixel[2];\r\n                    uc_pixel += 3;\r\n                    ++i;\r\n                }\r\n            }\r\n        }\r\n\r\n        // Run inference\r\n        auto start = std::chrono::system_clock::now();\r\n        doInferenceCpu(*context, stream, buffers, data, prob, seg_out, lane_out, BATCH_SIZE);\r\n        auto end = 
std::chrono::system_clock::now();\r\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\r\n\r\n        // postprocess ~0ms\r\n        std::vector<std::vector<Yolo::Detection>> batch_res(fcount);\r\n        for (int b = 0; b < fcount; b++) {\r\n            auto& res = batch_res[b];\r\n            nms(res, &prob[b * OUTPUT_SIZE], CONF_THRESH, NMS_THRESH);\r\n        }\r\n\r\n        // show results\r\n        for (int b = 0; b < fcount; ++b) {\r\n            auto& res = batch_res[b];\r\n            //std::cout << res.size() << std::endl;\r\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[f - fcount + 1 + b]);\r\n\r\n            // handling seg and lane results\r\n            cv::Mat seg_res(img.rows, img.cols, CV_32S);\r\n            cv::resize(tmp_seg, seg_res, seg_res.size(), 0, 0, cv::INTER_NEAREST);\r\n            cv::Mat lane_res(img.rows, img.cols, CV_32S);\r\n            cv::resize(tmp_lane, lane_res, lane_res.size(), 0, 0, cv::INTER_NEAREST);\r\n            for (int row = 0; row < img.rows; ++row) {\r\n                uchar* pdata = img.data + row * img.step;\r\n                for (int col = 0; col < img.cols; ++col) {\r\n                    int seg_idx = seg_res.at<int>(row, col);\r\n                    int lane_idx = lane_res.at<int>(row, col);\r\n                    //std::cout << \"enter\" << ix << std::endl;\r\n                    for (int i = 0; i < 3; ++i) {\r\n                        if (lane_idx) {\r\n                            if (i != 2)\r\n                                pdata[i] = pdata[i] / 2 + laneColor[lane_idx][i] / 2;\r\n                        }\r\n                        else if (seg_idx)\r\n                            pdata[i] = pdata[i] / 2 + segColor[seg_idx][i] / 2;\r\n                    }\r\n                    pdata += 3;\r\n                }\r\n            }\r\n            // handling det results\r\n\r\n            for (size_t j = 0; j < 
res.size(); ++j) {\r\n                cv::Rect r = get_rect(img, res[j].bbox);\r\n                cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\r\n                cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 2);\r\n            }\r\n            cv::imwrite(\"../results/_\" + file_names[f - fcount + 1 + b], img);\r\n        }\r\n        fcount = 0;\r\n    }\r\n\r\n    // Release stream and buffers\r\n    cudaStreamDestroy(stream);\r\n    CUDA_CHECK(cudaFree(buffers[inputIndex]));\r\n    CUDA_CHECK(cudaFree(buffers[output_det_index]));\r\n    CUDA_CHECK(cudaFree(buffers[output_seg_index]));\r\n    CUDA_CHECK(cudaFree(buffers[output_lane_index]));\r\n    // Destroy the engine\r\n    context->destroy();\r\n    engine->destroy();\r\n    runtime->destroy();\r\n\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "yolop/yolop.hpp",
    "content": "#pragma once\n\n#include <chrono>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"utils.h\"\n\n#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32\n#define DEVICE 0  // GPU id\n#define NMS_THRESH 0.45\n#define CONF_THRESH 0.25\n#define BATCH_SIZE 1\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = Yolo::INPUT_H;\nstatic const int INPUT_W = Yolo::INPUT_W;\nstatic const int IMG_H = Yolo::IMG_H;\nstatic const int IMG_W = Yolo::IMG_W;\nstatic const int CLASS_NUM = Yolo::CLASS_NUM;\nstatic const int OUTPUT_SIZE = Yolo::MAX_OUTPUT_BBOX_COUNT * sizeof(Yolo::Detection) / sizeof(float) + 1;  // we assume the yololayer outputs no more than MAX_OUTPUT_BBOX_COUNT boxes that conf >= 0.1\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_DET_NAME = \"det\";\nconst char* OUTPUT_SEG_NAME = \"seg\";\nconst char* OUTPUT_LANE_NAME = \"lane\";\nstatic Logger gLogger;\n\nICudaEngine* build_engine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, std::string& wts_name) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{ 3, INPUT_H, INPUT_W });\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n\n    // yolop backbone\n    // auto focus0 = focus(network, weightMap, *shuffle->getOutput(0), 3, 32, 3, \"model.0\");\n    auto focus0 = focus(network, weightMap, *data, 3, 32, 3, \"model.0\");\n    auto conv1 = convBlock(network, weightMap, *focus0->getOutput(0), 64, 3, 2, 1, \"model.1\");\n    auto bottleneck_CSP2 = bottleneckCSP(network, weightMap, *conv1->getOutput(0), 64, 64, 1, true, 1, 0.5, \"model.2\");\n    auto conv3 = convBlock(network, weightMap, *bottleneck_CSP2->getOutput(0), 128, 3, 2, 1, 
\"model.3\");\n    auto bottleneck_csp4 = bottleneckCSP(network, weightMap, *conv3->getOutput(0), 128, 128, 3, true, 1, 0.5, \"model.4\");\n    auto conv5 = convBlock(network, weightMap, *bottleneck_csp4->getOutput(0), 256, 3, 2, 1, \"model.5\");\n    auto bottleneck_csp6 = bottleneckCSP(network, weightMap, *conv5->getOutput(0), 256, 256, 3, true, 1, 0.5, \"model.6\");\n    auto conv7 = convBlock(network, weightMap, *bottleneck_csp6->getOutput(0), 512, 3, 2, 1, \"model.7\");\n    auto spp8 = SPP(network, weightMap, *conv7->getOutput(0), 512, 512, 5, 9, 13, \"model.8\");\n\n    // yolop head\n    auto bottleneck_csp9 = bottleneckCSP(network, weightMap, *spp8->getOutput(0), 512, 512, 1, false, 1, 0.5, \"model.9\");\n    auto conv10 = convBlock(network, weightMap, *bottleneck_csp9->getOutput(0), 256, 1, 1, 1, \"model.10\");\n\n    float *deval = reinterpret_cast<float*>(malloc(sizeof(float) * 256 * 2 * 2));\n    for (int i = 0; i < 256 * 2 * 2; i++) {\n        deval[i] = 1.0;\n    }\n    Weights deconvwts11{ DataType::kFLOAT, deval, 256 * 2 * 2 };\n    IDeconvolutionLayer* deconv11 = network->addDeconvolutionNd(*conv10->getOutput(0), 256, DimsHW{ 2, 2 }, deconvwts11, emptywts);\n    deconv11->setStrideNd(DimsHW{ 2, 2 });\n    deconv11->setNbGroups(256);\n    weightMap[\"deconv11\"] = deconvwts11;\n\n    ITensor* inputTensors12[] = { deconv11->getOutput(0), bottleneck_csp6->getOutput(0) };\n    auto cat12 = network->addConcatenation(inputTensors12, 2);\n    auto bottleneck_csp13 = bottleneckCSP(network, weightMap, *cat12->getOutput(0), 512, 256, 1, false, 1, 0.5, \"model.13\");\n    auto conv14 = convBlock(network, weightMap, *bottleneck_csp13->getOutput(0), 128, 1, 1, 1, \"model.14\");\n\n    Weights deconvwts15{ DataType::kFLOAT, deval, 128 * 2 * 2 };\n    IDeconvolutionLayer* deconv15 = network->addDeconvolutionNd(*conv14->getOutput(0), 128, DimsHW{ 2, 2 }, deconvwts15, emptywts);\n    deconv15->setStrideNd(DimsHW{ 2, 2 });\n    deconv15->setNbGroups(128);\n\n    
ITensor* inputTensors16[] = { deconv15->getOutput(0), bottleneck_csp4->getOutput(0) };\n    auto cat16 = network->addConcatenation(inputTensors16, 2);\n    auto bottleneck_csp17 = bottleneckCSP(network, weightMap, *cat16->getOutput(0), 256, 128, 1, false, 1, 0.5, \"model.17\");\n    IConvolutionLayer* det0 = network->addConvolutionNd(*bottleneck_csp17->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap[\"model.24.m.0.weight\"], weightMap[\"model.24.m.0.bias\"]);\n\n    auto conv18 = convBlock(network, weightMap, *bottleneck_csp17->getOutput(0), 128, 3, 2, 1, \"model.18\");\n    ITensor* inputTensors19[] = { conv18->getOutput(0), conv14->getOutput(0) };\n    auto cat19 = network->addConcatenation(inputTensors19, 2);\n    auto bottleneck_csp20 = bottleneckCSP(network, weightMap, *cat19->getOutput(0), 256, 256, 1, false, 1, 0.5, \"model.20\");\n    IConvolutionLayer* det1 = network->addConvolutionNd(*bottleneck_csp20->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap[\"model.24.m.1.weight\"], weightMap[\"model.24.m.1.bias\"]);\n\n    auto conv21 = convBlock(network, weightMap, *bottleneck_csp20->getOutput(0), 256, 3, 2, 1, \"model.21\");\n    ITensor* inputTensors22[] = { conv21->getOutput(0), conv10->getOutput(0) };\n    auto cat22 = network->addConcatenation(inputTensors22, 2);\n    auto bottleneck_csp23 = bottleneckCSP(network, weightMap, *cat22->getOutput(0), 512, 512, 1, false, 1, 0.5, \"model.23\");\n    IConvolutionLayer* det2 = network->addConvolutionNd(*bottleneck_csp23->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{ 1, 1 }, weightMap[\"model.24.m.2.weight\"], weightMap[\"model.24.m.2.bias\"]);\n\n    auto detect24 = addYoLoLayer(network, weightMap, det0, det1, det2);\n    detect24->getOutput(0)->setName(OUTPUT_DET_NAME);\n\n    auto conv25 = convBlock(network, weightMap, *cat16->getOutput(0), 128, 3, 1, 1, \"model.25\");\n    // upsample 26\n    Weights deconvwts26{ DataType::kFLOAT, deval, 128 * 2 * 2 };\n    
IDeconvolutionLayer* deconv26 = network->addDeconvolutionNd(*conv25->getOutput(0), 128, DimsHW{ 2, 2 }, deconvwts26, emptywts);\n    deconv26->setStrideNd(DimsHW{ 2, 2 });\n    deconv26->setNbGroups(128);\n\n    auto bottleneck_csp27 = bottleneckCSP(network, weightMap, *deconv26->getOutput(0), 128, 64, 1, false, 1, 0.5, \"model.27\");\n    auto conv28 = convBlock(network, weightMap, *bottleneck_csp27->getOutput(0), 32, 3, 1, 1, \"model.28\");\n    // upsample 29\n    Weights deconvwts29{ DataType::kFLOAT, deval, 32 * 2 * 2 };\n    IDeconvolutionLayer* deconv29 = network->addDeconvolutionNd(*conv28->getOutput(0), 32, DimsHW{ 2, 2 }, deconvwts29, emptywts);\n    deconv29->setStrideNd(DimsHW{ 2, 2 });\n    deconv29->setNbGroups(32);\n\n    auto conv30 = convBlock(network, weightMap, *deconv29->getOutput(0), 16, 3, 1, 1, \"model.30\");\n    auto bottleneck_csp31 = bottleneckCSP(network, weightMap, *conv30->getOutput(0), 16, 8, 1, false, 1, 0.5, \"model.31\");\n\n    // upsample32\n    Weights deconvwts32{ DataType::kFLOAT, deval, 8 * 2 * 2 };\n    IDeconvolutionLayer* deconv32 = network->addDeconvolutionNd(*bottleneck_csp31->getOutput(0), 8, DimsHW{ 2, 2 }, deconvwts32, emptywts);\n    deconv32->setStrideNd(DimsHW{ 2, 2 });\n    deconv32->setNbGroups(8);\n\n    auto conv33 = convBlock(network, weightMap, *deconv32->getOutput(0), 2, 3, 1, 1, \"model.33\");\n    // segmentation output\n    ISliceLayer *slicelayer = network->addSlice(*conv33->getOutput(0), Dims3{ 0, (Yolo::INPUT_H - Yolo::IMG_H) / 2, 0 }, Dims3{ 2, Yolo::IMG_H, Yolo::IMG_W }, Dims3{ 1, 1, 1 });\n    auto segout = network->addTopK(*slicelayer->getOutput(0), TopKOperation::kMAX, 1, 1);\n    segout->getOutput(1)->setName(OUTPUT_SEG_NAME);\n\n    auto conv34 = convBlock(network, weightMap, *cat16->getOutput(0), 128, 3, 1, 1, \"model.34\");\n\n    // upsample35\n    Weights deconvwts35{ DataType::kFLOAT, deval, 128 * 2 * 2 };\n    IDeconvolutionLayer* deconv35 = 
network->addDeconvolutionNd(*conv34->getOutput(0), 128, DimsHW{ 2, 2 }, deconvwts35, emptywts);\n    deconv35->setStrideNd(DimsHW{ 2, 2 });\n    deconv35->setNbGroups(128);\n\n    auto bottleneck_csp36 = bottleneckCSP(network, weightMap, *deconv35->getOutput(0), 128, 64, 1, false, 1, 0.5, \"model.36\");\n    auto conv37 = convBlock(network, weightMap, *bottleneck_csp36->getOutput(0), 32, 3, 1, 1, \"model.37\");\n\n    // upsample38\n    Weights deconvwts38{ DataType::kFLOAT, deval, 32 * 2 * 2 };\n    IDeconvolutionLayer* deconv38 = network->addDeconvolutionNd(*conv37->getOutput(0), 32, DimsHW{ 2, 2 }, deconvwts38, emptywts);\n    deconv38->setStrideNd(DimsHW{ 2, 2 });\n    deconv38->setNbGroups(32);\n\n    auto conv39 = convBlock(network, weightMap, *deconv38->getOutput(0), 16, 3, 1, 1, \"model.39\");\n    auto bottleneck_csp40 = bottleneckCSP(network, weightMap, *conv39->getOutput(0), 16, 8, 1, false, 1, 0.5, \"model.40\");\n\n    // upsample41\n    Weights deconvwts41{ DataType::kFLOAT, deval, 8 * 2 * 2 };\n    IDeconvolutionLayer* deconv41 = network->addDeconvolutionNd(*bottleneck_csp40->getOutput(0), 8, DimsHW{ 2, 2 }, deconvwts41, emptywts);\n    deconv41->setStrideNd(DimsHW{ 2, 2 });\n    deconv41->setNbGroups(8);\n\n    auto conv42 = convBlock(network, weightMap, *deconv41->getOutput(0), 2, 3, 1, 1, \"model.42\");\n    // lane-det output\n    ISliceLayer *laneSlice = network->addSlice(*conv42->getOutput(0), Dims3{ 0, (Yolo::INPUT_H - Yolo::IMG_H) / 2, 0 }, Dims3{ 2, Yolo::IMG_H, Yolo::IMG_W }, Dims3{ 1, 1, 1 });\n    auto laneout = network->addTopK(*laneSlice->getOutput(0), TopKOperation::kMAX, 1, 1);\n    laneout->getOutput(1)->setName(OUTPUT_LANE_NAME);\n\n    // detection output\n    network->markOutput(*detect24->getOutput(0));\n    // segmentation output\n    network->markOutput(*segout->getOutput(1));\n    // lane output\n    network->markOutput(*laneout->getOutput(1));\n\n    // Build engine\n    
builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(2L * (1L << 30));  // 2GB\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*)(mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream, std::string& wts_name) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = build_engine(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid doInference(IExecutionContext& context, cudaStream_t& stream, void **buffers, float* det_output, int* seg_output, int* lane_output, int batchSize) {\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    // CUDA_CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(det_output, buffers[1], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    CUDA_CHECK(cudaMemcpyAsync(seg_output, buffers[2], batchSize * IMG_H * IMG_W * sizeof(int), cudaMemcpyDeviceToHost, stream));\n    
CUDA_CHECK(cudaMemcpyAsync(lane_output, buffers[3], batchSize * IMG_H * IMG_W * sizeof(int), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n}\n\nvoid doInferenceCpu(IExecutionContext& context, cudaStream_t& stream, void **buffers, float* input, float* det_output, int* seg_output, int* lane_output, int batchSize) {\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CUDA_CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(det_output, buffers[1], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    CUDA_CHECK(cudaMemcpyAsync(seg_output, buffers[2], batchSize * IMG_H * IMG_W * sizeof(int), cudaMemcpyDeviceToHost, stream));\n    CUDA_CHECK(cudaMemcpyAsync(lane_output, buffers[3], batchSize * IMG_H * IMG_W * sizeof(int), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir) {\n    if (argc < 4) return false;\n    if (std::string(argv[1]) == \"-s\" && argc == 4) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n    } else if (std::string(argv[1]) == \"-d\" && argc == 4) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n    } else {\n        return false;\n    }\n    return true;\n}\n"
  },
  {
    "path": "yolop/yolop_trt.py",
    "content": "# 2022/10/26 by ausk\n\"\"\"\nAn example that uses TensorRT's Python api to make yolop inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov5 project.\n    \"\"\"\n    tl = ( line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1)  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText( img, label,  (c1[0], c1[1] - 2), 0,  tl / 3, [225, 255, 255], thickness=tf,  lineType=cv2.LINE_AA)\n\nclass YolopTRT(object):\n    \"\"\"\n    description: Warps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        
runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('bingding: ', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        self.input_h = 384\n        self.input_w = 640\n        self.img_h = 360\n        self.img_w = 640\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n    def infer(self, raw_image_generator):\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n     
   stream = self.stream\n        context = self.context\n        engine = self.engine\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        for i in range(len(host_outputs)):\n            cuda.memcpy_dtoh_async(host_outputs[i], cuda_outputs[i], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n\n        detout = host_outputs[0]\n        segout = host_outputs[1].reshape( (self.batch_size, self.img_h,self.img_w))\n        laneout = host_outputs[2].reshape( (self.batch_size, self.img_h,self.img_w))\n\n        # Do postprocess\n        for i in 
range(self.batch_size):\n            result_boxes, result_scores, result_classid = self.post_process(\n                detout[i * 6001: (i + 1) * 6001], batch_origin_h[i], batch_origin_w[i]\n            )\n\n            # Draw rectangles and labels on the original image\n            img = batch_image_raw[i]\n            nh = img.shape[0]\n            nw = img.shape[1]\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                label=\"{}:{:.2f}\".format( categories[int(result_classid[j])], result_scores[j])\n                plot_one_box( box, img, label=label)\n\n            seg  = cv2.resize(segout[i], (nw, nh), interpolation=cv2.INTER_NEAREST)\n            lane = cv2.resize(laneout[i], (nw, nh), interpolation=cv2.INTER_NEAREST)\n            color_area = np.zeros_like(img)\n            color_area[seg==1]  = (0,255,0)\n            color_area[lane==1] = (0,0,255)\n            color_mask = np.mean(color_area, 2)\n            img[color_mask != 0] = img[color_mask != 0] * 0.5 + color_area[color_mask != 0] * 0.5\n            img = img.astype(np.uint8)\n\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate the resized width, height, and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = 
tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image with long side while maintaining ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (114, 114, 114)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        image = (image - (0.485, 0.456, 0.406)) /(0.229, 0.224, 0.225)\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            
y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        # Get the number of detected boxes\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, 6))[:num, :]\n        # Apply NMS\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], 
box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None) * \\\n                     np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)\n        # Areas of both boxes (used for the union)\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (center_x, center_y, w, h, conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Keep the boxes with score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # Clip the coordinates to the original image bounds\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w -1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w -1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h -1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h -1)\n        # Object confidence\n        confs = boxes[:, 
4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, -1] == boxes[:, -1]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\n    engine_file_path = \"build/yolop.trt\"\n\n    print(\"usage: xxx <engine file> <plugin file> <image dir>\")\n    print(\"[WARN] prepare your image_dir, such as: samples, or /home/user/jetson/tmp/YOLOP/inference/images\")\n    IMAGE_DIR =  \"/home/user/jetson/tmp/YOLOP/inference/images\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n    if len(sys.argv) > 3:\n        IMAGE_DIR = sys.argv[3]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    categories = [\"car\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n\n    # Create a YolopTRT instance\n    yolop_wrapper = YolopTRT(engine_file_path)\n\n    try:\n        print('batch size is', yolop_wrapper.batch_size)\n\n        image_dir = IMAGE_DIR\n        image_path_batches = get_img_path_batches(yolop_wrapper.batch_size, image_dir)\n\n        for i in range(1):\n            batch_image_raw, use_time = yolop_wrapper.infer(yolop_wrapper.get_raw_image_zeros())\n            print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n        for batch in image_path_batches:\n            batch_image_raw, use_time 
= yolop_wrapper.infer(yolop_wrapper.get_raw_image(batch))\n            for i, img_path in enumerate(batch):\n                parent, filename = os.path.split(img_path)\n                save_name = os.path.join('output', filename)\n                # Save image\n                cv2.imwrite(save_name, batch_image_raw[i])\n            print('input->{}, time->{:.2f}ms, saving into output/'.format(batch, use_time * 1000))\n\n    finally:\n        # destroy the instance\n        yolop_wrapper.destroy()\n\n    print(\"done!\")"
  },
  {
    "path": "yolov10/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(yolov10)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nset(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)\nenable_language(CUDA)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\ninclude_directories(${PROJECT_SOURCE_DIR}/plugin)\n\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\nif(CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n  message(\"embed_platform on\")\n  include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n  link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n  message(\"embed_platform off\")\n\n  # cuda\n  include_directories(/usr/local/cuda/include)\n  link_directories(/usr/local/cuda/lib64)\n\n  # tensorrt\n  include_directories(/workspace/shared/TensorRT-8.4.3.1/include)\n  link_directories(/workspace/shared/TensorRT-8.4.3.1/lib)\n\n  # include_directories(/home/lindsay/TensorRT-7.2.3.4/include)\n  # link_directories(/home/lindsay/TensorRT-7.2.3.4/lib)\nendif()\n\nadd_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/plugin/yololayer.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nfile(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/src/*.cpp ${PROJECT_SOURCE_DIR}/src/*.cu)\nadd_executable(yolov10_det ${PROJECT_SOURCE_DIR}/yolov10_det.cpp ${SRCS})\n\ntarget_link_libraries(yolov10_det nvinfer)\ntarget_link_libraries(yolov10_det cudart)\ntarget_link_libraries(yolov10_det myplugins)\ntarget_link_libraries(yolov10_det ${OpenCV_LIBS})\n"
  },
  {
    "path": "yolov10/README.md",
    "content": "## Introduction\r\n\r\nThe YOLOv10 model supports TensorRT-8.\r\n\r\n## Environment\r\n\r\nCUDA: 11.8\r\n\r\nCUDNN: 8.9.1.23\r\n\r\nTensorRT: TensorRT-8.2.5.1   / GPU: RTX1650\r\n\r\nTensorRT: TensorRT-8.4.3.1   / GPU: RTX4070\r\n\r\n```\r\n# faq\r\nError Code 1: Internal Error (Unsupported SM: 0x809)\r\nNewer GPU architectures are not supported by older TensorRT releases,\r\nso you need to upgrade the TensorRT version\r\n```\r\n\r\n## Support\r\n\r\n* [x] YOLOv10-det supports FP32/FP16/INT8 and Python/C++ API\r\n\r\n## Config\r\n\r\n* Choose the YOLOv10 sub-model n/s/m/b/l/x from command line arguments.\r\n* For other settings, see [include/config.h](include/config.h)\r\n\r\n## Build and Run\r\n\r\n1. Generate .wts from the PyTorch .pt file, or download .wts from the model zoo\r\n\r\n```shell\r\ngit clone https://github.com/THU-MIG/yolov10.git\r\ncd yolov10/\r\nwget https://github.com/THU-MIG/yolov10/releases/download/v1.1/yolov10n.pt\r\n\r\ngit clone https://github.com/wang-xinyu/tensorrtx.git\r\ncp [PATH-TO-TENSORRTX]/yolov10/gen_wts.py .\r\n\r\npython gen_wts.py -w yolov10n.pt -o yolov10n.wts\r\n# A file 'yolov10n.wts' will be generated.\r\n```\r\n\r\n2. Build tensorrtx/yolov10 and run\r\n\r\n#### Detection\r\n\r\n```shell\r\ncd [PATH-TO-TENSORRTX]/yolov10\r\n\r\n# add test images\r\nmkdir images\r\ncp [PATH-TO-TENSORRTX]/yolov3-spp/samples/*.jpg ./images\r\n\r\n# Update kNumClass in include/config.h if your model is trained on a custom dataset\r\nmkdir build\r\ncd build\r\ncp [PATH-TO-yolov10]/yolov10n.wts .\r\ncmake ..\r\nmake\r\n\r\n# Build and serialize TensorRT engine\r\n./yolov10_det -s yolov10n.wts yolov10n.engine [n/s/m/b/l/x]\r\n\r\n# Run inference\r\n./yolov10_det -d yolov10n.engine ../images\r\n# The results are displayed in the console\r\n```\r\n\r\n3. 
Optional: load and run the TensorRT model in Python\r\n```shell\r\n# Install python-tensorrt, pycuda, etc.\r\n# Make sure yolov10n.engine has been built\r\npython yolov10_det_trt.py ./build/yolov10n.engine ./build/libmyplugins.so\r\n```\r\n\r\n## INT8 Quantization\r\n1. Prepare calibration images; you can randomly select around 1000 images from your training set. For COCO, you can also download my calibration images `coco_calib` from [GoogleDrive](https://drive.google.com/drive/folders/1s7jE9DtOngZMzJC1uL307J2MiaGwdRSI?usp=sharing) or [BaiduPan](https://pan.baidu.com/s/1GOm_-JobpyLMAqZWCDUhKg) pwd: a9wh\r\n2. Unzip it in yolov10/build\r\n3. Define the macro `USE_INT8` in include/config.h and run make again\r\n4. Serialize the model and test\r\n\r\n## More Information\r\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx).\r\n"
  },
  {
    "path": "yolov10/gen_wts.py",
    "content": "# -*- coding: UTF-8 -*-\r\n\"\"\"\r\n  @Author: mpj\r\n  @Date  : 2024/7/22 下午9:17\r\n  @version V1.0\r\n\"\"\"\r\nimport sys  # noqa: F401\r\nimport argparse\r\nimport os\r\nimport struct\r\nimport torch\r\n\r\n\r\ndef parse_args():\r\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\r\n    parser.add_argument('-w', '--weights', default='./weights/yolov10n.pt',\r\n                        help='Input weights (.pt) file path (required)')\r\n    parser.add_argument(\r\n        '-o', '--output', help='Output (.wts) file path (optional)')\r\n    args = parser.parse_args()\r\n    if not os.path.isfile(args.weights):\r\n        raise SystemExit('Invalid input file')\r\n    if not args.output:\r\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\r\n    elif os.path.isdir(args.output):\r\n        args.output = os.path.join(\r\n            args.output,\r\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\r\n    return args.weights, args.output\r\n\r\n\r\npt_file, wts_file = parse_args()\r\n\r\n# Load model\r\nprint(f'Loading {pt_file}')\r\n\r\n# Initialize\r\ndevice = 'cpu'\r\n\r\n# Load model\r\nmodel = torch.load(pt_file, map_location=device, weights_only=False)  # Load FP32 weights\r\nmodel = model['ema' if model.get('ema') else 'model'].float()\r\n\r\nmodel.to(device).eval()\r\n\r\nwith open(wts_file, 'w') as f:\r\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\r\n    for k, v in model.state_dict().items():\r\n        vr = v.reshape(-1).cpu().numpy()\r\n        f.write('{} {} '.format(k, len(vr)))\r\n        for vv in vr:\r\n            f.write(' ')\r\n            f.write(struct.pack('>f', float(vv)).hex())\r\n        f.write('\\n')\r\nprint(f'success {wts_file}!!!')\r\n"
  },
  {
    "path": "yolov10/include/block.h",
    "content": "#pragma once\n\n#include <map>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file);\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps);\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, int k, int s, std::string lname, int g = 1);\n\nnvinfer1::IElementWiseLayer* C2F(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                 int c2, int n, bool shortcut, float e, std::string lname);\n\nnvinfer1::IElementWiseLayer* C2(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                int c2, int n, bool shortcut, float e, std::string lname);\n\nnvinfer1::IElementWiseLayer* SPPF(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int k, std::string lname);\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname);\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network, std::vector<nvinfer1::ILayer*> dets,\n                                       const int* 
px_arry, int px_arry_num);\n\nnvinfer1::ILayer* SCDown(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, int k, int s, std::string lname);\n\nnvinfer1::ILayer* PSA(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                      nvinfer1::ITensor& input, int ch, std::string lname);\n\nnvinfer1::ILayer* C2fCIB(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int c1, int c2, int n, bool shortcut, bool lk, float e,\n                         std::string lname);\n"
  },
  {
    "path": "yolov10/include/calibrator.h",
    "content": "#ifndef ENTROPY_CALIBRATOR_H\n#define ENTROPY_CALIBRATOR_H\n\n#include <NvInfer.h>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2 {\n   public:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name,\n                           const char* input_blob_name, bool read_cache = true);\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\n   private:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\n#endif  // ENTROPY_CALIBRATOR_H\n"
  },
  {
    "path": "yolov10/include/config.h",
    "content": "// #define USE_FP32\n#define USE_FP16\n// #define USE_INT8\n\nconst static char* kInputTensorName = \"images\";\nconst static char* kOutputTensorName = \"output\";\nconst static int kNumClass = 80;\nconst static int kBatchSize = 1;\nconst static int kGpuId = 0;\nconst static int kInputH = 640;\nconst static int kInputW = 640;\nconst static float kConfThresh = 0.5f;\nconst static int kMaxInputImageSize = 3000 * 3000;\nconst static int kMaxNumOutputBbox = 1000;\n// Path to the folder of calibration images used for INT8 quantization\nconst static char* kInputQuantizationFolder = \"./coco_calib\";\n"
  },
  {
    "path": "yolov10/include/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n#include <cassert>   // assert, used by CUDA_CHECK\n#include <iostream>  // std::cerr, used by CUDA_CHECK\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n"
  },
  {
    "path": "yolov10/include/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n 
   virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov10/include/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include \"NvInfer.h\"\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "yolov10/include/model.h",
    "content": "#pragma once\n\n#include <assert.h>\n#include <string>\n#include \"NvInfer.h\"\n\nnvinfer1::IHostMemory* buildEngineYolov10DetN(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov10DetS(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov10DetM(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov10DetBL(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                               nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                               int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov10DetX(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels);\n"
  },
  {
    "path": "yolov10/include/postprocess.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]);\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid batch_topk(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n                float conf_thresh, int topk = 300);\n"
  },
  {
    "path": "yolov10/include/preprocess.h",
    "content": "#pragma once\n\n#include <map>\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\nvoid cuda_preprocess_init(int max_image_size);\n\nvoid cuda_preprocess_destroy();\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream);\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream);\n"
  },
  {
    "path": "yolov10/include/types.h",
    "content": "#pragma once\n#include \"config.h\"\n\nstruct alignas(float) Detection {\n    //center_x center_y w h\n    float bbox[4];\n    float conf;  // bbox_conf * cls_conf\n    float class_id;\n};\n\nstruct AffineMatrix {\n    float value[6];\n};\n\nconst int bbox_element =\n        sizeof(Detection) / sizeof(float) + 1;  // left, top, right, bottom, confidence, class, keepflag\n"
  },
  {
    "path": "yolov10/include/utils.h",
    "content": "#pragma once\n#include <dirent.h>\n#include <fstream>\n#include <opencv2/opencv.hpp>\n\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols * 1.0);\n    float r_h = input_h / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR* p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            //            std::cout << \"Found file: \" << cur_file_name << std::endl;\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n// Function to trim leading and trailing whitespace from a string\nstatic inline std::string trim_leading_whitespace(const std::string& str) {\n    size_t first = str.find_first_not_of(' ');\n    if (std::string::npos == first) {\n        return str;\n    }\n    size_t last = str.find_last_not_of(' ');\n    return str.substr(first, (last - first + 1));\n}\n\n// Src: https://stackoverflow.com/questions/16605967\nstatic inline std::string 
to_string_with_precision(const float a_value, const int n = 2) {\n    std::ostringstream out;\n    out.precision(n);\n    out << std::fixed << a_value;\n    return out.str();\n}\n\nstatic inline int read_labels(const std::string& labels_filename, std::unordered_map<int, std::string>& labels_map) {\n    std::ifstream file(labels_filename);\n    if (!file.is_open()) {\n        // Signal failure the same way read_files_in_dir() does\n        return -1;\n    }\n    // Read each line of the file\n    std::string line;\n    int index = 0;\n    while (std::getline(file, line)) {\n        // Strip leading and trailing spaces from the line\n        line = trim_leading_whitespace(line);\n\n        // Add the stripped line to the labels_map, using the zero-based line index as the key\n        labels_map[index] = line;\n        index++;\n    }\n    // Close the file\n    file.close();\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov10/plugin/yololayer.cu",
    "content": "#include <assert.h>\n#include <math.h>\n#include <iostream>\n#include <vector>\n#include \"cuda_utils.h\"\n#include \"types.h\"\n#include \"yololayer.h\"\n\nnamespace Tn {\ntemplate <typename T>\nvoid write(char*& buffer, const T& val) {\n    *reinterpret_cast<T*>(buffer) = val;\n    buffer += sizeof(T);\n}\n\ntemplate <typename T>\nvoid read(const char*& buffer, T& val) {\n    val = *reinterpret_cast<const T*>(buffer);\n    buffer += sizeof(T);\n}\n}  // namespace Tn\n\n__device__ float sigmoid(float x) {\n    return 1.0f / (1.0f + exp(-x));\n}\n\nnamespace nvinfer1 {\nYoloLayerPlugin::YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const int* strides,\n                                 int stridesLength) {\n\n    mClassCount = classCount;\n    mYoloV10NetWidth = netWidth;\n    mYoloV10netHeight = netHeight;\n    mMaxOutObject = maxOut;\n    mStridesLength = stridesLength;\n    mStrides = new int[stridesLength];\n    memcpy(mStrides, strides, stridesLength * sizeof(int));\n}\n\nYoloLayerPlugin::~YoloLayerPlugin() {\n    if (mStrides != nullptr) {\n        delete[] mStrides;\n        mStrides = nullptr;\n    }\n}\n\nYoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length) {\n    using namespace Tn;\n    const char *d = reinterpret_cast<const char*>(data), *a = d;\n    read(d, mClassCount);\n    read(d, mThreadCount);\n    read(d, mYoloV10NetWidth);\n    read(d, mYoloV10netHeight);\n    read(d, mMaxOutObject);\n    read(d, mStridesLength);\n    mStrides = new int[mStridesLength];\n    for (int i = 0; i < mStridesLength; ++i) {\n        read(d, mStrides[i]);\n    }\n\n    assert(d == a + length);\n}\n\nvoid YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT {\n\n    using namespace Tn;\n    char *d = static_cast<char*>(buffer), *a = d;\n    write(d, mClassCount);\n    write(d, mThreadCount);\n    write(d, mYoloV10NetWidth);\n    write(d, mYoloV10netHeight);\n    write(d, mMaxOutObject);\n    write(d, 
mStridesLength);\n    for (int i = 0; i < mStridesLength; ++i) {\n        write(d, mStrides[i]);\n    }\n\n    assert(d == a + getSerializationSize());\n}\n\nsize_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT {\n    return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mYoloV10netHeight) + sizeof(mYoloV10NetWidth) +\n           sizeof(mMaxOutObject) + sizeof(mStridesLength) + sizeof(int) * mStridesLength;\n}\n\nint YoloLayerPlugin::initialize() TRT_NOEXCEPT {\n    return 0;\n}\n\nnvinfer1::Dims YoloLayerPlugin::getOutputDimensions(int index, const nvinfer1::Dims* inputs,\n                                                    int nbInputDims) TRT_NOEXCEPT {\n    int total_size = mMaxOutObject * sizeof(Detection) / sizeof(float);\n    return nvinfer1::Dims3(total_size + 1, 1, 1);\n}\n\nvoid YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT {\n    mPluginNamespace = pluginNamespace;\n}\n\nconst char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT {\n    return mPluginNamespace;\n}\n\nnvinfer1::DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes,\n                                                      int nbInputs) const TRT_NOEXCEPT {\n    return nvinfer1::DataType::kFLOAT;\n}\n\nbool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                                   int nbInputs) const TRT_NOEXCEPT {\n    return false;\n}\n\nbool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT {\n    return false;\n}\n\nvoid YoloLayerPlugin::configurePlugin(nvinfer1::PluginTensorDesc const* in, int nbInput,\n                                      nvinfer1::PluginTensorDesc const* out, int nbOutput) TRT_NOEXCEPT{};\n\nvoid YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                                      IGpuAllocator* gpuAllocator) 
TRT_NOEXCEPT{};\n\nvoid YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\nconst char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT {\n\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nvoid YoloLayerPlugin::destroy() TRT_NOEXCEPT {\n    delete this;\n}\n\nnvinfer1::IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT {\n\n    YoloLayerPlugin* p = new YoloLayerPlugin(mClassCount, mYoloV10NetWidth, mYoloV10netHeight, mMaxOutObject, mStrides,\n                                             mStridesLength);\n    p->setPluginNamespace(mPluginNamespace);\n    return p;\n}\n\nint YoloLayerPlugin::enqueue(int batchSize, const void* TRT_CONST_ENQUEUE* inputs, void* const* outputs,\n                             void* workspace, cudaStream_t stream) TRT_NOEXCEPT {\n    forwardGpu((const float* const*)inputs, (float*)outputs[0], stream, mYoloV10netHeight, mYoloV10NetWidth, batchSize);\n    return 0;\n}\n\n__device__ float Logist(float data) {\n    return 1.0f / (1.0f + expf(-data));\n};\n\n__global__ void CalDetection(const float* input, float* output, int numElements, int maxoutobject, const int grid_h,\n                             int grid_w, const int stride, int classes, int outputElem) {\n    int idx = threadIdx.x + blockDim.x * blockIdx.x;\n    if (idx >= numElements)\n        return;\n\n    int total_grid = grid_h * grid_w;\n    int info_len = 4 + classes;\n    int batchIdx = idx / total_grid;\n    int elemIdx = idx % total_grid;\n    const float* curInput = input + batchIdx * total_grid * info_len;\n    int outputIdx = batchIdx * outputElem;\n\n    int class_id = 0;\n    float max_cls_prob = 0.0;\n    for (int i = 4; i < 4 + classes; i++) {\n        float p = Logist(curInput[elemIdx + i * total_grid]);\n        if (p > max_cls_prob) {\n            max_cls_prob = p;\n            class_id = i - 4;\n        }\n    }\n\n    if (max_cls_prob < 0.1)\n        return;\n\n    int 
count = (int)atomicAdd(output + outputIdx, 1);\n    if (count >= maxoutobject)\n        return;\n    char* data = (char*)(output + outputIdx) + sizeof(float) + count * sizeof(Detection);\n    Detection* det = (Detection*)(data);\n\n    int row = elemIdx / grid_w;\n    int col = elemIdx % grid_w;\n\n    det->conf = max_cls_prob;\n    det->class_id = class_id;\n    det->bbox[0] = (col + 0.5f - curInput[elemIdx + 0 * total_grid]) * stride;\n    det->bbox[1] = (row + 0.5f - curInput[elemIdx + 1 * total_grid]) * stride;\n    det->bbox[2] = (col + 0.5f + curInput[elemIdx + 2 * total_grid]) * stride;\n    det->bbox[3] = (row + 0.5f + curInput[elemIdx + 3 * total_grid]) * stride;\n}\n\nvoid YoloLayerPlugin::forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV10netHeight,\n                                 int mYoloV10NetWidth, int batchSize) {\n    int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n    cudaMemsetAsync(output, 0, sizeof(float), stream);\n    for (int idx = 0; idx < batchSize; ++idx) {\n        CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n    }\n    int numElem = 0;\n\n    //    const int maxGrids = mStridesLength;\n    //    int grids[maxGrids][2];\n    //    for (int i = 0; i < maxGrids; ++i) {\n    //        grids[i][0] = mYoloV10netHeight / mStrides[i];\n    //        grids[i][1] = mYoloV10NetWidth / mStrides[i];\n    //    }\n\n    int maxGrids = mStridesLength;\n    int flatGridsLen = 2 * maxGrids;\n    int* flatGrids = new int[flatGridsLen];\n\n    for (int i = 0; i < maxGrids; ++i) {\n        flatGrids[2 * i] = mYoloV10netHeight / mStrides[i];\n        flatGrids[2 * i + 1] = mYoloV10NetWidth / mStrides[i];\n    }\n\n    for (unsigned int i = 0; i < maxGrids; i++) {\n        // Access the elements of the original 2D array from the flattened 1D array\n        int grid_h = flatGrids[2 * i];      // Corresponds to the access of grids[i][0]\n        int grid_w = 
flatGrids[2 * i + 1];  // Corresponds to the access of grids[i][1]\n        int stride = mStrides[i];\n        numElem = grid_h * grid_w * batchSize;  // Total number of grid cells across the batch\n        if (numElem <= 0)                       // Nothing to launch for an empty grid\n            continue;\n        // Clamp the block size locally so that the member mThreadCount is not permanently shrunk\n        int threadCount = numElem < mThreadCount ? numElem : mThreadCount;\n\n        CalDetection<<<(numElem + threadCount - 1) / threadCount, threadCount, 0, stream>>>(\n                inputs[i], output, numElem, mMaxOutObject, grid_h, grid_w, stride, mClassCount, outputElem);\n    }\n\n    delete[] flatGrids;\n}\n\nPluginFieldCollection YoloPluginCreator::mFC{};\nstd::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\nYoloPluginCreator::YoloPluginCreator() {\n    mPluginAttributes.clear();\n    mFC.nbFields = mPluginAttributes.size();\n    mFC.fields = mPluginAttributes.data();\n}\n\nconst char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT {\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nconst PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT {\n    return &mFC;\n}\n\nIPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT {\n    assert(fc->nbFields == 1);\n    assert(strcmp(fc->fields[0].name, \"combinedInfo\") == 0);\n    const int* combinedInfo = static_cast<const int*>(fc->fields[0].data);\n    int netinfo_count = 4;\n    int class_count = combinedInfo[0];\n    int input_w = combinedInfo[1];\n    int input_h = combinedInfo[2];\n    int max_output_object_count = combinedInfo[3];\n    const int* px_arry = combinedInfo + netinfo_count;\n    int px_arry_length = fc->fields[0].length - netinfo_count;\n    YoloLayerPlugin* obj =\n            new YoloLayerPlugin(class_count, input_w, input_h, max_output_object_count, px_arry, px_arry_length);\n    obj->setPluginNamespace(mNamespace.c_str());\n    
return obj;\n}\n\nIPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData,\n                                                     size_t serialLength) TRT_NOEXCEPT {\n    // This object will be deleted when the network is destroyed, which will\n    // call YoloLayerPlugin::destroy()\n    YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolov10/plugin/yololayer.h",
    "content": "#pragma once\n\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\nnamespace nvinfer1 {\nclass API YoloLayerPlugin : public IPluginV2IOExt {\n   public:\n    YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const int* strides, int stridesLength);\n\n    YoloLayerPlugin(const void* data, size_t length);\n\n    ~YoloLayerPlugin();\n\n    int getNbOutputs() const TRT_NOEXCEPT override { return 1; }\n\n    nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n    int initialize() TRT_NOEXCEPT override;\n\n    virtual void terminate() TRT_NOEXCEPT override {}\n\n    virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n    virtual int enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace,\n                        cudaStream_t stream) TRT_NOEXCEPT override;\n\n    virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n    virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n    bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs,\n                                   int nbOutputs) const TRT_NOEXCEPT override {\n        return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n    }\n\n    const char* getPluginType() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    void destroy() TRT_NOEXCEPT override;\n\n    IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n    nvinfer1::DataType getOutputDataType(int32_t index, nvinfer1::DataType const* inputTypes,\n                                         int32_t nbInputs) const TRT_NOEXCEPT;\n\n    bool 
isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                      int nbInputs) const TRT_NOEXCEPT override;\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n    void attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                         IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n    void configurePlugin(PluginTensorDesc const* in, int32_t nbInput, PluginTensorDesc const* out,\n                         int32_t nbOutput) TRT_NOEXCEPT override;\n\n    void detachFromContext() TRT_NOEXCEPT override;\n\n   private:\n    void forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV10netHeight,\n                    int mYoloV10NetWidth, int batchSize);\n\n    int mThreadCount = 256;\n    const char* mPluginNamespace;\n    int mClassCount;\n    int mYoloV10NetWidth;\n    int mYoloV10netHeight;\n    int mMaxOutObject;\n    int* mStrides;\n    int mStridesLength;\n};\n\nclass API YoloPluginCreator : public IPluginCreator {\n   public:\n    YoloPluginCreator();\n\n    ~YoloPluginCreator() override = default;\n\n    const char* getPluginName() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    const nvinfer1::PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* createPlugin(const char* name,\n                                           const nvinfer1::PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData,\n                                                size_t serialLength) TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override { mNamespace = libNamespace; }\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override { return mNamespace.c_str(); }\n\n   private:\n    std::string 
mNamespace;\n    static PluginFieldCollection mFC;\n    static std::vector<PluginField> mPluginAttributes;\n};\n\nREGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolov10/src/block.cpp",
    "content": "#include \"block.h\"\n#include <assert.h>\n#include <math.h>\n#include <fstream>\n#include <iostream>\n#include \"config.h\"\n#include \"yololayer.h\"\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, nvinfer1::Weights> WeightMap;\n\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = nvinfer1::DataType::kFLOAT;\n\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; x++) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        wt.count = size;\n        WeightMap[name] = wt;\n    }\n    return WeightMap;\n}\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights scale{nvinfer1::DataType::kFLOAT, scval, 
len};\n\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, shval, len};\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, pval, len};\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    nvinfer1::IScaleLayer* output = network->addScale(input, nvinfer1::ScaleMode::kCHANNEL, shift, scale, power);\n    assert(output);\n    return output;\n}\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, int k, int s, std::string lname, int g) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    int p = k / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n    conv->setNbGroups(g);\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nnvinfer1::ILayer* convBn(nvinfer1::INetworkDefinition* network, std::map<std::string, 
nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, int k, int s, std::string lname, int g = 1) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    int p = k / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n    conv->setNbGroups(g);\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n    return bn;\n}\n\nnvinfer1::ILayer* bottleneck(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int c1, int c2, bool shortcut, float e, std::string lname) {\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c2, 3, 1, lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* conv2 = convBnSiLU(network, weightMap, *conv1->getOutput(0), c2, 3, 1, lname + \".cv2\");\n\n    if (shortcut && c1 == c2) {\n        nvinfer1::IElementWiseLayer* ew =\n                network->addElementWise(input, *conv2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        return ew;\n    }\n    return conv2;\n}\n\nnvinfer1::IElementWiseLayer* C2F(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                 int c2, int n, bool shortcut, float e, std::string lname) {\n    int c_ = (float)c2 * e;\n\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, 2 * c_, 1, 1, lname + \".cv1\");\n    nvinfer1::Dims d = conv1->getOutput(0)->getDimensions();\n\n    nvinfer1::ISliceLayer* split1 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n 
                             nvinfer1::Dims4{d.d[0], d.d[1] / 2, d.d[2], d.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* split2 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, d.d[1] / 2, 0, 0},\n                              nvinfer1::Dims4{d.d[0], d.d[1] / 2, d.d[2], d.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ITensor* inputTensor0[] = {split1->getOutput(0), split2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor0, 2);\n    nvinfer1::ITensor* y1 = split2->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        auto* b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, 1.0, lname + \".m.\" + std::to_string(i));\n        y1 = b->getOutput(0);\n\n        nvinfer1::ITensor* inputTensors[] = {cat->getOutput(0), b->getOutput(0)};\n        cat = network->addConcatenation(inputTensors, 2);\n    }\n\n    nvinfer1::IElementWiseLayer* conv2 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, lname + \".cv2\");\n\n    return conv2;\n}\n\nnvinfer1::IElementWiseLayer* C2(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                int c2, int n, bool shortcut, float e, std::string lname) {\n    assert(network != nullptr);\n    int hidden_channels = static_cast<int>(c2 * e);\n\n    // cv1 branch\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, input, 2 * hidden_channels, 1, 1, lname + \".cv1\");\n    nvinfer1::ITensor* cv1_out = conv1->getOutput(0);\n\n    // Split the output of cv1 into two tensors\n    nvinfer1::Dims dims = cv1_out->getDimensions();\n    nvinfer1::ISliceLayer* split1 = network->addSlice(*cv1_out, nvinfer1::Dims4{0, 0, 0, 0},\n                                                      nvinfer1::Dims4{dims.d[0], dims.d[1] / 2, dims.d[2], dims.d[3]},\n                                  
                    nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* split2 = network->addSlice(*cv1_out, nvinfer1::Dims4{0, dims.d[1] / 2, 0, 0},\n                                                      nvinfer1::Dims4{dims.d[0], dims.d[1] / 2, dims.d[2], dims.d[3]},\n                                                      nvinfer1::Dims4{1, 1, 1, 1});\n\n    // Create y1 bottleneck sequence\n    nvinfer1::ITensor* y1 = split1->getOutput(0);\n    for (int i = 0; i < n; ++i) {\n        auto* bottleneck_layer = bottleneck(network, weightMap, *y1, hidden_channels, hidden_channels, shortcut, 1.0,\n                                            lname + \".m.\" + std::to_string(i));\n        y1 = bottleneck_layer->getOutput(0);  // update 'y1' to be the output of the current bottleneck\n    }\n\n    // Concatenate y1 with the second split of cv1\n    nvinfer1::ITensor* concatInputs[2] = {y1, split2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(concatInputs, 2);\n\n    // cv2 to produce the final output\n    nvinfer1::IElementWiseLayer* conv2 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, lname + \".cv2\");\n\n    return conv2;\n}\n\nnvinfer1::IElementWiseLayer* SPPF(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int k, std::string lname) {\n    int c_ = c1 / 2;\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c_, 1, 1, lname + \".cv1\");\n    nvinfer1::IPoolingLayer* pool1 =\n            network->addPoolingNd(*conv1->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool1->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool1->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::IPoolingLayer* pool2 =\n            network->addPoolingNd(*pool1->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, 
k});\n    pool2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool2->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::IPoolingLayer* pool3 =\n            network->addPoolingNd(*pool2->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool3->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool3->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::ITensor* inputTensors[] = {conv1->getOutput(0), pool1->getOutput(0), pool2->getOutput(0),\n                                         pool3->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensors, 4);\n    nvinfer1::IElementWiseLayer* conv2 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, lname + \".cv2\");\n    return conv2;\n}\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname) {\n\n    nvinfer1::IShuffleLayer* shuffle1 = network->addShuffle(input);\n    shuffle1->setReshapeDimensions(nvinfer1::Dims4{kBatchSize, 4, 16, grid});\n    shuffle1->setSecondTranspose(nvinfer1::Permutation{0, 2, 1, 3});\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*shuffle1->getOutput(0));\n    softmax->setAxes(1 << 1);\n\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(*softmax->getOutput(0), 1, nvinfer1::DimsHW{1, 1}, weightMap[lname], bias_empty);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n\n    nvinfer1::IShuffleLayer* shuffle2 = network->addShuffle(*conv->getOutput(0));\n    shuffle2->setReshapeDimensions(nvinfer1::Dims3{kBatchSize, 4, grid});\n\n    return shuffle2;\n}\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network, std::vector<nvinfer1::ILayer*> dets,\n               
                        const int* px_arry, int px_arry_num) {\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    const int netinfo_count = 4;  // The first 4 elements hold the netinfo values.\n    const int total_count = netinfo_count + px_arry_num;  // Total number of elements for netinfo and px_arry combined.\n\n    std::vector<int> combinedInfo(total_count);\n    // Fill in the first 4 elements with the netinfo values.\n    combinedInfo[0] = kNumClass;\n    combinedInfo[1] = kInputW;\n    combinedInfo[2] = kInputH;\n    combinedInfo[3] = kMaxNumOutputBbox;\n\n    // Copy the contents of px_arry into the combinedInfo vector after the initial 4 elements.\n    std::copy(px_arry, px_arry + px_arry_num, combinedInfo.begin() + netinfo_count);\n\n    // Now let's create the PluginField object to hold this combined information.\n    nvinfer1::PluginField pluginField;\n    pluginField.name = \"combinedInfo\";  // This can be any name that the plugin will recognize\n    pluginField.data = combinedInfo.data();\n    pluginField.type = nvinfer1::PluginFieldType::kINT32;\n    pluginField.length = combinedInfo.size();\n\n    // Create the PluginFieldCollection to hold the PluginField object.\n    nvinfer1::PluginFieldCollection pluginFieldCollection{};\n    pluginFieldCollection.nbFields = 1;  // We have just one field, but it's a combined array\n    pluginFieldCollection.fields = &pluginField;\n\n    // Create the plugin object using the PluginFieldCollection.\n    nvinfer1::IPluginV2* pluginObject = creator->createPlugin(\"yololayer\", &pluginFieldCollection);\n\n    // We assume that the plugin is to be added onto the network.\n    // Prepare input tensors for the YOLO Layer.\n    std::vector<nvinfer1::ITensor*> inputTensors;\n    for (auto det : dets) {\n        inputTensors.push_back(det->getOutput(0));  // Assuming each detection layer has one output tensor.\n    }\n\n    // Add the plugin to the network 
using the prepared input tensors.\n    nvinfer1::IPluginV2Layer* yoloLayer = network->addPluginV2(inputTensors.data(), inputTensors.size(), *pluginObject);\n\n    return yoloLayer;  // Return the added YOLO layer.\n}\n\nnvinfer1::ILayer* SCDown(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, int k, int s, std::string lname) {\n    auto* conv1 = convBnSiLU(network, weightMap, input, ch, 1, 1, lname + \".cv1\");\n\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv2 = network->addConvolutionNd(*conv1->getOutput(0), ch, nvinfer1::DimsHW{k, k},\n                                                                   weightMap[lname + \".cv2.conv.weight\"], bias_empty);\n    assert(conv2);\n    conv2->setStrideNd(nvinfer1::DimsHW{s, s});\n    int p = k / 2;\n    conv2->setPaddingNd(nvinfer1::DimsHW{p, p});\n    conv2->setNbGroups(ch);\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".cv2.bn\", 1e-3);\n    assert(bn);\n    return bn;\n}\n\nnvinfer1::ILayer* Attention(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                            nvinfer1::ITensor& input, int dim, int num_heads, float attn_ratio, std::string lname) {\n    int head_dim = dim / num_heads;\n    int key_dim = head_dim * attn_ratio;\n    float scale = pow(key_dim, -0.5);\n    int nh_kd = key_dim * num_heads;\n    int h = dim + nh_kd * 2;\n\n    auto d = input.getDimensions();\n    int B = d.d[0];\n    int H = d.d[2];\n    int W = d.d[3];\n    int N = H * W;\n    auto* qkv = convBn(network, weightMap, input, h, 1, 1, lname + \".qkv\");\n    // qkv.view(B, self.num_heads, -1, N)\n    auto shuffle = network->addShuffle(*qkv->getOutput(0));\n    shuffle->setReshapeDimensions(nvinfer1::Dims4{B, num_heads, -1, N});\n    // q, k, v = 
.split([self.key_dim, self.key_dim, self.head_dim], dim=2)\n    auto d1 = shuffle->getOutput(0)->getDimensions();\n    auto q = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], key_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    auto k = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, key_dim, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], key_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    auto v = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, key_dim * 2, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], head_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    // attn = ((q.transpose(-2, -1) @ k) * self.scale)\n    auto qT = network->addShuffle(*q->getOutput(0));\n    qT->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n    auto matmul = network->addMatrixMultiply(*qT->getOutput(0), nvinfer1::MatrixOperation::kNONE, *k->getOutput(0),\n                                             nvinfer1::MatrixOperation::kNONE);\n    // NOTE: the three one-element buffers below are never freed (a small leak). They must stay valid until\n    // the engine has been built. TODO: track them and release them after the build.\n    float* scale_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    scale_val[0] = scale;\n    nvinfer1::Weights s_w{nvinfer1::DataType::kFLOAT, scale_val, 1};\n    float* shift_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    shift_val[0] = 0;\n    nvinfer1::Weights sh_w{nvinfer1::DataType::kFLOAT, shift_val, 1};\n    float* power_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    power_val[0] = 1;\n    nvinfer1::Weights p_w{nvinfer1::DataType::kFLOAT, power_val, 1};\n    nvinfer1::IScaleLayer* scaleLayer =\n            network->addScale(*matmul->getOutput(0), nvinfer1::ScaleMode::kUNIFORM, sh_w, s_w, p_w);\n    // attn = attn.softmax(dim=-1)\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*scaleLayer->getOutput(0));\n    softmax->setAxes(1 << 3);\n    
// x = (v @ attn.transpose(-2, -1)).view(B, -1, H, W) + self.pe(v.reshape(B, -1, H, W))\n    auto attnT = network->addShuffle(*softmax->getOutput(0));\n    attnT->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n    auto matmul2 = network->addMatrixMultiply(*v->getOutput(0), nvinfer1::MatrixOperation::kNONE, *attnT->getOutput(0),\n                                              nvinfer1::MatrixOperation::kNONE);\n    auto reshape = network->addShuffle(*matmul2->getOutput(0));\n    reshape->setReshapeDimensions(nvinfer1::Dims4{B, -1, H, W});\n    auto v_reshape = network->addShuffle(*v->getOutput(0));\n    v_reshape->setReshapeDimensions(nvinfer1::Dims4{B, -1, H, W});\n    // self.pe = Conv(dim, dim, 3, 1, g=dim, act=False)\n    auto pe = convBn(network, weightMap, *v_reshape->getOutput(0), dim, 3, 1, lname + \".pe\", dim);\n    auto sum = network->addElementWise(*reshape->getOutput(0), *pe->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    // x = self.proj(x)\n    // self.proj = Conv(dim, dim, 1, act=False)\n    auto proj = convBn(network, weightMap, *sum->getOutput(0), dim, 1, 1, lname + \".proj\");\n    return proj;\n}\n\nnvinfer1::ILayer* PSA(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                      nvinfer1::ITensor& input, int ch, std::string lname) {\n    int c = int(ch * 0.5);\n    auto conv1 = convBnSiLU(network, weightMap, input, c * 2, 1, 1, lname + \".cv1\");\n    // a, b = split((self.c, self.c), dim=1)\n    auto d1 = conv1->getOutput(0)->getDimensions();\n    auto a = network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n                               nvinfer1::Dims4{d1.d[0], c, d1.d[2], d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    auto b = network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, c, 0, 0},\n                               nvinfer1::Dims4{d1.d[0], c, d1.d[2], d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    // b = b + self.attn(b)\n    auto attn = 
Attention(network, weightMap, *b->getOutput(0), c, c / 64, 0.5f, lname + \".attn\");\n    auto sum = network->addElementWise(*b->getOutput(0), *attn->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    // b = b + self.ffn(b)\n    // self.ffn = nn.Sequential(\n    //\t\t\tConv(self.c, self.c * 2, 1),\n    //\t\t\tConv(self.c * 2, self.c, 1, act=False)\n    //\t\t)\n    auto ffn1 = convBnSiLU(network, weightMap, *sum->getOutput(0), c * 2, 1, 1, lname + \".ffn.0\");\n    auto ffn2 = convBn(network, weightMap, *ffn1->getOutput(0), c, 1, 1, lname + \".ffn.1\");\n    auto sum2 = network->addElementWise(*sum->getOutput(0), *ffn2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    // self.cv2(torch.cat((a, b), 1))\n    nvinfer1::ITensor* inputTensors[] = {a->getOutput(0), sum2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensors, 2);\n    auto conv2 = convBnSiLU(network, weightMap, *cat->getOutput(0), ch, 1, 1, lname + \".cv2\");\n    return conv2;\n}\n\nnvinfer1::ILayer* RepVGGDW(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                           nvinfer1::ITensor& input, int ch, std::string lname) {\n    // self.conv = Conv(ed, ed, 7, 1, 3, g=ed, act=False)\n    // self.conv1 = Conv(ed, ed, 3, 1, 1, g=ed, act=False)\n    // self.dim = ed\n    // self.act = nn.SiLU()\n    // return self.act(self.conv(x) + self.conv1(x))\n    auto conv = convBn(network, weightMap, input, ch, 7, 1, lname + \".conv\", ch);\n    auto conv1 = convBn(network, weightMap, input, ch, 3, 1, lname + \".conv1\", ch);\n    auto ew = network->addElementWise(*conv->getOutput(0), *conv1->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    auto sigmoid = network->addActivation(*ew->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    auto ew_silu =\n            network->addElementWise(*ew->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew_silu);\n  
  return ew_silu;\n}\n\nnvinfer1::ILayer* CIB(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                      nvinfer1::ITensor& input, int c1, int c2, bool shortcut, float e, bool lk, std::string lname) {\n    // self.cv1 = nn.Sequential(\n    //\t\t\tConv(c1, c1, 3, g=c1),\n    //\t\t\tConv(c1, 2 * c_, 1),\n    //\t\t\tConv(2 * c_, 2 * c_, 3, g=2 * c_) if not lk else RepVGGDW(2 * c_),\n    //\t\t\tConv(2 * c_, c2, 1),\n    //\t\t\tConv(c2, c2, 3, g=c2),\n    //\t\t)\n    int c_ = (float)c2 * e;\n    auto* conv1 = convBnSiLU(network, weightMap, input, c1, 3, 1, lname + \".cv1.0\", c1);\n    auto* conv2 = convBnSiLU(network, weightMap, *conv1->getOutput(0), 2 * c_, 1, 1, lname + \".cv1.1\");\n    nvinfer1::ILayer* conv3;\n    if (!lk) {\n        conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0), 2 * c_, 3, 1, lname + \".cv1.2\", 2 * c_);\n    } else {\n        conv3 = RepVGGDW(network, weightMap, *conv2->getOutput(0), 2 * c_, lname + \".cv1.2\");\n    }\n    auto* conv4 = convBnSiLU(network, weightMap, *conv3->getOutput(0), c2, 1, 1, lname + \".cv1.3\");\n    auto* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0), c2, 3, 1, lname + \".cv1.4\", c2);\n    if (shortcut && c1 == c2) {\n        auto* ew = network->addElementWise(input, *conv5->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        return ew;\n    } else {\n        return conv5;\n    }\n}\n\nnvinfer1::ILayer* C2fCIB(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int c1, int c2, int n, bool shortcut, bool lk, float e,\n                         std::string lname) {\n    int c_ = (float)c2 * e;\n\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, 2 * c_, 1, 1, lname + \".cv1\");\n    nvinfer1::Dims d = conv1->getOutput(0)->getDimensions();\n\n    nvinfer1::ISliceLayer* split1 =\n            
network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n                              nvinfer1::Dims4{d.d[0], d.d[1] / 2, d.d[2], d.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* split2 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, d.d[1] / 2, 0, 0},\n                              nvinfer1::Dims4{d.d[0], d.d[1] / 2, d.d[2], d.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ITensor* inputTensor0[] = {split1->getOutput(0), split2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor0, 2);\n    nvinfer1::ITensor* y1 = split2->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        auto* b = CIB(network, weightMap, *y1, c_, c_, shortcut, 1.0, lk, lname + \".m.\" + std::to_string(i));\n        y1 = b->getOutput(0);\n\n        nvinfer1::ITensor* inputTensors[] = {cat->getOutput(0), b->getOutput(0)};\n        cat = network->addConcatenation(inputTensors, 2);\n    }\n\n    nvinfer1::IElementWiseLayer* conv2 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, lname + \".cv2\");\n\n    return conv2;\n}\n"
  },
  {
    "path": "yolov10/src/calibrator.cpp",
    "content": "#include \"calibrator.h\"\n#include <cassert>\n#include <cstring>\n#include <fstream>\n#include <iostream>\n#include <iterator>\n#include <opencv2/dnn/dnn.hpp>\n#include \"cuda_utils.h\"\n#include \"utils.h\"\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir,\n                                               const char* calib_table_name, const char* input_blob_name,\n                                               bool read_cache)\n    : batchsize_(batchsize),\n      input_w_(input_w),\n      input_h_(input_h),\n      img_idx_(0),\n      img_dir_(img_dir),\n      calib_table_name_(calib_table_name),\n      input_blob_name_(input_blob_name),\n      read_cache_(read_cache) {\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2() {\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT {\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT {\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + \"/\" + img_files_[i]);\n        if (temp.empty()) {\n            std::cerr << \"Fatal error: cannot open image!\" << std::endl;\n            return false;\n        }\n        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(pr_img);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0),\n                                           true, false);\n    
CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT {\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good()) {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT {\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n"
  },
  {
    "path": "yolov10/src/model.cpp",
    "content": "#include <cmath>\n#include <iostream>\n\n#include \"block.h\"\n#include \"calibrator.h\"\n#include \"config.h\"\n#include \"model.h\"\n\nstatic int get_width(int x, float gw, int max_channels, int divisor = 8) {\n    int c = std::min(x, max_channels);\n    auto channel = int(ceil((c * gw) / divisor)) * divisor;\n    return channel;\n}\n\nstatic int get_depth(int x, float gd) {\n    if (x == 1)\n        return 1;\n    int r = round(x * gd);\n    if (x * gd - int(x * gd) == 0.5 && (int(x * gd) % 2) == 0)\n        --r;\n    return std::max<int>(r, 1);\n}\n\nvoid calculateStrides(nvinfer1::ILayer* conv_layers[], int size, int reference_size, int strides[]) {\n    for (int i = 0; i < size; ++i) {\n        nvinfer1::ILayer* layer = conv_layers[i];\n        nvinfer1::Dims dims = layer->getOutput(0)->getDimensions();\n        int feature_map_size = dims.d[2];\n        strides[i] = reference_size / feature_map_size;\n    }\n}\n\nnvinfer1::IHostMemory* buildEngineYolov10DetN(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n    ******************************************  YOLOV10 INPUT  **********************************************\n    *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, 
nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLOV10 BACKBONE  ********************************************\n    *******************************************************************************************************/\n    auto* conv0 = convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, \"model.0\");\n    auto* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, \"model.1\");\n    // 11233\n    auto* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                      get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    auto* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, \"model.3\");\n    // 22466\n    auto* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                      get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    auto* conv5 = SCDown(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, \"model.5\");\n    // 22466\n    auto* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                      get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n    auto* conv7 = SCDown(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, \"model.7\");\n    // 11233\n    auto* conv8 = C2F(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                      get_width(1024, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.8\");\n    auto* conv9 = SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n        
               get_width(1024, gw, max_channels), 5, \"model.9\");\n    auto* conv10 = PSA(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels), \"model.10\");\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*conv10->getOutput(0));\n    assert(upsample11);\n    upsample11->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensor12, 2);\n\n    auto* conv13 = C2F(network, weightMap, *cat12->getOutput(0), get_width(512, gw, max_channels),\n                       get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 = network->addResize(*conv13->getOutput(0));\n    assert(upsample14);\n    upsample14->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor15[] = {upsample14->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensor15, 2);\n\n    auto* conv16 = C2F(network, weightMap, *cat15->getOutput(0), get_width(256, gw, max_channels),\n                       get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.16\");\n    auto* conv17 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), 3, 2, \"model.17\");\n    nvinfer1::ITensor* inputTensor18[] = {conv17->getOutput(0), conv13->getOutput(0)};\n    
nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensor18, 2);\n    auto* conv19 = C2F(network, weightMap, *cat18->getOutput(0), get_width(512, gw, max_channels),\n                       get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.19\");\n    auto* conv20 =\n            SCDown(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), 3, 2, \"model.20\");\n    nvinfer1::ITensor* inputTensor21[] = {conv20->getOutput(0), conv10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensor21, 2);\n    auto* conv22 = C2fCIB(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                          get_width(1024, gw, max_channels), get_depth(3, gd), true, true, 0.5, \"model.22\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 OUTPUT  ******************************************\n    *******************************************************************************************************/\n    auto d = conv16->getOutput(0)->getDimensions();\n    assert(d.nbDims == 4);\n    int ch_0 = d.d[1];\n    int base_in_channel = std::max(16, std::max(ch_0 / 4, 16 * 4));\n    int base_out_channel = std::max(ch_0, std::min(kNumClass, 100));\n\n    // output0\n    auto* conv23_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.0.0\");\n    auto* conv23_cv2_0_1 = convBnSiLU(network, weightMap, *conv23_cv2_0_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_0_2 = network->addConvolutionNd(\n            *conv23_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.0.2.weight\"],\n            
weightMap[\"model.23.one2one_cv2.0.2.bias\"]);\n    conv23_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_0_0_0 = convBnSiLU(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.0.0.0\", get_width(256, gw, max_channels));\n    auto* conv23_cv3_0_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_0_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.0.0.1\");\n    auto* conv23_cv3_0_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_0_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.0.1.0\", base_out_channel);\n    auto* conv23_cv3_0_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_0_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.0.1.1\");\n    auto* conv23_cv3_0_2 = network->addConvolutionNd(*conv23_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.0.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.0.2.bias\"]);\n    conv23_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_0[] = {conv23_cv2_0_2->getOutput(0), conv23_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_0 = network->addConcatenation(inputTensor23_0, 2);\n\n    // output1\n    auto* conv23_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv19->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.1.0\");\n    auto* conv23_cv2_1_1 = convBnSiLU(network, weightMap, *conv23_cv2_1_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.1.1\");\n    
nvinfer1::IConvolutionLayer* conv23_cv2_1_2 = network->addConvolutionNd(\n            *conv23_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.1.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.1.2.bias\"]);\n    conv23_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_1_0_0 = convBnSiLU(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.1.0.0\", get_width(512, gw, max_channels));\n    auto* conv23_cv3_1_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_1_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.1.0.1\");\n    auto* conv23_cv3_1_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_1_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.1.1.0\", base_out_channel);\n    auto* conv23_cv3_1_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_1_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.1.1.1\");\n    auto* conv23_cv3_1_2 = network->addConvolutionNd(*conv23_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.1.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.1.2.bias\"]);\n    conv23_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_1[] = {conv23_cv2_1_2->getOutput(0), conv23_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(inputTensor23_1, 2);\n\n    // output2\n    auto* conv23_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv22->getOutput(0), base_in_channel, 3, 1, 
\"model.23.one2one_cv2.2.0\");\n    auto* conv23_cv2_2_1 = convBnSiLU(network, weightMap, *conv23_cv2_2_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_2_2 = network->addConvolutionNd(\n            *conv23_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.2.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.2.2.bias\"]);\n    auto* conv23_cv3_2_0_0 = convBnSiLU(network, weightMap, *conv22->getOutput(0), get_width(1024, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.2.0.0\", get_width(1024, gw, max_channels));\n    auto* conv23_cv3_2_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_2_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.2.0.1\");\n    auto* conv23_cv3_2_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_2_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.2.1.0\", base_out_channel);\n    auto* conv23_cv3_2_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_2_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.2.1.1\");\n    auto* conv23_cv3_2_2 = network->addConvolutionNd(*conv23_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.2.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.2.2.bias\"]);\n    nvinfer1::ITensor* inputTensor23_2[] = {conv23_cv2_2_2->getOutput(0), conv23_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(inputTensor23_2, 2);\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 
DETECT  ******************************************\n    *******************************************************************************************************/\n\n    nvinfer1::ILayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle23_0 = network->addShuffle(*cat23_0->getOutput(0));\n    shuffle23_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split23_0_0 = network->addSlice(\n            *shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_0_1 =\n            network->addSlice(*shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl23_0 =\n            DFL(network, weightMap, *split23_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_0[] = {dfl23_0->getOutput(0), split23_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_0 = network->addConcatenation(inputTensor23_dfl_0, 2);\n    cat23_dfl_0->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_1 = network->addShuffle(*cat23_1->getOutput(0));\n    shuffle23_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split23_1_0 = network->addSlice(\n            
*shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_1_1 =\n            network->addSlice(*shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_1 =\n            DFL(network, weightMap, *split23_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_1[] = {dfl23_1->getOutput(0), split23_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_1 = network->addConcatenation(inputTensor23_dfl_1, 2);\n    cat23_dfl_1->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_2 = network->addShuffle(*cat23_2->getOutput(0));\n    shuffle23_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split23_2_0 = network->addSlice(\n            *shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_2_1 =\n            network->addSlice(*shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_2 =\n            DFL(network, weightMap, *split23_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_2[] = 
{dfl23_2->getOutput(0), split23_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_2 = network->addConcatenation(inputTensor23_dfl_2, 2);\n    cat23_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(\n            network, std::vector<nvinfer1::ILayer*>{cat23_dfl_0, cat23_dfl_1, cat23_dfl_2}, strides, stridesLength);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov10DetS(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n       
     1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n    ******************************************  YOLOV10 INPUT  **********************************************\n    *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLOV10 BACKBONE  ********************************************\n    *******************************************************************************************************/\n    auto* conv0 = convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, \"model.0\");\n    auto* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, \"model.1\");\n    // 11233\n    auto* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                      get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    auto* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, \"model.3\");\n    // 22466\n    auto* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                      get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    auto* conv5 = SCDown(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, \"model.5\");\n    // 22466\n    auto* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                      get_width(512, gw, 
max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n    auto* conv7 = SCDown(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, \"model.7\");\n    // 11233\n    auto* conv8 = C2fCIB(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                         get_width(1024, gw, max_channels), get_depth(3, gd), true, true, 0.5, \"model.8\");\n    auto* conv9 = SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                       get_width(1024, gw, max_channels), 5, \"model.9\");\n    auto* conv10 = PSA(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels), \"model.10\");\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*conv10->getOutput(0));\n    assert(upsample11);\n    upsample11->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensor12, 2);\n\n    auto* conv13 = C2F(network, weightMap, *cat12->getOutput(0), get_width(512, gw, max_channels),\n                       get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 = network->addResize(*conv13->getOutput(0));\n    assert(upsample14);\n    upsample14->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor15[] = {upsample14->getOutput(0), conv4->getOutput(0)};\n    
nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensor15, 2);\n\n    auto* conv16 = C2F(network, weightMap, *cat15->getOutput(0), get_width(256, gw, max_channels),\n                       get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.16\");\n    auto* conv17 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), 3, 2, \"model.17\");\n    nvinfer1::ITensor* inputTensor18[] = {conv17->getOutput(0), conv13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensor18, 2);\n    auto* conv19 = C2F(network, weightMap, *cat18->getOutput(0), get_width(512, gw, max_channels),\n                       get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.19\");\n    auto* conv20 =\n            SCDown(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), 3, 2, \"model.20\");\n    nvinfer1::ITensor* inputTensor21[] = {conv20->getOutput(0), conv10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensor21, 2);\n    auto* conv22 = C2fCIB(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                          get_width(1024, gw, max_channels), get_depth(3, gd), true, true, 0.5, \"model.22\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 OUTPUT  ******************************************\n    *******************************************************************************************************/\n    auto d = conv16->getOutput(0)->getDimensions();\n    assert(d.nbDims == 4);\n    int ch_0 = d.d[1];\n    int base_in_channel = std::max(16, std::max(ch_0 / 4, 16 * 4));\n    int base_out_channel = std::max(ch_0, std::min(kNumClass, 100));\n\n    // output0\n    auto* conv23_cv2_0_0 =\n            convBnSiLU(network, 
weightMap, *conv16->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.0.0\");\n    auto* conv23_cv2_0_1 = convBnSiLU(network, weightMap, *conv23_cv2_0_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_0_2 = network->addConvolutionNd(\n            *conv23_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.0.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.0.2.bias\"]);\n    conv23_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_0_0_0 = convBnSiLU(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.0.0.0\", get_width(256, gw, max_channels));\n    auto* conv23_cv3_0_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_0_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.0.0.1\");\n    auto* conv23_cv3_0_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_0_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.0.1.0\", base_out_channel);\n    auto* conv23_cv3_0_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_0_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.0.1.1\");\n    auto* conv23_cv3_0_2 = network->addConvolutionNd(*conv23_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.0.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.0.2.bias\"]);\n    conv23_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_0[] = {conv23_cv2_0_2->getOutput(0), 
conv23_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_0 = network->addConcatenation(inputTensor23_0, 2);\n\n    // output1\n    auto* conv23_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv19->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.1.0\");\n    auto* conv23_cv2_1_1 = convBnSiLU(network, weightMap, *conv23_cv2_1_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_1_2 = network->addConvolutionNd(\n            *conv23_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.1.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.1.2.bias\"]);\n    conv23_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_1_0_0 = convBnSiLU(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.1.0.0\", get_width(512, gw, max_channels));\n    auto* conv23_cv3_1_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_1_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.1.0.1\");\n    auto* conv23_cv3_1_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_1_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.1.1.0\", base_out_channel);\n    auto* conv23_cv3_1_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_1_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.1.1.1\");\n    auto* conv23_cv3_1_2 = network->addConvolutionNd(*conv23_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.1.2.weight\"],\n                                                     
weightMap[\"model.23.one2one_cv3.1.2.bias\"]);\n    conv23_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_1[] = {conv23_cv2_1_2->getOutput(0), conv23_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(inputTensor23_1, 2);\n\n    // output2\n    auto* conv23_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv22->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.2.0\");\n    auto* conv23_cv2_2_1 = convBnSiLU(network, weightMap, *conv23_cv2_2_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_2_2 = network->addConvolutionNd(\n            *conv23_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.2.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.2.2.bias\"]);\n    auto* conv23_cv3_2_0_0 = convBnSiLU(network, weightMap, *conv22->getOutput(0), get_width(1024, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.2.0.0\", get_width(1024, gw, max_channels));\n    auto* conv23_cv3_2_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_2_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.2.0.1\");\n    auto* conv23_cv3_2_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_2_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.2.1.0\", base_out_channel);\n    auto* conv23_cv3_2_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_2_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.2.1.1\");\n    auto* conv23_cv3_2_2 = network->addConvolutionNd(*conv23_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     
weightMap[\"model.23.one2one_cv3.2.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.2.2.bias\"]);\n    nvinfer1::ITensor* inputTensor23_2[] = {conv23_cv2_2_2->getOutput(0), conv23_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(inputTensor23_2, 2);\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 DETECT  ******************************************\n    *******************************************************************************************************/\n\n    nvinfer1::ILayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle23_0 = network->addShuffle(*cat23_0->getOutput(0));\n    shuffle23_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split23_0_0 = network->addSlice(\n            *shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_0_1 =\n            network->addSlice(*shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl23_0 =\n            DFL(network, weightMap, *split23_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* 
inputTensor23_dfl_0[] = {dfl23_0->getOutput(0), split23_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_0 = network->addConcatenation(inputTensor23_dfl_0, 2);\n    cat23_dfl_0->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_1 = network->addShuffle(*cat23_1->getOutput(0));\n    shuffle23_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split23_1_0 = network->addSlice(\n            *shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_1_1 =\n            network->addSlice(*shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_1 =\n            DFL(network, weightMap, *split23_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_1[] = {dfl23_1->getOutput(0), split23_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_1 = network->addConcatenation(inputTensor23_dfl_1, 2);\n    cat23_dfl_1->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_2 = network->addShuffle(*cat23_2->getOutput(0));\n    shuffle23_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split23_2_0 = network->addSlice(\n            *shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_2_1 =\n            
network->addSlice(*shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_2 =\n            DFL(network, weightMap, *split23_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_2[] = {dfl23_2->getOutput(0), split23_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_2 = network->addConcatenation(inputTensor23_dfl_2, 2);\n    cat23_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(\n            network, std::vector<nvinfer1::ILayer*>{cat23_dfl_0, cat23_dfl_1, cat23_dfl_2}, strides, stridesLength);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov10DetM(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n    ******************************************  YOLOV10 INPUT  **********************************************\n    *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    
*****************************************  YOLOV10 BACKBONE  ********************************************\n    *******************************************************************************************************/\n    auto* conv0 = convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, \"model.0\");\n    auto* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, \"model.1\");\n    // 11233\n    auto* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                      get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    auto* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, \"model.3\");\n    // 22466\n    auto* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                      get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    auto* conv5 = SCDown(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, \"model.5\");\n    // 22466\n    auto* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                      get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n    auto* conv7 = SCDown(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, \"model.7\");\n    // 11233\n    auto* conv8 = C2fCIB(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                         get_width(1024, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.8\");\n    auto* conv9 = SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                       get_width(1024, gw, max_channels), 5, \"model.9\");\n    auto* conv10 = PSA(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels), 
\"model.10\");\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*conv10->getOutput(0));\n    assert(upsample11);\n    upsample11->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensor12, 2);\n\n    auto* conv13 = C2F(network, weightMap, *cat12->getOutput(0), get_width(512, gw, max_channels),\n                       get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 = network->addResize(*conv13->getOutput(0));\n    assert(upsample14);\n    upsample14->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor15[] = {upsample14->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensor15, 2);\n\n    auto* conv16 = C2F(network, weightMap, *cat15->getOutput(0), get_width(256, gw, max_channels),\n                       get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.16\");\n    auto* conv17 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), 3, 2, \"model.17\");\n    nvinfer1::ITensor* inputTensor18[] = {conv17->getOutput(0), conv13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensor18, 2);\n    auto* conv19 = C2fCIB(network, weightMap, *cat18->getOutput(0), get_width(512, gw, 
max_channels),\n                          get_width(512, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.19\");\n    auto* conv20 =\n            SCDown(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), 3, 2, \"model.20\");\n    nvinfer1::ITensor* inputTensor21[] = {conv20->getOutput(0), conv10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensor21, 2);\n    auto* conv22 = C2fCIB(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                          get_width(1024, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.22\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 OUTPUT  ******************************************\n    *******************************************************************************************************/\n    auto d = conv16->getOutput(0)->getDimensions();\n    assert(d.nbDims == 4);\n    int ch_0 = d.d[1];\n    int base_in_channel = std::max(16, std::max(ch_0 / 4, 16 * 4));\n    int base_out_channel = std::max(ch_0, std::min(kNumClass, 100));\n\n    // output0\n    auto* conv23_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.0.0\");\n    auto* conv23_cv2_0_1 = convBnSiLU(network, weightMap, *conv23_cv2_0_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_0_2 = network->addConvolutionNd(\n            *conv23_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.0.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.0.2.bias\"]);\n    conv23_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_0_0_0 = 
convBnSiLU(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.0.0.0\", get_width(256, gw, max_channels));\n    auto* conv23_cv3_0_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_0_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.0.0.1\");\n    auto* conv23_cv3_0_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_0_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.0.1.0\", base_out_channel);\n    auto* conv23_cv3_0_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_0_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.0.1.1\");\n    auto* conv23_cv3_0_2 = network->addConvolutionNd(*conv23_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.0.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.0.2.bias\"]);\n    conv23_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_0[] = {conv23_cv2_0_2->getOutput(0), conv23_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_0 = network->addConcatenation(inputTensor23_0, 2);\n\n    // output1\n    auto* conv23_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv19->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.1.0\");\n    auto* conv23_cv2_1_1 = convBnSiLU(network, weightMap, *conv23_cv2_1_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_1_2 = network->addConvolutionNd(\n            *conv23_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.1.2.weight\"],\n            
weightMap[\"model.23.one2one_cv2.1.2.bias\"]);\n    conv23_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_1_0_0 = convBnSiLU(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.1.0.0\", get_width(512, gw, max_channels));\n    auto* conv23_cv3_1_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_1_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.1.0.1\");\n    auto* conv23_cv3_1_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_1_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.1.1.0\", base_out_channel);\n    auto* conv23_cv3_1_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_1_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.1.1.1\");\n    auto* conv23_cv3_1_2 = network->addConvolutionNd(*conv23_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.1.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.1.2.bias\"]);\n    conv23_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_1[] = {conv23_cv2_1_2->getOutput(0), conv23_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(inputTensor23_1, 2);\n\n    // output2\n    auto* conv23_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv22->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.2.0\");\n    auto* conv23_cv2_2_1 = convBnSiLU(network, weightMap, *conv23_cv2_2_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.2.1\");\n    
nvinfer1::IConvolutionLayer* conv23_cv2_2_2 = network->addConvolutionNd(\n            *conv23_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.2.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.2.2.bias\"]);\n    auto* conv23_cv3_2_0_0 = convBnSiLU(network, weightMap, *conv22->getOutput(0), get_width(1024, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.2.0.0\", get_width(1024, gw, max_channels));\n    auto* conv23_cv3_2_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_2_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.2.0.1\");\n    auto* conv23_cv3_2_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_2_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.2.1.0\", base_out_channel);\n    auto* conv23_cv3_2_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_2_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.2.1.1\");\n    auto* conv23_cv3_2_2 = network->addConvolutionNd(*conv23_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.2.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.2.2.bias\"]);\n    nvinfer1::ITensor* inputTensor23_2[] = {conv23_cv2_2_2->getOutput(0), conv23_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(inputTensor23_2, 2);\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 DETECT  ******************************************\n    *******************************************************************************************************/\n\n    nvinfer1::ILayer* conv_layers[] = {conv3, conv5, 
conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle23_0 = network->addShuffle(*cat23_0->getOutput(0));\n    shuffle23_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split23_0_0 = network->addSlice(\n            *shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_0_1 =\n            network->addSlice(*shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl23_0 =\n            DFL(network, weightMap, *split23_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_0[] = {dfl23_0->getOutput(0), split23_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_0 = network->addConcatenation(inputTensor23_dfl_0, 2);\n    cat23_dfl_0->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_1 = network->addShuffle(*cat23_1->getOutput(0));\n    shuffle23_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split23_1_0 = network->addSlice(\n            *shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_1_1 
=\n            network->addSlice(*shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_1 =\n            DFL(network, weightMap, *split23_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_1[] = {dfl23_1->getOutput(0), split23_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_1 = network->addConcatenation(inputTensor23_dfl_1, 2);\n    cat23_dfl_1->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_2 = network->addShuffle(*cat23_2->getOutput(0));\n    shuffle23_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split23_2_0 = network->addSlice(\n            *shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_2_1 =\n            network->addSlice(*shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_2 =\n            DFL(network, weightMap, *split23_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_2[] = {dfl23_2->getOutput(0), split23_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_2 = network->addConcatenation(inputTensor23_dfl_2, 2);\n    cat23_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo = 
addYoLoLayer(\n            network, std::vector<nvinfer1::ILayer*>{cat23_dfl_0, cat23_dfl_1, cat23_dfl_2}, strides, stridesLength);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov10DetBL(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                               nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                               int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n    
******************************************  YOLOV10 INPUT  **********************************************\n    *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLOV10 BACKBONE  ********************************************\n    *******************************************************************************************************/\n    auto* conv0 = convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, \"model.0\");\n    auto* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, \"model.1\");\n    // 11233\n    auto* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                      get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    auto* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, \"model.3\");\n    // 22466\n    auto* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                      get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    auto* conv5 = SCDown(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, \"model.5\");\n    // 22466\n    auto* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                      get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n    auto* conv7 = SCDown(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, \"model.7\");\n    // 11233\n    auto* conv8 = 
C2fCIB(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                         get_width(1024, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.8\");\n    auto* conv9 = SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                       get_width(1024, gw, max_channels), 5, \"model.9\");\n    auto* conv10 = PSA(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels), \"model.10\");\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*conv10->getOutput(0));\n    assert(upsample11);\n    upsample11->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensor12, 2);\n\n    auto* conv13 = C2fCIB(network, weightMap, *cat12->getOutput(0), get_width(512, gw, max_channels),\n                          get_width(512, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 = network->addResize(*conv13->getOutput(0));\n    assert(upsample14);\n    upsample14->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor15[] = {upsample14->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensor15, 2);\n\n    auto* conv16 = C2F(network, weightMap, *cat15->getOutput(0), get_width(256, gw, max_channels),\n                       
get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.16\");\n    auto* conv17 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), 3, 2, \"model.17\");\n    nvinfer1::ITensor* inputTensor18[] = {conv17->getOutput(0), conv13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensor18, 2);\n    auto* conv19 = C2fCIB(network, weightMap, *cat18->getOutput(0), get_width(512, gw, max_channels),\n                          get_width(512, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.19\");\n    auto* conv20 =\n            SCDown(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), 3, 2, \"model.20\");\n    nvinfer1::ITensor* inputTensor21[] = {conv20->getOutput(0), conv10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensor21, 2);\n    auto* conv22 = C2fCIB(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                          get_width(1024, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.22\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 OUTPUT  ******************************************\n    *******************************************************************************************************/\n    auto d = conv16->getOutput(0)->getDimensions();\n    assert(d.nbDims == 4);\n    int ch_0 = d.d[1];\n    int base_in_channel = std::max(16, std::max(ch_0 / 4, 16 * 4));\n    int base_out_channel = std::max(ch_0, std::min(kNumClass, 100));\n\n    // output0\n    auto* conv23_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.0.0\");\n    auto* conv23_cv2_0_1 = convBnSiLU(network, weightMap, *conv23_cv2_0_0->getOutput(0), base_in_channel, 3, 
1,\n                                      \"model.23.one2one_cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_0_2 = network->addConvolutionNd(\n            *conv23_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.0.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.0.2.bias\"]);\n    conv23_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_0_0_0 = convBnSiLU(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.0.0.0\", get_width(256, gw, max_channels));\n    auto* conv23_cv3_0_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_0_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.0.0.1\");\n    auto* conv23_cv3_0_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_0_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.0.1.0\", base_out_channel);\n    auto* conv23_cv3_0_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_0_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.0.1.1\");\n    auto* conv23_cv3_0_2 = network->addConvolutionNd(*conv23_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.0.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.0.2.bias\"]);\n    conv23_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_0[] = {conv23_cv2_0_2->getOutput(0), conv23_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_0 = network->addConcatenation(inputTensor23_0, 2);\n\n    // output1\n    auto* conv23_cv2_1_0 =\n            convBnSiLU(network, 
weightMap, *conv19->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.1.0\");\n    auto* conv23_cv2_1_1 = convBnSiLU(network, weightMap, *conv23_cv2_1_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_1_2 = network->addConvolutionNd(\n            *conv23_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.1.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.1.2.bias\"]);\n    conv23_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_1_0_0 = convBnSiLU(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.1.0.0\", get_width(512, gw, max_channels));\n    auto* conv23_cv3_1_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_1_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.1.0.1\");\n    auto* conv23_cv3_1_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_1_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.1.1.0\", base_out_channel);\n    auto* conv23_cv3_1_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_1_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.1.1.1\");\n    auto* conv23_cv3_1_2 = network->addConvolutionNd(*conv23_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.1.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.1.2.bias\"]);\n    conv23_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_1[] = {conv23_cv2_1_2->getOutput(0), 
conv23_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(inputTensor23_1, 2);\n\n    // output2\n    auto* conv23_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv22->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.2.0\");\n    auto* conv23_cv2_2_1 = convBnSiLU(network, weightMap, *conv23_cv2_2_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_2_2 = network->addConvolutionNd(\n            *conv23_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.2.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.2.2.bias\"]);\n    auto* conv23_cv3_2_0_0 = convBnSiLU(network, weightMap, *conv22->getOutput(0), get_width(1024, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.2.0.0\", get_width(1024, gw, max_channels));\n    auto* conv23_cv3_2_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_2_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.2.0.1\");\n    auto* conv23_cv3_2_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_2_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.2.1.0\", base_out_channel);\n    auto* conv23_cv3_2_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_2_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.2.1.1\");\n    auto* conv23_cv3_2_2 = network->addConvolutionNd(*conv23_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.2.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.2.2.bias\"]);\n    nvinfer1::ITensor* inputTensor23_2[] = {conv23_cv2_2_2->getOutput(0), conv23_cv3_2_2->getOutput(0)};\n    
nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(inputTensor23_2, 2);\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 DETECT  ******************************************\n    *******************************************************************************************************/\n\n    nvinfer1::ILayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle23_0 = network->addShuffle(*cat23_0->getOutput(0));\n    shuffle23_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split23_0_0 = network->addSlice(\n            *shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_0_1 =\n            network->addSlice(*shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl23_0 =\n            DFL(network, weightMap, *split23_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_0[] = {dfl23_0->getOutput(0), split23_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_0 = network->addConcatenation(inputTensor23_dfl_0, 2);\n    cat23_dfl_0->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_1 = 
network->addShuffle(*cat23_1->getOutput(0));\n    shuffle23_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split23_1_0 = network->addSlice(\n            *shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_1_1 =\n            network->addSlice(*shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_1 =\n            DFL(network, weightMap, *split23_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_1[] = {dfl23_1->getOutput(0), split23_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_1 = network->addConcatenation(inputTensor23_dfl_1, 2);\n    cat23_dfl_1->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_2 = network->addShuffle(*cat23_2->getOutput(0));\n    shuffle23_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split23_2_0 = network->addSlice(\n            *shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_2_1 =\n            network->addSlice(*shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    
nvinfer1::IShuffleLayer* dfl23_2 =\n            DFL(network, weightMap, *split23_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_2[] = {dfl23_2->getOutput(0), split23_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_2 = network->addConcatenation(inputTensor23_dfl_2, 2);\n    cat23_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(\n            network, std::vector<nvinfer1::ILayer*>{cat23_dfl_0, cat23_dfl_1, cat23_dfl_2}, strides, stridesLength);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov10DetX(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              
int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n    ******************************************  YOLOV10 INPUT  **********************************************\n    *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLOV10 BACKBONE  ********************************************\n    *******************************************************************************************************/\n    auto* conv0 = convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, \"model.0\");\n    auto* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, \"model.1\");\n    // 11233\n    auto* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                      get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    auto* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, \"model.3\");\n    // 22466\n    auto* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                      get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    auto* conv5 = 
SCDown(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, \"model.5\");\n    // 22466\n    auto* conv6 = C2fCIB(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                         get_width(512, gw, max_channels), get_depth(6, gd), true, false, 0.5, \"model.6\");\n    auto* conv7 = SCDown(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, \"model.7\");\n    // 11233\n    auto* conv8 = C2fCIB(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                         get_width(1024, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.8\");\n    auto* conv9 = SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                       get_width(1024, gw, max_channels), 5, \"model.9\");\n    auto* conv10 = PSA(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels), \"model.10\");\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*conv10->getOutput(0));\n    assert(upsample11);\n    upsample11->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample11->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensor12, 2);\n\n    auto* conv13 = C2fCIB(network, weightMap, *cat12->getOutput(0), get_width(512, gw, max_channels),\n                          get_width(512, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.13\");\n\n    nvinfer1::IResizeLayer* upsample14 
= network->addResize(*conv13->getOutput(0));\n    assert(upsample14);\n    upsample14->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample14->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor15[] = {upsample14->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat15 = network->addConcatenation(inputTensor15, 2);\n\n    auto* conv16 = C2F(network, weightMap, *cat15->getOutput(0), get_width(256, gw, max_channels),\n                       get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.16\");\n    auto* conv17 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), 3, 2, \"model.17\");\n    nvinfer1::ITensor* inputTensor18[] = {conv17->getOutput(0), conv13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensor18, 2);\n    auto* conv19 = C2fCIB(network, weightMap, *cat18->getOutput(0), get_width(512, gw, max_channels),\n                          get_width(512, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.19\");\n    auto* conv20 =\n            SCDown(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), 3, 2, \"model.20\");\n    nvinfer1::ITensor* inputTensor21[] = {conv20->getOutput(0), conv10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21 = network->addConcatenation(inputTensor21, 2);\n    auto* conv22 = C2fCIB(network, weightMap, *cat21->getOutput(0), get_width(1024, gw, max_channels),\n                          get_width(1024, gw, max_channels), get_depth(3, gd), true, false, 0.5, \"model.22\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 OUTPUT  ******************************************\n    *******************************************************************************************************/\n    auto d = 
conv16->getOutput(0)->getDimensions();\n    assert(d.nbDims == 4);\n    int ch_0 = d.d[1];\n    int base_in_channel = std::max(16, std::max(ch_0 / 4, 16 * 4));\n    int base_out_channel = std::max(ch_0, std::min(kNumClass, 100));\n\n    // output0\n    auto* conv23_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv16->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.0.0\");\n    auto* conv23_cv2_0_1 = convBnSiLU(network, weightMap, *conv23_cv2_0_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_0_2 = network->addConvolutionNd(\n            *conv23_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.0.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.0.2.bias\"]);\n    conv23_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_0_0_0 = convBnSiLU(network, weightMap, *conv16->getOutput(0), get_width(256, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.0.0.0\", get_width(256, gw, max_channels));\n    auto* conv23_cv3_0_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_0_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.0.0.1\");\n    auto* conv23_cv3_0_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_0_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.0.1.0\", base_out_channel);\n    auto* conv23_cv3_0_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_0_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.0.1.1\");\n    auto* conv23_cv3_0_2 = network->addConvolutionNd(*conv23_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     
weightMap[\"model.23.one2one_cv3.0.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.0.2.bias\"]);\n    conv23_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_0[] = {conv23_cv2_0_2->getOutput(0), conv23_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_0 = network->addConcatenation(inputTensor23_0, 2);\n\n    // output1\n    auto* conv23_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv19->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.1.0\");\n    auto* conv23_cv2_1_1 = convBnSiLU(network, weightMap, *conv23_cv2_1_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_1_2 = network->addConvolutionNd(\n            *conv23_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.1.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.1.2.bias\"]);\n    conv23_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv23_cv3_1_0_0 = convBnSiLU(network, weightMap, *conv19->getOutput(0), get_width(512, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.1.0.0\", get_width(512, gw, max_channels));\n    auto* conv23_cv3_1_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_1_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.1.0.1\");\n    auto* conv23_cv3_1_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_1_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.1.1.0\", base_out_channel);\n    auto* conv23_cv3_1_1_1 = convBnSiLU(network, weightMap, *conv23_cv3_1_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        
\"model.23.one2one_cv3.1.1.1\");\n    auto* conv23_cv3_1_2 = network->addConvolutionNd(*conv23_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.1.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.1.2.bias\"]);\n    conv23_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv23_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor23_1[] = {conv23_cv2_1_2->getOutput(0), conv23_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_1 = network->addConcatenation(inputTensor23_1, 2);\n\n    // output2\n    auto* conv23_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv22->getOutput(0), base_in_channel, 3, 1, \"model.23.one2one_cv2.2.0\");\n    auto* conv23_cv2_2_1 = convBnSiLU(network, weightMap, *conv23_cv2_2_0->getOutput(0), base_in_channel, 3, 1,\n                                      \"model.23.one2one_cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv23_cv2_2_2 = network->addConvolutionNd(\n            *conv23_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1}, weightMap[\"model.23.one2one_cv2.2.2.weight\"],\n            weightMap[\"model.23.one2one_cv2.2.2.bias\"]);\n    auto* conv23_cv3_2_0_0 = convBnSiLU(network, weightMap, *conv22->getOutput(0), get_width(1024, gw, max_channels), 3,\n                                        1, \"model.23.one2one_cv3.2.0.0\", get_width(1024, gw, max_channels));\n    auto* conv23_cv3_2_0_1 = convBnSiLU(network, weightMap, *conv23_cv3_2_0_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.2.0.1\");\n    auto* conv23_cv3_2_1_0 = convBnSiLU(network, weightMap, *conv23_cv3_2_0_1->getOutput(0), base_out_channel, 3, 1,\n                                        \"model.23.one2one_cv3.2.1.0\", base_out_channel);\n    auto* conv23_cv3_2_1_1 = convBnSiLU(network, weightMap, 
*conv23_cv3_2_1_0->getOutput(0), base_out_channel, 1, 1,\n                                        \"model.23.one2one_cv3.2.1.1\");\n    auto* conv23_cv3_2_2 = network->addConvolutionNd(*conv23_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                                     weightMap[\"model.23.one2one_cv3.2.2.weight\"],\n                                                     weightMap[\"model.23.one2one_cv3.2.2.bias\"]);\n    nvinfer1::ITensor* inputTensor23_2[] = {conv23_cv2_2_2->getOutput(0), conv23_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_2 = network->addConcatenation(inputTensor23_2, 2);\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV10 DETECT  ******************************************\n    *******************************************************************************************************/\n\n    nvinfer1::ILayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle23_0 = network->addShuffle(*cat23_0->getOutput(0));\n    shuffle23_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split23_0_0 = network->addSlice(\n            *shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_0_1 =\n            network->addSlice(*shuffle23_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / 
strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl23_0 =\n            DFL(network, weightMap, *split23_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_0[] = {dfl23_0->getOutput(0), split23_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_0 = network->addConcatenation(inputTensor23_dfl_0, 2);\n    cat23_dfl_0->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_1 = network->addShuffle(*cat23_1->getOutput(0));\n    shuffle23_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split23_1_0 = network->addSlice(\n            *shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_1_1 =\n            network->addSlice(*shuffle23_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_1 =\n            DFL(network, weightMap, *split23_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_1[] = {dfl23_1->getOutput(0), split23_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_1 = network->addConcatenation(inputTensor23_dfl_1, 2);\n    cat23_dfl_1->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle23_2 = network->addShuffle(*cat23_2->getOutput(0));\n    shuffle23_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n 
   nvinfer1::ISliceLayer* split23_2_0 = network->addSlice(\n            *shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split23_2_1 =\n            network->addSlice(*shuffle23_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl23_2 =\n            DFL(network, weightMap, *split23_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.23.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor23_dfl_2[] = {dfl23_2->getOutput(0), split23_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat23_dfl_2 = network->addConcatenation(inputTensor23_dfl_2, 2);\n    cat23_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(\n            network, std::vector<nvinfer1::ILayer*>{cat23_dfl_0, cat23_dfl_1, cat23_dfl_2}, strides, stridesLength);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n"
  },
  {
    "path": "yolov10/src/postprocess.cpp",
    "content": "#include \"postprocess.h\"\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\nvoid get_topk(std::vector<Detection>& res, float* output, float conf_thresh, int tokp) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    for (int i = 0; i < output[0]; i++) {\n        if (output[1 + det_size * i + 4] <= conf_thresh)\n            continue;\n        Detection det{};\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        res.push_back(det);\n    }\n}\n\nvoid batch_topk(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n                float conf_thresh, int topk) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        get_topk(res_batch[i], &output[i * output_size], conf_thresh, topk);\n    }\n}\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = 
img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect(img, res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n    }\n}\n"
  },
  {
    "path": "yolov10/src/preprocess.cu",
    "content": "#include \"cuda_utils.h\"\n#include \"preprocess.h\"\n\nstatic uint8_t* img_buffer_host = nullptr;\nstatic uint8_t* img_buffer_device = nullptr;\n\n__global__ void warpaffine_kernel(uint8_t* src, int src_line_size, int src_width, int src_height, float* dst,\n                                  int dst_width, int dst_height, uint8_t const_value_st, AffineMatrix d2s, int edge) {\n    int position = blockDim.x * blockIdx.x + threadIdx.x;\n    if (position >= edge)\n        return;\n\n    float m_x1 = d2s.value[0];\n    float m_y1 = d2s.value[1];\n    float m_z1 = d2s.value[2];\n    float m_x2 = d2s.value[3];\n    float m_y2 = d2s.value[4];\n    float m_z2 = d2s.value[5];\n\n    int dx = position % dst_width;\n    int dy = position / dst_width;\n    float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;\n    float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;\n    float c0, c1, c2;\n\n    if (src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height) {\n        // out of range\n        c0 = const_value_st;\n        c1 = const_value_st;\n        c2 = const_value_st;\n    } else {\n        int y_low = floorf(src_y);\n        int x_low = floorf(src_x);\n        int y_high = y_low + 1;\n        int x_high = x_low + 1;\n\n        uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};\n        float ly = src_y - y_low;\n        float lx = src_x - x_low;\n        float hy = 1 - ly;\n        float hx = 1 - lx;\n        float w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n        uint8_t* v1 = const_value;\n        uint8_t* v2 = const_value;\n        uint8_t* v3 = const_value;\n        uint8_t* v4 = const_value;\n\n        if (y_low >= 0) {\n            if (x_low >= 0)\n                v1 = src + y_low * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v2 = src + y_low * src_line_size + x_high * 3;\n        }\n\n        if (y_high < src_height) {\n            if (x_low >= 0)\n                
v3 = src + y_high * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v4 = src + y_high * src_line_size + x_high * 3;\n        }\n\n        c0 = w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0];\n        c1 = w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1];\n        c2 = w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2];\n    }\n\n    // bgr to rgb\n    float t = c2;\n    c2 = c0;\n    c0 = t;\n\n    // normalization\n    c0 = c0 / 255.0f;\n    c1 = c1 / 255.0f;\n    c2 = c2 / 255.0f;\n\n    // rgbrgbrgb to rrrgggbbb\n    int area = dst_width * dst_height;\n    float* pdst_c0 = dst + dy * dst_width + dx;\n    float* pdst_c1 = pdst_c0 + area;\n    float* pdst_c2 = pdst_c1 + area;\n    *pdst_c0 = c0;\n    *pdst_c1 = c1;\n    *pdst_c2 = c2;\n    //    *pdst_c0 = 0.1;\n    //    *pdst_c1 = 0.1;\n    //    *pdst_c2 = 0.1;\n}\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream) {\n    int img_size = src_width * src_height * 3;\n    // copy data to pinned memory\n    memcpy(img_buffer_host, src, img_size);\n    // copy data to device memory\n    CUDA_CHECK(cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size, cudaMemcpyHostToDevice, stream));\n\n    AffineMatrix s2d, d2s;\n    float scale = std::min(dst_height / (float)src_height, dst_width / (float)src_width);\n\n    s2d.value[0] = scale;\n    s2d.value[1] = 0;\n    s2d.value[2] = -scale * src_width * 0.5 + dst_width * 0.5;\n    s2d.value[3] = 0;\n    s2d.value[4] = scale;\n    s2d.value[5] = -scale * src_height * 0.5 + dst_height * 0.5;\n    cv::Mat m2x3_s2d(2, 3, CV_32F, s2d.value);\n    cv::Mat m2x3_d2s(2, 3, CV_32F, d2s.value);\n    cv::invertAffineTransform(m2x3_s2d, m2x3_d2s);\n\n    memcpy(d2s.value, m2x3_d2s.ptr<float>(0), sizeof(d2s.value));\n\n    int jobs = dst_height * dst_width;\n    int threads = 256;\n    int blocks = ceil(jobs / (float)threads);\n    
warpaffine_kernel<<<blocks, threads, 0, stream>>>(img_buffer_device, src_width * 3, src_width, src_height, dst,\n                                                      dst_width, dst_height, 128, d2s, jobs);\n}\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream) {\n    int dst_size = dst_width * dst_height * 3;\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        cuda_preprocess(img_batch[i].ptr(), img_batch[i].cols, img_batch[i].rows, &dst[dst_size * i], dst_width,\n                        dst_height, stream);\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n    }\n}\n\nvoid cuda_preprocess_init(int max_image_size) {\n    // prepare input data in pinned memory\n    CUDA_CHECK(cudaMallocHost((void**)&img_buffer_host, max_image_size * 3));\n    // prepare input data in device memory\n    CUDA_CHECK(cudaMalloc((void**)&img_buffer_device, max_image_size * 3));\n}\n\nvoid cuda_preprocess_destroy() {\n    CUDA_CHECK(cudaFree(img_buffer_device));\n    CUDA_CHECK(cudaFreeHost(img_buffer_host));\n}\n"
  },
  {
    "path": "yolov10/yolov10_det.cpp",
    "content": "#include <fstream>\r\n#include <iostream>\r\n#include <opencv2/opencv.hpp>\r\n#include \"cuda_utils.h\"\r\n#include \"logging.h\"\r\n#include \"model.h\"\r\n#include \"postprocess.h\"\r\n#include \"preprocess.h\"\r\n#include \"utils.h\"\r\n\r\nLogger gLogger;\r\nusing namespace nvinfer1;\r\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\r\n\r\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, std::string& type, float& gd, float& gw,\r\n                      int& max_channels) {\r\n    IBuilder* builder = createInferBuilder(gLogger);\r\n    IBuilderConfig* config = builder->createBuilderConfig();\r\n    IHostMemory* serialized_engine = nullptr;\r\n\r\n    if (type == \"n\") {\r\n        serialized_engine = buildEngineYolov10DetN(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\r\n    } else if (type == \"s\") {\r\n        serialized_engine = buildEngineYolov10DetS(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\r\n    } else if (type == \"m\") {\r\n        serialized_engine = buildEngineYolov10DetM(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\r\n    } else if (type == \"b\" || type == \"l\") {\r\n        serialized_engine = buildEngineYolov10DetBL(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\r\n    } else if (type == \"x\") {\r\n        serialized_engine = buildEngineYolov10DetX(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\r\n    } else {\r\n        std::cerr << \"Unsupported type!\" << std::endl;\r\n        exit(0);\r\n    }\r\n\r\n    assert(serialized_engine);\r\n    std::ofstream p(engine_name, std::ios::binary);\r\n    if (!p) {\r\n        std::cout << \"could not open plan output file\" << std::endl;\r\n        assert(false);\r\n    }\r\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\r\n\r\n    delete serialized_engine;\r\n    
delete config;\r\n    delete builder;\r\n}\r\n\r\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\r\n                        IExecutionContext** context) {\r\n    std::ifstream file(engine_name, std::ios::binary);\r\n    if (!file.good()) {\r\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\r\n        assert(false);\r\n    }\r\n    size_t size = 0;\r\n    file.seekg(0, file.end);\r\n    size = file.tellg();\r\n    file.seekg(0, file.beg);\r\n    char* serialized_engine = new char[size];\r\n    assert(serialized_engine);\r\n    file.read(serialized_engine, size);\r\n    file.close();\r\n\r\n    *runtime = createInferRuntime(gLogger);\r\n    assert(*runtime);\r\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\r\n    assert(*engine);\r\n    *context = (*engine)->createExecutionContext();\r\n    assert(*context);\r\n    delete[] serialized_engine;\r\n}\r\n\r\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\r\n                    float** output_buffer_host) {\r\n    assert(engine->getNbBindings() == 2);\r\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\r\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\r\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\r\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\r\n    assert(inputIndex == 0);\r\n    assert(outputIndex == 1);\r\n    // Create GPU buffers on device\r\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\r\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\r\n\r\n    *output_buffer_host = new float[kBatchSize * kOutputSize];\r\n}\r\n\r\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int 
batchsize) {\r\n    // infer on the batch asynchronously, and DMA output back to host\r\n    auto start = std::chrono::system_clock::now();\r\n    context.enqueueV2(buffers, stream, nullptr);\r\n\r\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\r\n                               stream));\r\n    // Wait for the stream to drain before stopping the timer; otherwise only the enqueue cost is measured\r\n    CUDA_CHECK(cudaStreamSynchronize(stream));\r\n    auto end = std::chrono::system_clock::now();\r\n    std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\r\n              << \"ms\" << std::endl;\r\n}\r\n\r\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir, std::string& type,\r\n                float& gd, float& gw, int& max_channels) {\r\n    if (argc < 4)\r\n        return false;\r\n    if (std::string(argv[1]) == \"-s\" && (argc == 5 || argc == 7)) {\r\n        wts = std::string(argv[2]);\r\n        engine = std::string(argv[3]);\r\n        auto sub_type = std::string(argv[4]);\r\n\r\n        if (sub_type[0] == 'n') {\r\n            gd = 0.33;\r\n            gw = 0.25;\r\n            max_channels = 1024;\r\n            type = \"n\";\r\n        } else if (sub_type[0] == 's') {\r\n            gd = 0.33;\r\n            gw = 0.50;\r\n            max_channels = 1024;\r\n            type = \"s\";\r\n        } else if (sub_type[0] == 'm') {\r\n            gd = 0.67;\r\n            gw = 0.75;\r\n            max_channels = 768;\r\n            type = \"m\";\r\n        } else if (sub_type[0] == 'b') {\r\n            gd = 0.67;\r\n            gw = 1.0;\r\n            max_channels = 512;\r\n            type = \"b\";\r\n        } else if (sub_type[0] == 'l') {\r\n            gd = 1.0;\r\n            gw = 1.0;\r\n            max_channels = 512;\r\n            type = \"l\";\r\n        } else if (sub_type[0] == 'x') {\r\n            gd = 1.0;\r\n            gw = 1.25;\r\n            max_channels = 512;\r\n    
        type = \"x\";\r\n        } else {\r\n            return false;\r\n        }\r\n    } else if (std::string(argv[1]) == \"-d\" && argc == 4) {\r\n        engine = std::string(argv[2]);\r\n        img_dir = std::string(argv[3]);\r\n    } else {\r\n        return false;\r\n    }\r\n    return true;\r\n}\r\n\r\nint main(int argc, char** argv) {\r\n    // -s ../models/yolov10n.wts ../models/yolov10n.fp32.trt n\r\n    // -d ../models/yolov10n.fp32.trt ../images\r\n    cudaSetDevice(kGpuId);\r\n    std::string wts_name = \"\";\r\n    std::string engine_name = \"\";\r\n    std::string img_dir;\r\n    std::string type = \"\";\r\n    float gd = 0.0f, gw = 0.0f;\r\n    int max_channels = 0;\r\n\r\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, gd, gw, max_channels)) {\r\n        std::cerr << \"Arguments not right!\" << std::endl;\r\n        std::cerr << \"./yolov10_det -s [.wts] [.engine] [n/s/m/b/l/x]  // serialize model to \"\r\n                     \"plan file\"\r\n                  << std::endl;\r\n        std::cerr << \"./yolov10_det -d [.engine] ../samples  // deserialize plan file and run inference\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // Create a model using the API directly and serialize it to a file\r\n    if (!wts_name.empty()) {\r\n        serialize_engine(wts_name, engine_name, type, gd, gw, max_channels);\r\n        return 0;\r\n    }\r\n\r\n    // Deserialize the engine from file\r\n    IRuntime* runtime = nullptr;\r\n    ICudaEngine* engine = nullptr;\r\n    IExecutionContext* context = nullptr;\r\n    deserialize_engine(engine_name, &runtime, &engine, &context);\r\n    cudaStream_t stream;\r\n    CUDA_CHECK(cudaStreamCreate(&stream));\r\n    cuda_preprocess_init(kMaxInputImageSize);\r\n    // Prepare cpu and gpu buffers\r\n    float* device_buffers[2];\r\n    float* output_buffer_host = nullptr;\r\n\r\n    // Read images from directory\r\n    std::vector<std::string> file_names;\r\n    if 
(read_files_in_dir(img_dir.c_str(), file_names) < 0) {\r\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host);\r\n\r\n    // batch predict\r\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\r\n        // Get a batch of images\r\n        std::vector<cv::Mat> img_batch;\r\n        std::vector<std::string> img_name_batch;\r\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\r\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\r\n            if (img.empty()) {\r\n                std::cerr << \"Fatal error: image cannot open!\" << std::endl;\r\n                return -1;\r\n            }\r\n            img_batch.push_back(img);\r\n            img_name_batch.push_back(file_names[j]);\r\n        }\r\n        // Preprocess\r\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\r\n        // Run inference\r\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize);\r\n        // Save the first 100 values of output_buffer_host to a file\r\n        //        std::ofstream out_file(\"../output.txt\");\r\n        //        for (int i = 0; i < 100; i++) {\r\n        //            out_file << output_buffer_host[i] << std::endl;\r\n        //        }\r\n        //        out_file.close();\r\n\r\n        std::vector<std::vector<Detection>> res_batch;\r\n        batch_topk(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh);\r\n\r\n        // print results\r\n        for (size_t j = 0; j < res_batch.size(); j++) {\r\n            for (size_t k = 0; k < res_batch[j].size(); k++) {\r\n                std::cout << \"image: \" << img_name_batch[j] << \", bbox: \" << res_batch[j][k].bbox[0] << \", \"\r\n                          << res_batch[j][k].bbox[1] << \", \" << res_batch[j][k].bbox[2] << \", \"\r\n                      
    << res_batch[j][k].bbox[3] << \", conf: \" << res_batch[j][k].conf\r\n                          << \", class_id: \" << res_batch[j][k].class_id << std::endl;\r\n            }\r\n        }\r\n\r\n        // Draw bounding boxes\r\n        draw_bbox(img_batch, res_batch);\r\n        // Save images\r\n        for (size_t j = 0; j < img_batch.size(); j++) {\r\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\r\n        }\r\n    }\r\n\r\n    // Release stream and buffers\r\n    cudaStreamDestroy(stream);\r\n    CUDA_CHECK(cudaFree(device_buffers[0]));\r\n    CUDA_CHECK(cudaFree(device_buffers[1]));\r\n    cuda_preprocess_destroy();\r\n    // Destroy the engine\r\n    delete context;\r\n    delete engine;\r\n    delete runtime;\r\n\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "yolov10/yolov10_det_trt.py",
    "content": "# -*- coding: UTF-8 -*-\r\n\"\"\"\r\n  @Author: mpj\r\n  @Date  : 2024/7/24 下午7:11\r\n  @version V1.0\r\n\"\"\"\r\nimport ctypes\r\nimport os\r\nimport shutil\r\nimport random\r\nimport sys\r\nimport threading\r\nimport time\r\nimport cv2\r\nimport numpy as np\r\nimport pycuda.autoinit  # noqa: F401\r\nimport pycuda.driver as cuda\r\nimport tensorrt as trt\r\n\r\nCONF_THRESH = 0.5\r\nIOU_THRESHOLD = 0.4\r\nDET_NUM = 6\r\n\r\n\r\ndef get_img_path_batches(batch_size, img_dir):\r\n    ret = []\r\n    batch = []\r\n    for root, dirs, files in os.walk(img_dir):\r\n        for name in files:\r\n            if len(batch) == batch_size:\r\n                ret.append(batch)\r\n                batch = []\r\n            batch.append(os.path.join(root, name))\r\n    if len(batch) > 0:\r\n        ret.append(batch)\r\n    return ret\r\n\r\n\r\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\r\n    \"\"\"\r\n    description: Plots one bounding box on image img,\r\n                 this function comes from Yolov10 project.\r\n    param:\r\n        x:      a box likes [x1,y1,x2,y2]\r\n        img:    a opencv image object\r\n        color:  color to draw rectangle, such as (0,255,0)\r\n        label:  str\r\n        line_thickness: int\r\n    return:\r\n        no return\r\n\r\n    \"\"\"\r\n    tl = (\r\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\r\n    )  # line/font thickness\r\n    color = color or [random.randint(0, 255) for _ in range(3)]\r\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\r\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\r\n    if label:\r\n        tf = max(tl - 1, 1)  # font thickness\r\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\r\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\r\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\r\n        cv2.putText(\r\n            img,\r\n     
       label,\r\n            (c1[0], c1[1] - 2),\r\n            0,\r\n            tl / 3,\r\n            [225, 255, 255],\r\n            thickness=tf,\r\n            lineType=cv2.LINE_AA,\r\n            )\r\n\r\n\r\nclass Yolov10TRT(object):\r\n    \"\"\"\r\n    description: A Yolov10 class that warps TensorRT ops, preprocess and postprocess ops.\r\n    \"\"\"\r\n\r\n    def __init__(self, engine_file_path):\r\n        # Create a Context on this device,\r\n        self.ctx = cuda.Device(0).make_context()\r\n        stream = cuda.Stream()\r\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\r\n        runtime = trt.Runtime(TRT_LOGGER)\r\n\r\n        # Deserialize the engine from file\r\n        with open(engine_file_path, \"rb\") as f:\r\n            engine = runtime.deserialize_cuda_engine(f.read())\r\n        context = engine.create_execution_context()\r\n\r\n        host_inputs = []\r\n        cuda_inputs = []\r\n        host_outputs = []\r\n        cuda_outputs = []\r\n        bindings = []\r\n\r\n        for binding in engine:\r\n            print('bingding:', binding, engine.get_binding_shape(binding))\r\n            self.batch_size = engine.get_binding_shape(binding)[0]\r\n            if self.batch_size != 1:\r\n                raise ValueError(\"Only support batch_size=1\")\r\n            size = trt.volume(engine.get_binding_shape(binding))\r\n            dtype = engine.get_binding_dtype(binding)\r\n            dtype = trt.nptype(dtype)\r\n            # Allocate host and device buffers\r\n            host_mem = cuda.pagelocked_empty(size, dtype)\r\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\r\n            # Append the device buffer to device bindings.\r\n            bindings.append(int(cuda_mem))\r\n            # Append to the appropriate list.\r\n            if engine.binding_is_input(binding):\r\n                self.input_w = engine.get_binding_shape(binding)[-1]\r\n                self.input_h = engine.get_binding_shape(binding)[-2]\r\n        
        host_inputs.append(host_mem)\r\n                cuda_inputs.append(cuda_mem)\r\n            else:\r\n                host_outputs.append(host_mem)\r\n                cuda_outputs.append(cuda_mem)\r\n\r\n        # Store\r\n        self.stream = stream\r\n        self.context = context\r\n        self.engine = engine\r\n        self.host_inputs = host_inputs\r\n        self.cuda_inputs = cuda_inputs\r\n        self.host_outputs = host_outputs\r\n        self.cuda_outputs = cuda_outputs\r\n        self.bindings = bindings\r\n        print('batch_size:', self.batch_size)\r\n        self.det_output_length = host_outputs[0].shape[0]\r\n\r\n    def infer(self, raw_image_generator):\r\n        threading.Thread.__init__(self)\r\n        # Make self the active context, pushing it on top of the context stack.\r\n        self.ctx.push()\r\n        # Restore\r\n        stream = self.stream\r\n        context = self.context\r\n        host_inputs = self.host_inputs\r\n        cuda_inputs = self.cuda_inputs\r\n        host_outputs = self.host_outputs\r\n        cuda_outputs = self.cuda_outputs\r\n        bindings = self.bindings\r\n        # Do image preprocess\r\n        batch_image_raw = []\r\n        batch_origin_h = []\r\n        batch_origin_w = []\r\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\r\n        for i, image_raw in enumerate(raw_image_generator):\r\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\r\n            batch_image_raw.append(image_raw)\r\n            batch_origin_h.append(origin_h)\r\n            batch_origin_w.append(origin_w)\r\n            np.copyto(batch_input_image[i], input_image)\r\n        batch_input_image = np.ascontiguousarray(batch_input_image)\r\n\r\n        # Copy input image to host buffer\r\n        np.copyto(host_inputs[0], batch_input_image.ravel())\r\n        start = time.time()\r\n        # Transfer input data  to the GPU.\r\n        
cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\r\n        # Run inference.\r\n        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\r\n        # Transfer predictions back from the GPU.\r\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\r\n        # Synchronize the stream\r\n        stream.synchronize()\r\n        end = time.time()\r\n        # Remove any context from the top of the context stack, deactivating it.\r\n        self.ctx.pop()\r\n        # Here we use the first row of output in that batch_size = 1\r\n        output = host_outputs[0]\r\n        # Do postprocess\r\n        for i in range(self.batch_size):\r\n            result_boxes, result_scores, result_classid = self.post_process(\r\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\r\n                batch_origin_w[i]\r\n            )\r\n            # Draw rectangles and labels on the original image\r\n            for j in range(len(result_boxes)):\r\n                box = result_boxes[j]\r\n                plot_one_box(\r\n                    box,\r\n                    batch_image_raw[i],\r\n                    label=\"{}:{:.2f}\".format(\r\n                        categories[int(result_classid[j])], result_scores[j]\r\n                    ),\r\n                )\r\n        return batch_image_raw, end - start\r\n\r\n    def destroy(self):\r\n        # Remove any context from the top of the context stack, deactivating it.\r\n        self.ctx.pop()\r\n\r\n    def get_raw_image(self, image_path_batch):\r\n        \"\"\"\r\n        description: Read an image from image path\r\n        \"\"\"\r\n        for img_path in image_path_batch:\r\n            yield cv2.imread(img_path)\r\n\r\n    def get_raw_image_zeros(self, image_path_batch=None):\r\n        \"\"\"\r\n        description: Ready data for warmup\r\n        \"\"\"\r\n        for _ in range(self.batch_size):\r\n            
yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\r\n\r\n    def preprocess_image(self, raw_bgr_image):\r\n        \"\"\"\r\n        description: Convert BGR image to RGB,\r\n                     resize and pad it to target size, normalize to [0,1],\r\n                     transform to NCHW format.\r\n        param:\r\n            input_image_path: str, image path\r\n        return:\r\n            image:  the processed image\r\n            image_raw: the original image\r\n            h: original height\r\n            w: original width\r\n        \"\"\"\r\n        image_raw = raw_bgr_image\r\n        h, w, c = image_raw.shape\r\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\r\n        # Calculate widht and height and paddings\r\n        r_w = self.input_w / w\r\n        r_h = self.input_h / h\r\n        if r_h > r_w:\r\n            tw = self.input_w\r\n            th = int(r_w * h)\r\n            tx1 = tx2 = 0\r\n            ty1 = int((self.input_h - th) / 2)\r\n            ty2 = self.input_h - th - ty1\r\n        else:\r\n            tw = int(r_h * w)\r\n            th = self.input_h\r\n            tx1 = int((self.input_w - tw) / 2)\r\n            tx2 = self.input_w - tw - tx1\r\n            ty1 = ty2 = 0\r\n        # Resize the image with long side while maintaining ratio\r\n        image = cv2.resize(image, (tw, th))\r\n        # Pad the short side with (128,128,128)\r\n        image = cv2.copyMakeBorder(\r\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\r\n        )\r\n        image = image.astype(np.float32)\r\n        # Normalize to [0,1]\r\n        image /= 255.0\r\n        # HWC to CHW format:\r\n        image = np.transpose(image, [2, 0, 1])\r\n        # CHW to NCHW format\r\n        image = np.expand_dims(image, axis=0)\r\n        # Convert the image to row-major order, also known as \"C order\":\r\n        image = np.ascontiguousarray(image)\r\n        return image, image_raw, h, w\r\n\r\n    
def xywh2xyxy(self, origin_h, origin_w, x):\r\n        \"\"\"\r\n        description:    Map nx4 boxes from the letterboxed network input back to the original image.\r\n                        The boxes are already [x1, y1, x2, y2] (xy1=top-left, xy2=bottom-right);\r\n                        only the padding offset and the resize scale are undone.\r\n        param:\r\n            origin_h:   height of original image\r\n            origin_w:   width of original image\r\n            x:          A numpy array of boxes, each row is a box [x1, y1, x2, y2] in network input coordinates\r\n        return:\r\n            y:          A numpy array of boxes, each row is a box [x1, y1, x2, y2] in original image coordinates\r\n        \"\"\"\r\n        y = np.zeros_like(x)\r\n        r_w = self.input_w / origin_w\r\n        r_h = self.input_h / origin_h\r\n        if r_h > r_w:\r\n            y[:, 0] = x[:, 0]\r\n            y[:, 2] = x[:, 2]\r\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\r\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\r\n            y /= r_w\r\n        else:\r\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\r\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\r\n            y[:, 1] = x[:, 1]\r\n            y[:, 3] = x[:, 3]\r\n            y /= r_h\r\n\r\n        return y\r\n\r\n    def post_process(self, output, origin_h, origin_w):\r\n        \"\"\"\r\n        description: postprocess the prediction\r\n        param:\r\n            output:     A numpy array like [num_boxes, x1,y1,x2,y2,conf,cls_id, x1,y1,x2,y2,conf,cls_id, ...]\r\n            origin_h:   height of original image\r\n            origin_w:   width of original image\r\n        return:\r\n            result_boxes: final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\r\n            result_scores: final scores, a numpy array, each element is the score corresponding to a box\r\n            result_classid: final class ids, a numpy array, each element is the class id corresponding to a box\r\n        \"\"\"\r\n        num_values_per_detection = DET_NUM\r\n        # Get the num of boxes detected\r\n        num = int(output[0])\r\n        # Reshape to a two 
dimensional ndarray\r\n        # pred = np.reshape(output[1:], (-1, 38))[:num, :]\r\n        pred = np.reshape(output[1:], (-1, num_values_per_detection))[:num, :]\r\n        # Do nms\r\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\r\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\r\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\r\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\r\n        return result_boxes, result_scores, result_classid\r\n\r\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\r\n        \"\"\"\r\n        description: compute the IoU of two bounding boxes\r\n        param:\r\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\r\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\r\n            x1y1x2y2: select the coordinate format\r\n        return:\r\n            iou: computed iou\r\n        \"\"\"\r\n        if not x1y1x2y2:\r\n            # Transform from center and width to exact coordinates\r\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\r\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\r\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\r\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\r\n        else:\r\n            # Get the coordinates of bounding boxes\r\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\r\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\r\n\r\n        # Get the coordinates of the intersection rectangle\r\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\r\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\r\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\r\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\r\n   
     # Intersection area\r\n        inter_area = (np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None)\r\n                      * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None))\r\n        # Union Area\r\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\r\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\r\n\r\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\r\n\r\n        return iou\r\n\r\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\r\n        \"\"\"\r\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\r\n        Non-Maximum Suppression to further filter detections.\r\n        param:\r\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\r\n            origin_h: original image height\r\n            origin_w: original image width\r\n            conf_thres: a confidence threshold to filter detections\r\n            nms_thres: an IoU threshold to filter detections\r\n        return:\r\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\r\n        \"\"\"\r\n        # Keep the boxes with score >= conf_thres\r\n        boxes = prediction[prediction[:, 4] >= conf_thres]\r\n        # Map the boxes back to original-image coordinates\r\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\r\n        # clip the coordinates\r\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\r\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\r\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\r\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\r\n        # Object confidence\r\n        confs = boxes[:, 4]\r\n        # Sort by the confs\r\n        boxes = boxes[np.argsort(-confs)]\r\n        # Perform non-maximum suppression\r\n        keep_boxes = []\r\n        while boxes.shape[0]:\r\n            large_overlap = 
self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\r\n            label_match = boxes[0, -1] == boxes[:, -1]\r\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\r\n            invalid = large_overlap & label_match\r\n            keep_boxes += [boxes[0]]\r\n            boxes = boxes[~invalid]\r\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\r\n        return boxes\r\n\r\n\r\nclass inferThread(threading.Thread):\r\n    def __init__(self, yolov8_wrapper, image_path_batch):\r\n        threading.Thread.__init__(self)\r\n        self.yolov8_wrapper = yolov8_wrapper\r\n        self.image_path_batch = image_path_batch\r\n\r\n    def run(self):\r\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image(self.image_path_batch))\r\n        for i, img_path in enumerate(self.image_path_batch):\r\n            parent, filename = os.path.split(img_path)\r\n            save_name = os.path.join('output', filename)\r\n            # Save image\r\n            cv2.imwrite(save_name, batch_image_raw[i])\r\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\r\n\r\n\r\nclass warmUpThread(threading.Thread):\r\n    def __init__(self, yolov8_wrapper):\r\n        threading.Thread.__init__(self)\r\n        self.yolov8_wrapper = yolov8_wrapper\r\n\r\n    def run(self):\r\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image_zeros())\r\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    # load custom plugin and engine\r\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\r\n    engine_file_path = \"yolov8s.engine\"\r\n\r\n    if len(sys.argv) > 1:\r\n        engine_file_path = sys.argv[1]\r\n    if len(sys.argv) > 2:\r\n        PLUGIN_LIBRARY = sys.argv[2]\r\n\r\n    
ctypes.CDLL(PLUGIN_LIBRARY)\r\n\r\n    # load coco labels\r\n\r\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\r\n                  \"traffic light\",\r\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\r\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\r\n                  \"frisbee\",\r\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\r\n                  \"surfboard\",\r\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\r\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\r\n                  \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\r\n                  \"cell phone\",\r\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\r\n                  \"teddy bear\",\r\n                  \"hair drier\", \"toothbrush\"]\r\n\r\n    if os.path.exists('output/'):\r\n        shutil.rmtree('output/')\r\n    os.makedirs('output/')\r\n    # a Yolov10TRT instance\r\n    yolov8_wrapper = Yolov10TRT(engine_file_path)\r\n    try:\r\n        print('batch size is', yolov8_wrapper.batch_size)\r\n\r\n        image_dir = \"images/\"\r\n        image_path_batches = get_img_path_batches(yolov8_wrapper.batch_size, image_dir)\r\n\r\n        for i in range(10):\r\n            # create a new thread to do warm_up\r\n            thread1 = warmUpThread(yolov8_wrapper)\r\n            thread1.start()\r\n            thread1.join()\r\n        for batch in 
image_path_batches:\r\n            # create a new thread to do inference\r\n            thread1 = inferThread(yolov8_wrapper, batch)\r\n            thread1.start()\r\n            thread1.join()\r\n    finally:\r\n        # destroy the instance\r\n        yolov8_wrapper.destroy()\r\n"
  },
  {
    "path": "yolov12/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(yolov12)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nset(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)\nenable_language(CUDA)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\ninclude_directories(${PROJECT_SOURCE_DIR}/plugin)\n\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\nif(CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n  message(\"embed_platform on\")\n  include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n  link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n  message(\"embed_platform off\")\n\n  # cuda\n  include_directories(/usr/local/cuda/include)\n  link_directories(/usr/local/cuda/lib64)\n\n  # tensorrt\n  include_directories(/workspace/shared/TensorRT-8.6.1.6/include)\n  link_directories(/workspace/shared/TensorRT-8.6.1.6/lib)\nendif()\n\nadd_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/plugin/yololayer.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nfile(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/src/*.cpp ${PROJECT_SOURCE_DIR}/src/*.cu)\n\nadd_executable(yolo12_det ${PROJECT_SOURCE_DIR}/yolo12_det.cpp ${SRCS})\ntarget_link_libraries(yolo12_det nvinfer)\ntarget_link_libraries(yolo12_det cudart)\ntarget_link_libraries(yolo12_det myplugins)\ntarget_link_libraries(yolo12_det ${OpenCV_LIBS})\n"
  },
  {
    "path": "yolov12/gen_wts.py",
    "content": "import sys  # noqa: F401\nimport argparse\nimport os\nimport struct\nimport torch\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\n    parser.add_argument('-w', '--weights', required=True,\n                        help='Input weights (.pt) file path (required)')\n    parser.add_argument(\n        '-o', '--output', help='Output (.wts) file path (optional)')\n    parser.add_argument(\n        '-t', '--type', type=str, default='detect', choices=['detect', 'cls', 'seg', 'pose', 'obb'],\n        help='determines the model is detection/classification')\n    args = parser.parse_args()\n    if not os.path.isfile(args.weights):\n        raise SystemExit('Invalid input file')\n    if not args.output:\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\n    elif os.path.isdir(args.output):\n        args.output = os.path.join(\n            args.output,\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\n    return args.weights, args.output, args.type\n\n\npt_file, wts_file, m_type = parse_args()\n\nprint(f'Generating .wts for {m_type} model')\n\n# Load model\nprint(f'Loading {pt_file}')\n\n# Initialize\ndevice = 'cpu'\n\n# Load model\nmodel = torch.load(pt_file, map_location=device, weights_only=False)['model'].float()  # load to FP32\n\nif m_type in ['detect', 'seg', 'pose', 'obb']:\n    anchor_grid = model.model[-1].anchors * model.model[-1].stride[..., None, None]\n\n    delattr(model.model[-1], 'anchors')\n\nmodel.to(device).eval()\n\nwith open(wts_file, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolov12/include/block.h",
    "content": "#pragma once\n\n#include <map>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file);\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps);\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, std::vector<int> k, int s, std::string lname);\n\nnvinfer1::IElementWiseLayer* C2F(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                 int c2, int n, bool shortcut, float e, std::string lname);\n\nnvinfer1::IElementWiseLayer* C2(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                int c2, int n, bool shortcut, float e, std::string lname);\n\nnvinfer1::IElementWiseLayer* SPPF(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int k, std::string lname);\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname);\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> 
dets, const int* px_arry,\n                                       int px_arry_num, bool is_segmentation, bool is_pose, bool is_obb);\n\nnvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int n, bool c3k, bool shortcut, float e, std::string lname);\n\nnvinfer1::ILayer* C2PSA(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap,\n                        nvinfer1::ITensor& input, int c1, int c2, int n, float e, std::string lname);\n\nnvinfer1::ILayer* DWConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname);\n\nnvinfer1::ILayer* A2C2f(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int c1, int c2, int n, bool a2, int area, bool residual,\n                        float mlp_ratio, float e, int g, bool shortcut, std::string lname);\n\nnvinfer1::ILayer* ABlock(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int dim, int num_heads, float mlp_ratio, int area,\n                         std::string lname);\n\nnvinfer1::ILayer* AAttn(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int dim, int num_heads, float mlp_ratio, int area, std::string lname);\n"
  },
  {
    "path": "yolov12/include/config.h",
    "content": "#define USE_FP16\n// #define USE_FP32\n// #define USE_INT8\n\nconst static char* kInputTensorName = \"images\";\nconst static char* kOutputTensorName = \"output\";\nconst static char* kProtoTensorName = \"proto\";\nconst static int kNumClass = 80;\nconst static int kPoseNumClass = 1;\nconst static int kNumberOfPoints = 17;  // number of keypoints total\n// obb model's number of classes\nconstexpr static int kObbNumClass = 15;\nconst static int kObbNe = 1;  // number of extra parameters\nconst static int kBatchSize = 1;\nconst static int kGpuId = 0;\nconst static int kInputH = 640;\nconst static int kInputW = 640;\nconst static int kObbInputH = 1024;\nconst static int kObbInputW = 1024;\nconst static float kNmsThresh = 0.45f;\nconst static float kConfThresh = 0.5f;\nconst static float kConfThreshKeypoints = 0.5f;  // keypoints confidence\nconst static int kMaxInputImageSize = 3000 * 3000;\nconst static int kMaxNumOutputBbox = 1000;\n//Quantization input image folder path\nconst static char* kInputQuantizationFolder = \"./coco_calib\";\n\n// Classfication model's number of classes\nconstexpr static int kClsNumClass = 1000;\n// Classfication model's input shape\nconstexpr static int kClsInputH = 224;\nconstexpr static int kClsInputW = 224;\n"
  },
  {
    "path": "yolov12/include/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n"
  },
  {
    "path": "yolov12/include/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n 
   virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov12/include/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include \"NvInfer.h\"\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "yolov12/include/model.h",
    "content": "#pragma once\n\n#include <assert.h>\n#include <string>\n#include \"NvInfer.h\"\n\nnvinfer1::IHostMemory* buildEngineYolo12Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type);\n"
  },
  {
    "path": "yolov12/include/postprocess.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\n// Preprocessing functions\ncv::Rect get_rect(cv::Mat& img, float bbox[4]);\n\n// Processing functions\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch);\n\nvoid batch_process_obb(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                       int bbox_element, const std::vector<cv::Mat>& img_batch);\n\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count);\n\nvoid process_decode_ptr_host_obb(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element,\n                                 cv::Mat& img, int count);\n\n// NMS functions\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh = 0.5);\n\nvoid batch_nms(std::vector<std::vector<Detection>>& batch_res, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh = 0.5);\n\nvoid nms_obb(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh = 0.5);\n\nvoid batch_nms_obb(std::vector<std::vector<Detection>>& batch_res, float* output, int batch_size, int output_size,\n                   float conf_thresh, float nms_thresh = 0.5);\n\n// CUDA-related functions\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 cudaStream_t stream);\n\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream);\n\nvoid cuda_decode_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                     cudaStream_t stream);\n\nvoid cuda_nms_obb(float* parray, float 
nms_threshold, int max_objects, cudaStream_t stream);\n\n// Drawing functions\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_bbox_obb(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_bbox_keypoints_line(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks,\n                    std::unordered_map<int, std::string>& labels_map);\n"
  },
  {
    "path": "yolov12/include/preprocess.h",
    "content": "#pragma once\n\n#include <map>\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\nvoid cuda_preprocess_init(int max_image_size);\n\nvoid cuda_preprocess_destroy();\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream);\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream);\n"
  },
  {
    "path": "yolov12/include/types.h",
    "content": "#pragma once\n#include \"config.h\"\n\nstruct alignas(float) Detection {\n    //center_x center_y w h\n    float bbox[4];\n    float conf;  // bbox_conf * cls_conf\n    float class_id;\n    float mask[32];\n    float keypoints[kNumberOfPoints * 3];  // 17*3 keypoints\n    float angle;                           // obb angle\n};\n\nstruct AffineMatrix {\n    float value[6];\n};\n\nconst int bbox_element =\n        sizeof(AffineMatrix) / sizeof(float) + 1;  // left, top, right, bottom, confidence, class, keepflag\n"
  },
  {
    "path": "yolov12/include/utils.h",
    "content": "#pragma once\n#include <dirent.h>\n#include <fstream>\n#include <opencv2/opencv.hpp>\n\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols * 1.0);\n    float r_h = input_h / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR* p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            //            std::cout << \"Found file: \" << cur_file_name << std::endl;\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n// Function to trim leading and trailing whitespace from a string\nstatic inline std::string trim_leading_whitespace(const std::string& str) {\n    size_t first = str.find_first_not_of(' ');\n    if (std::string::npos == first) {\n        return str;\n    }\n    size_t last = str.find_last_not_of(' ');\n    return str.substr(first, (last - first + 1));\n}\n\n// Src: https://stackoverflow.com/questions/16605967\nstatic inline std::string 
to_string_with_precision(const float a_value, const int n = 2) {\n    std::ostringstream out;\n    out.precision(n);\n    out << std::fixed << a_value;\n    return out.str();\n}\n\nstatic inline int read_labels(const std::string& labels_filename, std::unordered_map<int, std::string>& labels_map) {\n    std::ifstream file(labels_filename);\n    // Report failure if the labels file cannot be opened (mirrors read_files_in_dir)\n    if (!file.is_open()) {\n        return -1;\n    }\n    // Read each line of the file\n    std::string line;\n    int index = 0;\n    while (std::getline(file, line)) {\n        // Strip the line of any leading or trailing whitespace\n        line = trim_leading_whitespace(line);\n\n        // Add the stripped line to the labels_map, using the loop index as the key\n        labels_map[index] = line;\n        index++;\n    }\n    // Close the file\n    file.close();\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov12/plugin/yololayer.cu",
    "content": "#include <assert.h>\n#include <math.h>\n#include <iostream>\n#include <vector>\n#include \"cuda_utils.h\"\n#include \"types.h\"\n#include \"yololayer.h\"\n\nnamespace Tn {\ntemplate <typename T>\nvoid write(char*& buffer, const T& val) {\n    *reinterpret_cast<T*>(buffer) = val;\n    buffer += sizeof(T);\n}\n\ntemplate <typename T>\nvoid read(const char*& buffer, T& val) {\n    val = *reinterpret_cast<const T*>(buffer);\n    buffer += sizeof(T);\n}\n}  // namespace Tn\n\n__device__ float sigmoid(float x) {\n    return 1.0f / (1.0f + exp(-x));\n}\n\nnamespace nvinfer1 {\nYoloLayerPlugin::YoloLayerPlugin(int classCount, int numberofpoints, float confthreshkeypoints, int netWidth,\n                                 int netHeight, int maxOut, bool is_segmentation, bool is_pose, bool is_obb,\n                                 const int* strides, int stridesLength) {\n\n    mClassCount = classCount;\n    mNumberofpoints = numberofpoints;\n    mConfthreshkeypoints = confthreshkeypoints;\n    mYoloV8NetWidth = netWidth;\n    mYoloV8netHeight = netHeight;\n    mMaxOutObject = maxOut;\n    mStridesLength = stridesLength;\n    mStrides = new int[stridesLength];\n    memcpy(mStrides, strides, stridesLength * sizeof(int));\n    is_segmentation_ = is_segmentation;\n    is_pose_ = is_pose;\n    is_obb_ = is_obb;\n}\n\nYoloLayerPlugin::~YoloLayerPlugin() {\n    if (mStrides != nullptr) {\n        delete[] mStrides;\n        mStrides = nullptr;\n    }\n}\n\nYoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length) {\n    using namespace Tn;\n    const char *d = reinterpret_cast<const char*>(data), *a = d;\n    read(d, mClassCount);\n    read(d, mNumberofpoints);\n    read(d, mConfthreshkeypoints);\n    read(d, mThreadCount);\n    read(d, mYoloV8NetWidth);\n    read(d, mYoloV8netHeight);\n    read(d, mMaxOutObject);\n    read(d, mStridesLength);\n    mStrides = new int[mStridesLength];\n    for (int i = 0; i < mStridesLength; ++i) {\n        read(d, 
mStrides[i]);\n    }\n    read(d, is_segmentation_);\n    read(d, is_pose_);\n    read(d, is_obb_);\n\n    assert(d == a + length);\n}\n\nvoid YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT {\n\n    using namespace Tn;\n    char *d = static_cast<char*>(buffer), *a = d;\n    write(d, mClassCount);\n    write(d, mNumberofpoints);\n    write(d, mConfthreshkeypoints);\n    write(d, mThreadCount);\n    write(d, mYoloV8NetWidth);\n    write(d, mYoloV8netHeight);\n    write(d, mMaxOutObject);\n    write(d, mStridesLength);\n    for (int i = 0; i < mStridesLength; ++i) {\n        write(d, mStrides[i]);\n    }\n    write(d, is_segmentation_);\n    write(d, is_pose_);\n    write(d, is_obb_);\n\n    assert(d == a + getSerializationSize());\n}\n\nsize_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT {\n    return sizeof(mClassCount) + sizeof(mNumberofpoints) + sizeof(mConfthreshkeypoints) + sizeof(mThreadCount) +\n           sizeof(mYoloV8netHeight) + sizeof(mYoloV8NetWidth) + sizeof(mMaxOutObject) + sizeof(mStridesLength) +\n           sizeof(int) * mStridesLength + sizeof(is_segmentation_) + sizeof(is_pose_) + sizeof(is_obb_);\n}\n\nint YoloLayerPlugin::initialize() TRT_NOEXCEPT {\n    return 0;\n}\n\nnvinfer1::Dims YoloLayerPlugin::getOutputDimensions(int index, const nvinfer1::Dims* inputs,\n                                                    int nbInputDims) TRT_NOEXCEPT {\n    int total_size = mMaxOutObject * sizeof(Detection) / sizeof(float);\n    return nvinfer1::Dims3(total_size + 1, 1, 1);\n}\n\nvoid YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT {\n    mPluginNamespace = pluginNamespace;\n}\n\nconst char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT {\n    return mPluginNamespace;\n}\n\nnvinfer1::DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes,\n                                                      int nbInputs) const TRT_NOEXCEPT {\n    return 
nvinfer1::DataType::kFLOAT;\n}\n\nbool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                                   int nbInputs) const TRT_NOEXCEPT {\n\n    return false;\n}\n\nbool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT {\n\n    return false;\n}\n\nvoid YoloLayerPlugin::configurePlugin(nvinfer1::PluginTensorDesc const* in, int nbInput,\n                                      nvinfer1::PluginTensorDesc const* out, int nbOutput) TRT_NOEXCEPT{};\n\nvoid YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                                      IGpuAllocator* gpuAllocator) TRT_NOEXCEPT{};\n\nvoid YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\nconst char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT {\n\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nvoid YoloLayerPlugin::destroy() TRT_NOEXCEPT {\n    delete this;\n}\n\nnvinfer1::IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT {\n\n    YoloLayerPlugin* p =\n            new YoloLayerPlugin(mClassCount, mNumberofpoints, mConfthreshkeypoints, mYoloV8NetWidth, mYoloV8netHeight,\n                                mMaxOutObject, is_segmentation_, is_pose_, is_obb_, mStrides, mStridesLength);\n    p->setPluginNamespace(mPluginNamespace);\n    return p;\n}\n\nint YoloLayerPlugin::enqueue(int batchSize, const void* TRT_CONST_ENQUEUE* inputs, void* const* outputs,\n                             void* workspace, cudaStream_t stream) TRT_NOEXCEPT {\n    forwardGpu((const float* const*)inputs, (float*)outputs[0], stream, mYoloV8netHeight, mYoloV8NetWidth, batchSize);\n    return 0;\n}\n\n__device__ float Logist(float data) {\n    return 1.0f / (1.0f + expf(-data));\n};\n\n__global__ void CalDetection(const float* input, float* output, int numElements, int maxoutobject, const int 
grid_h,\n                             int grid_w, const int stride, int classes, int nk, float confkeypoints, int outputElem,\n                             bool is_segmentation, bool is_pose, bool is_obb) {\n    int idx = threadIdx.x + blockDim.x * blockIdx.x;\n    if (idx >= numElements)\n        return;\n\n    const int N_kpts = nk;\n    int total_grid = grid_h * grid_w;\n    int info_len = 4 + classes + (is_segmentation ? 32 : 0) + (is_pose ? N_kpts * 3 : 0) + (is_obb ? 1 : 0);\n    int batchIdx = idx / total_grid;\n    int elemIdx = idx % total_grid;\n    const float* curInput = input + batchIdx * total_grid * info_len;\n    int outputIdx = batchIdx * outputElem;\n\n    int class_id = 0;\n    float max_cls_prob = 0.0;\n    for (int i = 4; i < 4 + classes; i++) {\n        float p = Logist(curInput[elemIdx + i * total_grid]);\n        if (p > max_cls_prob) {\n            max_cls_prob = p;\n            class_id = i - 4;\n        }\n    }\n\n    if (max_cls_prob < 0.1)\n        return;\n\n    int count = (int)atomicAdd(output + outputIdx, 1);\n    if (count >= maxoutobject)\n        return;\n    char* data = (char*)(output + outputIdx) + sizeof(float) + count * sizeof(Detection);\n    Detection* det = (Detection*)(data);\n\n    int row = elemIdx / grid_w;\n    int col = elemIdx % grid_w;\n\n    det->conf = max_cls_prob;\n    det->class_id = class_id;\n    det->bbox[0] = (col + 0.5f - curInput[elemIdx + 0 * total_grid]) * stride;\n    det->bbox[1] = (row + 0.5f - curInput[elemIdx + 1 * total_grid]) * stride;\n    det->bbox[2] = (col + 0.5f + curInput[elemIdx + 2 * total_grid]) * stride;\n    det->bbox[3] = (row + 0.5f + curInput[elemIdx + 3 * total_grid]) * stride;\n\n    if (is_segmentation) {\n        for (int k = 0; k < 32; ++k) {\n            det->mask[k] =\n                    curInput[elemIdx + (4 + classes + (is_pose ? N_kpts * 3 : 0) + (is_obb ? 
1 : 0) + k) * total_grid];\n        }\n    }\n\n    if (is_pose) {\n        for (int kpt = 0; kpt < N_kpts; kpt++) {\n            int kpt_x_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3) * total_grid;\n            int kpt_y_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3 + 1) * total_grid;\n            int kpt_conf_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3 + 2) * total_grid;\n\n            float kpt_confidence = sigmoid(curInput[elemIdx + kpt_conf_idx]);\n\n            float kpt_x = (curInput[elemIdx + kpt_x_idx] * 2.0 + col) * stride;\n            float kpt_y = (curInput[elemIdx + kpt_y_idx] * 2.0 + row) * stride;\n\n            bool is_within_bbox =\n                    kpt_x >= det->bbox[0] && kpt_x <= det->bbox[2] && kpt_y >= det->bbox[1] && kpt_y <= det->bbox[3];\n\n            if (kpt_confidence < confkeypoints || !is_within_bbox) {\n                det->keypoints[kpt * 3] = -1;\n                det->keypoints[kpt * 3 + 1] = -1;\n                det->keypoints[kpt * 3 + 2] = -1;\n            } else {\n                det->keypoints[kpt * 3] = kpt_x;\n                det->keypoints[kpt * 3 + 1] = kpt_y;\n                det->keypoints[kpt * 3 + 2] = kpt_confidence;\n            }\n        }\n    }\n\n    if (is_obb) {\n        double pi = CV_PI;\n        auto angle_inx = curInput[elemIdx + (4 + classes + (is_segmentation ? 32 : 0) + (is_pose ? 
N_kpts * 3 : 0) +\n                                             0) * total_grid];\n        auto angle = (sigmoid(angle_inx) - 0.25f) * pi;\n\n        auto cos1 = cos(angle);\n        auto sin1 = sin(angle);\n        auto xf = (curInput[elemIdx + 2 * total_grid] - curInput[elemIdx + 0 * total_grid]) / 2;\n        auto yf = (curInput[elemIdx + 3 * total_grid] - curInput[elemIdx + 1 * total_grid]) / 2;\n\n        auto x = xf * cos1 - yf * sin1;\n        auto y = xf * sin1 + yf * cos1;\n\n        float cx = (col + 0.5f + x) * stride;\n        float cy = (row + 0.5f + y) * stride;\n\n        float w1 = (curInput[elemIdx + 0 * total_grid] + curInput[elemIdx + 2 * total_grid]) * stride;\n        float h1 = (curInput[elemIdx + 1 * total_grid] + curInput[elemIdx + 3 * total_grid]) * stride;\n        det->bbox[0] = cx;\n        det->bbox[1] = cy;\n        det->bbox[2] = w1;\n        det->bbox[3] = h1;\n        det->angle = angle;\n    }\n}\n\nvoid YoloLayerPlugin::forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV8netHeight,\n                                 int mYoloV8NetWidth, int batchSize) {\n    int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n    cudaMemsetAsync(output, 0, sizeof(float), stream);\n    for (int idx = 0; idx < batchSize; ++idx) {\n        CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n    }\n    int numElem = 0;\n\n    //    const int maxGrids = mStridesLength;\n    //    int grids[maxGrids][2];\n    //    for (int i = 0; i < maxGrids; ++i) {\n    //        grids[i][0] = mYoloV8netHeight / mStrides[i];\n    //        grids[i][1] = mYoloV8NetWidth / mStrides[i];\n    //    }\n\n    int maxGrids = mStridesLength;\n    int flatGridsLen = 2 * maxGrids;\n    int* flatGrids = new int[flatGridsLen];\n\n    for (int i = 0; i < maxGrids; ++i) {\n        flatGrids[2 * i] = mYoloV8netHeight / mStrides[i];\n        flatGrids[2 * i + 1] = mYoloV8NetWidth / mStrides[i];\n    
}\n\n    for (unsigned int i = 0; i < maxGrids; i++) {\n        // Access the elements of the original 2D array from the flattened 1D array\n        int grid_h = flatGrids[2 * i];      // Corresponds to the access of grids[i][0]\n        int grid_w = flatGrids[2 * i + 1];  // Corresponds to the access of grids[i][1]\n        int stride = mStrides[i];\n        numElem = grid_h * grid_w * batchSize;  // Calculate the total number of elements\n        if (numElem < mThreadCount)             // Adjust the thread count if needed\n            mThreadCount = numElem;\n\n        // The CUDA kernel call remains unchanged\n        CalDetection<<<(numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>(\n                inputs[i], output, numElem, mMaxOutObject, grid_h, grid_w, stride, mClassCount, mNumberofpoints,\n                mConfthreshkeypoints, outputElem, is_segmentation_, is_pose_, is_obb_);\n    }\n\n    delete[] flatGrids;\n}\n\nPluginFieldCollection YoloPluginCreator::mFC{};\nstd::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\nYoloPluginCreator::YoloPluginCreator() {\n    mPluginAttributes.clear();\n    mFC.nbFields = mPluginAttributes.size();\n    mFC.fields = mPluginAttributes.data();\n}\n\nconst char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT {\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nconst PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT {\n    return &mFC;\n}\n\nIPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT {\n    assert(fc->nbFields == 1);\n    assert(strcmp(fc->fields[0].name, \"combinedInfo\") == 0);\n    const int* combinedInfo = static_cast<const int*>(fc->fields[0].data);\n    int netinfo_count = 9;\n    int class_count = combinedInfo[0];\n    int numberofpoints = combinedInfo[1];\n    float confthreshkeypoints = combinedInfo[2];\n    
int input_w = combinedInfo[3];\n    int input_h = combinedInfo[4];\n    int max_output_object_count = combinedInfo[5];\n    bool is_segmentation = combinedInfo[6];\n    bool is_pose = combinedInfo[7];\n    bool is_obb = combinedInfo[8];\n    const int* px_arry = combinedInfo + netinfo_count;\n    int px_arry_length = fc->fields[0].length - netinfo_count;\n    YoloLayerPlugin* obj =\n            new YoloLayerPlugin(class_count, numberofpoints, confthreshkeypoints, input_w, input_h,\n                                max_output_object_count, is_segmentation, is_pose, is_obb, px_arry, px_arry_length);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\nIPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData,\n                                                     size_t serialLength) TRT_NOEXCEPT {\n    // This object will be deleted when the network is destroyed, which will\n    // call YoloLayerPlugin::destroy()\n    YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolov12/plugin/yololayer.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\nnamespace nvinfer1 {\nclass API YoloLayerPlugin : public IPluginV2IOExt {\n   public:\n    YoloLayerPlugin(int classCount, int numberofpoints, float confthreshkeypoints, int netWidth, int netHeight,\n                    int maxOut, bool is_segmentation, bool is_pose, bool is_obb, const int* strides, int stridesLength);\n\n    YoloLayerPlugin(const void* data, size_t length);\n\n    ~YoloLayerPlugin();\n\n    int getNbOutputs() const TRT_NOEXCEPT override { return 1; }\n\n    nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n    int initialize() TRT_NOEXCEPT override;\n\n    virtual void terminate() TRT_NOEXCEPT override {}\n\n    virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n    virtual int enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace,\n                        cudaStream_t stream) TRT_NOEXCEPT override;\n\n    virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n    virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n    bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs,\n                                   int nbOutputs) const TRT_NOEXCEPT override {\n        return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n    }\n\n    const char* getPluginType() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    void destroy() TRT_NOEXCEPT override;\n\n    IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n    nvinfer1::DataType 
getOutputDataType(int32_t index, nvinfer1::DataType const* inputTypes,\n                                         int32_t nbInputs) const TRT_NOEXCEPT;\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                      int nbInputs) const TRT_NOEXCEPT override;\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n    void attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                         IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n    void configurePlugin(PluginTensorDesc const* in, int32_t nbInput, PluginTensorDesc const* out,\n                         int32_t nbOutput) TRT_NOEXCEPT override;\n\n    void detachFromContext() TRT_NOEXCEPT override;\n\n   private:\n    void forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV8netHeight,\n                    int mYoloV8NetWidth, int batchSize);\n\n    int mThreadCount = 256;\n    const char* mPluginNamespace;\n    int mClassCount;\n    int mNumberofpoints;\n    float mConfthreshkeypoints;\n    int mYoloV8NetWidth;\n    int mYoloV8netHeight;\n    int mMaxOutObject;\n    bool is_segmentation_;\n    bool is_pose_;\n    bool is_obb_;\n    int* mStrides;\n    int mStridesLength;\n};\n\nclass API YoloPluginCreator : public IPluginCreator {\n   public:\n    YoloPluginCreator();\n\n    ~YoloPluginCreator() override = default;\n\n    const char* getPluginName() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    const nvinfer1::PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* createPlugin(const char* name,\n                                           const nvinfer1::PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData,\n                                                size_t 
serialLength) TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override { mNamespace = libNamespace; }\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override { return mNamespace.c_str(); }\n\n   private:\n    std::string mNamespace;\n    static PluginFieldCollection mFC;\n    static std::vector<PluginField> mPluginAttributes;\n};\n\nREGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolov12/readme.md",
    "content": "## Introduction\n\nThe YOLO12 model supports TensorRT 8.\n\nTraining code [link](https://github.com/ultralytics/ultralytics/archive/refs/tags/v8.3.38.zip)\n\n## Environment\n\n* cuda 11.8\n* cudnn 8.9.1.23\n* tensorrt 8.6.1.6\n* opencv 4.8.0\n* ultralytics 8.3.0\n\n## Support\n\n* [x] YOLO12-det supports FP32/FP16 and the C++ API\n\n\n## Config\n\n* Choose the YOLO12 sub-model n/s/m/l/x from command line arguments.\n* For other configs, see [src/config.h](src/config.h)\n\n## Build and Run\n\n1. Generate the .wts file from the PyTorch .pt checkpoint, or download the .wts from the model zoo\n\n```shell\n# Download ultralytics\nwget https://github.com/ultralytics/ultralytics/archive/refs/tags/v8.3.119.zip -O ultralytics-8.3.119.zip\n# Unzip ultralytics\nunzip ultralytics-8.3.119.zip\ncd ultralytics-8.3.119\n# Download models\nwget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo12n.pt -O yolo12n.pt # to download other models, replace 'yolo12n.pt' with 'yolo12s.pt', 'yolo12m.pt', 'yolo12l.pt' or 'yolo12x.pt'\n# Generate .wts\ncp [PATH-TO-TENSORRTX]/yolov12/gen_wts.py .\npython gen_wts.py -w yolo12n.pt -o yolo12n.wts -t detect\n# A file 'yolo12n.wts' will be generated.\n```\n\n2. Build tensorrtx/yolov12 and run\n```shell\ncd [PATH-TO-TENSORRTX]/yolov12\nmkdir build\ncd build\ncmake ..\nmake\n```\n\n### Detection\n```shell\ncp [PATH-TO-ultralytics]/yolo12n.wts .\n# Build and serialize TensorRT engine\n./yolo12_det -s yolo12n.wts yolo12n.engine [n/s/m/l/x]\n# Run inference\n./yolo12_det -d yolo12n.engine ../images [c/g]\n# Results are saved in the build directory\n```\n\n## More Information\nSee the readme on the [home page](https://github.com/wang-xinyu/tensorrtx).\n"
  },
  {
    "path": "yolov12/src/block.cpp",
    "content": "#include \"block.h\"\n#include <assert.h>\n#include <math.h>\n#include <fstream>\n#include <iostream>\n#include \"config.h\"\n#include \"model.h\"\n#include \"yololayer.h\"\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, nvinfer1::Weights> WeightMap;\n\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = nvinfer1::DataType::kFLOAT;\n\n        // Allocate one 32-bit word per weight (sizeof(val) would be the pointer size, not the element size).\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; x++) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        wt.count = size;\n        WeightMap[name] = wt;\n    }\n    return WeightMap;\n}\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights 
scale{nvinfer1::DataType::kFLOAT, scval, len};\n\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, shval, len};\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, pval, len};\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    nvinfer1::IScaleLayer* output = network->addScale(input, nvinfer1::ScaleMode::kCHANNEL, shift, scale, power);\n    assert(output);\n    return output;\n}\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, std::vector<int> k, int s, std::string lname) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    // auto pad\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    
assert(ew);\n    return ew;\n}\n\nstatic nvinfer1::ILayer* bottleneck(nvinfer1::INetworkDefinition* network,\n                                    std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int c1, int c2, bool shortcut, std::vector<int> k1, std::vector<int> k2, float e,\n                                    std::string lname) {\n    int c_ = (int)((float)c2 * e);\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c_, k1, 1, lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *conv1->getOutput(0), c2, k2, 1, lname + \".cv2\");\n\n    if (shortcut && c1 == c2) {\n        nvinfer1::IElementWiseLayer* ew =\n                network->addElementWise(input, *conv2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        return ew;\n    }\n    return conv2;\n}\n\nnvinfer1::IElementWiseLayer* SPPF(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int k, std::string lname) {\n    int c_ = c1 / 2;\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv1\");\n    nvinfer1::IPoolingLayer* pool1 =\n            network->addPoolingNd(*conv1->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool1->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool1->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::IPoolingLayer* pool2 =\n            network->addPoolingNd(*pool1->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool2->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::IPoolingLayer* pool3 =\n            network->addPoolingNd(*pool2->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    
pool3->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool3->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::ITensor* inputTensors[] = {conv1->getOutput(0), pool1->getOutput(0), pool2->getOutput(0),\n                                         pool3->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensors, 4);\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n    return conv2;\n}\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname) {\n\n    nvinfer1::IShuffleLayer* shuffle1 = network->addShuffle(input);\n    shuffle1->setReshapeDimensions(nvinfer1::Dims4{kBatchSize, 4, 16, grid});\n    shuffle1->setSecondTranspose(nvinfer1::Permutation{0, 2, 1, 3});\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*shuffle1->getOutput(0));\n    softmax->setAxes(1 << 1);\n\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(*softmax->getOutput(0), 1, nvinfer1::DimsHW{1, 1}, weightMap[lname], bias_empty);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n\n    nvinfer1::IShuffleLayer* shuffle2 = network->addShuffle(*conv->getOutput(0));\n    shuffle2->setReshapeDimensions(nvinfer1::Dims3{kBatchSize, 4, grid});\n\n    return shuffle2;\n}\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> dets, const int* px_arry,\n                                       int px_arry_num, bool is_segmentation, bool is_pose, bool is_obb) {\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", 
\"1\");\n    const int netinfo_count = 9;  // The first 9 elements hold the netinfo fields filled in below.\n    const int total_count = netinfo_count + px_arry_num;  // Total number of elements for netinfo and px_arry combined.\n\n    std::vector<int> combinedInfo(total_count);\n    int class_num = kNumClass;\n    if (is_pose)\n        class_num = kPoseNumClass;\n    else if (is_obb)\n        class_num = kObbNumClass;\n    int input_w = kInputW;\n    if (is_obb)\n        input_w = kObbInputW;\n    int input_h = kInputH;\n    if (is_obb)\n        input_h = kObbInputH;\n    // Fill in the first 9 elements with the netinfo fields.\n    combinedInfo[0] = class_num;\n    combinedInfo[1] = kNumberOfPoints;\n    combinedInfo[2] = kConfThreshKeypoints;\n    combinedInfo[3] = input_w;\n    combinedInfo[4] = input_h;\n    combinedInfo[5] = kMaxNumOutputBbox;\n    combinedInfo[6] = is_segmentation;\n    combinedInfo[7] = is_pose;\n    combinedInfo[8] = is_obb;\n\n    // Copy the contents of px_arry into the combinedInfo vector after the initial 9 netinfo elements.\n    std::copy(px_arry, px_arry + px_arry_num, combinedInfo.begin() + netinfo_count);\n\n    // Now let's create the PluginField object to hold this combined information.\n    nvinfer1::PluginField pluginField;\n    pluginField.name = \"combinedInfo\";  // This can be any name that the plugin will recognize\n    pluginField.data = combinedInfo.data();\n    pluginField.type = nvinfer1::PluginFieldType::kINT32;\n    pluginField.length = combinedInfo.size();\n\n    // Create the PluginFieldCollection to hold the PluginField object.\n    nvinfer1::PluginFieldCollection pluginFieldCollection;\n    pluginFieldCollection.nbFields = 1;  // We have just one field, but it's a combined array\n    pluginFieldCollection.fields = &pluginField;\n\n    // Create the plugin object using the PluginFieldCollection.\n    nvinfer1::IPluginV2* pluginObject = creator->createPlugin(\"yololayer\", &pluginFieldCollection);\n\n    // We assume 
that the plugin is to be added onto the network.\n    // Prepare input tensors for the YOLO Layer.\n    std::vector<nvinfer1::ITensor*> inputTensors;\n    for (auto det : dets) {\n        inputTensors.push_back(det->getOutput(0));  // Assuming each IConcatenationLayer has one output tensor.\n    }\n\n    // Add the plugin to the network using the prepared input tensors.\n    nvinfer1::IPluginV2Layer* yoloLayer = network->addPluginV2(inputTensors.data(), inputTensors.size(), *pluginObject);\n\n    return yoloLayer;  // Return the added YOLO layer.\n}\n\nstatic nvinfer1::ILayer* C3k(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int c1, int c2, int n, bool shortcut, std::vector<int> k1,\n                             std::vector<int> k2, float e, std::string lname) {\n    int c_ = (int)((float)c2 * e);\n    auto cv1 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv1\");\n    auto cv2 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv2\");\n    nvinfer1::ITensor* y1 = cv1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        auto b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, k1, k2, 1.0, lname + \".m.\" + std::to_string(i));\n        y1 = b->getOutput(0);\n    }\n\n    nvinfer1::ITensor* inputTensors[] = {y1, cv2->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 2);\n\n    auto cv3 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv3\");\n    return cv3;\n}\n\nnvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int n, bool c3k, bool shortcut, float e, std::string lname) {\n    int c_ = (float)c2 * e;\n\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, 2 * 
c_, {1, 1}, 1, lname + \".cv1\");\n    nvinfer1::Dims d = conv1->getOutput(0)->getDimensions();\n\n    nvinfer1::ISliceLayer* split1 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n                              nvinfer1::Dims4{d.d[0], d.d[1] / 2, d.d[2], d.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* split2 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims4{0, d.d[1] / 2, 0, 0},\n                              nvinfer1::Dims4{d.d[0], d.d[1] / 2, d.d[2], d.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ITensor* inputTensor0[] = {split1->getOutput(0), split2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor0, 2);\n    nvinfer1::ITensor* y1 = split2->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        nvinfer1::ILayer* b;\n        if (c3k) {\n            b = C3k(network, weightMap, *y1, c_, c_, 2, shortcut, {3, 3}, {3, 3}, 0.5,\n                    lname + \".m.\" + std::to_string(i));\n        } else {\n            b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, {3, 3}, {3, 3}, 0.5,\n                           lname + \".m.\" + std::to_string(i));\n        }\n        y1 = b->getOutput(0);\n\n        nvinfer1::ITensor* inputTensors[] = {cat->getOutput(0), b->getOutput(0)};\n        cat = network->addConcatenation(inputTensors, 2);\n    }\n\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n\n    return conv2;\n}\n\nstatic nvinfer1::ILayer* convBn(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int ch,\n                                int k, int s, std::string lname, int g = 1) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv;\n    if (lname.find(\".pe\") != 
std::string::npos) {\n        nvinfer1::Weights conv_bias = weightMap[lname + \".conv.bias\"];\n        conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"],\n                                         conv_bias);\n        assert(conv);\n        conv->setStrideNd(nvinfer1::DimsHW{s, s});\n        int p = k / 2;\n        conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n        conv->setNbGroups(g);\n        conv->setName((lname + \".conv\").c_str());\n\n        nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n        bn->setName((lname + \".bn\").c_str());\n        return bn;\n\n    } else {\n        conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"],\n                                         bias_empty);\n        assert(conv);\n        conv->setStrideNd(nvinfer1::DimsHW{s, s});\n        int p = k / 2;\n        conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n        conv->setNbGroups(g);\n        conv->setName((lname + \".conv\").c_str());\n\n        nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n        bn->setName((lname + \".bn\").c_str());\n        return bn;\n    }\n}\n\nstatic nvinfer1::ILayer* Attention(nvinfer1::INetworkDefinition* network,\n                                   std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                   int dim, int num_heads, float attn_ratio, std::string lname) {\n    int head_dim = dim / num_heads;\n    int key_dim = head_dim * attn_ratio;\n    float scale = pow(key_dim, -0.5);\n    int nh_kd = key_dim * num_heads;\n    int h = dim + nh_kd * 2;\n\n    auto d = input.getDimensions();\n    int B = d.d[0];\n    int H = d.d[2];\n    int W = d.d[3];\n    int N = H * W;\n    auto* qkv = convBn(network, weightMap, input, h, 1, 1, lname + \".qkv\");\n    // 
qkv.view(B, self.num_heads, -1, N)\n    auto shuffle = network->addShuffle(*qkv->getOutput(0));\n    shuffle->setReshapeDimensions(nvinfer1::Dims4{B, num_heads, -1, N});\n    // q, k, v = .split([self.key_dim, self.key_dim, self.head_dim], dim=2)\n    auto d1 = shuffle->getOutput(0)->getDimensions();\n    auto q = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], key_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    auto k = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, key_dim, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], key_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    auto v = network->addSlice(*shuffle->getOutput(0), nvinfer1::Dims4{0, 0, key_dim * 2, 0},\n                               nvinfer1::Dims4{d1.d[0], d1.d[1], head_dim, d1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    // attn = ((q.transpose(-2, -1) @ k) * self.scale)\n    auto qT = network->addShuffle(*q->getOutput(0));\n    qT->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n    auto matmul = network->addMatrixMultiply(*qT->getOutput(0), nvinfer1::MatrixOperation::kNONE, *k->getOutput(0),\n                                             nvinfer1::MatrixOperation::kNONE);\n    // There are not many memory leaks, and I will change it when I have time\n    float* scale_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    scale_val[0] = scale;\n    nvinfer1::Weights s_w{nvinfer1::DataType::kFLOAT, scale_val, 1};\n    float* shift_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    shift_val[0] = 0;\n    nvinfer1::Weights sh_w{nvinfer1::DataType::kFLOAT, shift_val, 1};\n    float* power_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    power_val[0] = 1;\n    nvinfer1::Weights p_w{nvinfer1::DataType::kFLOAT, power_val, 1};\n    nvinfer1::IScaleLayer* scaleLayer =\n            network->addScale(*matmul->getOutput(0), 
nvinfer1::ScaleMode::kUNIFORM, sh_w, s_w, p_w);\n    // attn = attn.softmax(dim=-1)\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*scaleLayer->getOutput(0));\n    softmax->setAxes(1 << 3);\n    // x = (v @ attn.transpose(-2, -1)).view(B, -1, H, W) + self.pe(v.reshape(B, -1, H, W))\n    auto attnT = network->addShuffle(*softmax->getOutput(0));\n    attnT->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n    auto matmul2 = network->addMatrixMultiply(*v->getOutput(0), nvinfer1::MatrixOperation::kNONE, *attnT->getOutput(0),\n                                              nvinfer1::MatrixOperation::kNONE);\n    auto reshape = network->addShuffle(*matmul2->getOutput(0));\n    reshape->setReshapeDimensions(nvinfer1::Dims4{B, -1, H, W});\n    auto v_reshape = network->addShuffle(*v->getOutput(0));\n    v_reshape->setReshapeDimensions(nvinfer1::Dims4{B, -1, H, W});\n    // self.pe = Conv(dim, dim, 3, 1, g=dim, act=False)\n    auto pe = convBn(network, weightMap, *v_reshape->getOutput(0), dim, 3, 1, lname + \".pe\", dim);\n    auto sum = network->addElementWise(*reshape->getOutput(0), *pe->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    // x = self.proj(x)\n    // self.proj = Conv(dim, dim, 1, act=False)\n    auto proj = convBn(network, weightMap, *sum->getOutput(0), dim, 1, 1, lname + \".proj\");\n    return proj;\n}\n\nstatic nvinfer1::ILayer* PSABlock(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int dim,\n                                  float attn_ratio, int num_heads, bool shortcut, std::string lname) {\n    // x = x + self.attn(x) if self.add else self.attn(x)\n    auto attn = Attention(network, weightMap, input, dim, num_heads, attn_ratio, lname + \".attn\");\n    nvinfer1::ILayer* shortcut_layer = nullptr;\n    if (shortcut) {\n        shortcut_layer = network->addElementWise(input, *attn->getOutput(0), 
nvinfer1::ElementWiseOperation::kSUM);\n    } else {\n        shortcut_layer = attn;\n    }\n    // self.ffn = nn.Sequential(Conv(c, c * 2, 1), Conv(c * 2, c, 1, act=False))\n    // x = x + self.ffn(x) if self.add else self.ffn(x)\n    auto ffn0 = convBnSiLU(network, weightMap, *shortcut_layer->getOutput(0), dim * 2, {1, 1}, 1, lname + \".ffn.0\");\n    auto ffn1 = convBn(network, weightMap, *ffn0->getOutput(0), dim, 1, 1, lname + \".ffn.1\");\n    if (shortcut) {\n        return network->addElementWise(*shortcut_layer->getOutput(0), *ffn1->getOutput(0),\n                                       nvinfer1::ElementWiseOperation::kSUM);\n    } else {\n        return ffn1;\n    }\n}\n\nnvinfer1::ILayer* C2PSA(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap,\n                        nvinfer1::ITensor& input, int c1, int c2, int n, float e, std::string lname) {\n    assert(network != nullptr);\n    int c = c1 * e;\n\n    // cv1 branch\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, 2 * c, {1, 1}, 1, lname + \".cv1\");\n    nvinfer1::ITensor* cv1_out = conv1->getOutput(0);\n\n    // Split the output of cv1 into two tensors\n    nvinfer1::Dims dims = cv1_out->getDimensions();\n    nvinfer1::ISliceLayer* split1 = network->addSlice(*cv1_out, nvinfer1::Dims4{0, 0, 0, 0},\n                                                      nvinfer1::Dims4{dims.d[0], dims.d[1] / 2, dims.d[2], dims.d[3]},\n                                                      nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* split2 = network->addSlice(*cv1_out, nvinfer1::Dims4{0, dims.d[1] / 2, 0, 0},\n                                                      nvinfer1::Dims4{dims.d[0], dims.d[1] / 2, dims.d[2], dims.d[3]},\n                                                      nvinfer1::Dims4{1, 1, 1, 1});\n\n    // Create y1 bottleneck sequence\n    nvinfer1::ITensor* y = split2->getOutput(0);\n    for (int i = 0; i < n; ++i) {\n    
    auto* bottleneck_layer =\n                PSABlock(network, weightMap, *y, c, 0.5, c / 64, true, lname + \".m.\" + std::to_string(i));\n        y = bottleneck_layer->getOutput(0);  // update 'y' to be the output of the current PSABlock\n    }\n\n    // Concatenate the first split of cv1 with y (the PSABlock output computed from the second split)\n    nvinfer1::ITensor* concatInputs[2] = {split1->getOutput(0), y};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(concatInputs, 2);\n\n    // cv2 to produce the final output\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n\n    return conv2;\n}\n\nnvinfer1::ILayer* DWConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setNbGroups(ch);\n    // auto pad\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nnvinfer1::ILayer* A2C2f(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& 
input, int c1, int c2, int n, bool a2, int area, bool residual,\n                        float mlp_ratio, float e, int g, bool shortcut, std::string lname) {\n    int c = (int)(((float)c2) * e);\n    int num_heads = c / 32 * 2;\n\n    //assert(c % 32 == 0 && \"c2 should be divisible by 32\");\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c * 2, {1, 1}, 1, lname + \".cv1\");\n\n    if (a2) {\n\n        nvinfer1::ILayer* ablock1 =\n                ABlock(network, weightMap, *conv1->getOutput(0), c, num_heads, mlp_ratio, area, lname + \".m.0.0\");\n        nvinfer1::ILayer* ablock2 =\n                ABlock(network, weightMap, *ablock1->getOutput(0), c, num_heads, mlp_ratio, area, lname + \".m.0.1\");\n        nvinfer1::ILayer* ablock3 =\n                ABlock(network, weightMap, *ablock2->getOutput(0), c, num_heads, mlp_ratio, area, lname + \".m.1.0\");\n        nvinfer1::ILayer* ablock4 =\n                ABlock(network, weightMap, *ablock3->getOutput(0), c, num_heads, mlp_ratio, area, lname + \".m.1.1\");\n\n        nvinfer1::ITensor* inputTensors[] = {conv1->getOutput(0), ablock2->getOutput(0), ablock4->getOutput(0)};\n        nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensors, 3);\n\n        nvinfer1::IElementWiseLayer* conv2 =\n                convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n        return conv2;\n    } else {\n\n        nvinfer1::ILayer* c3k_ = C3k(network, weightMap, *conv1->getOutput(0), c * 2, c * 2, 2, shortcut, {3, 3},\n                                     {3, 3}, 0.5, lname + \".m.0\");\n\n        nvinfer1::ITensor* inputTensors[] = {conv1->getOutput(0), c3k_->getOutput(0)};\n        nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensors, 2);\n\n        nvinfer1::IElementWiseLayer* conv2 =\n                convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n        return conv2;\n    
}\n}\n\nnvinfer1::ILayer* ABlock(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int dim, int num_heads, float mlp_ratio, int area,\n                         std::string lname) {\n    int mlp_hidden_dim = (int)(dim * mlp_ratio);\n\n    nvinfer1::ILayer* attn = AAttn(network, weightMap, input, dim, num_heads, mlp_ratio, area, lname + \".attn\");\n    nvinfer1::IElementWiseLayer* sum =\n            network->addElementWise(input, *attn->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n\n    // MLP branch with a residual connection\n    nvinfer1::IElementWiseLayer* mlp1 =\n            convBnSiLU(network, weightMap, *sum->getOutput(0), mlp_hidden_dim * 2, {1, 1}, 1, lname + \".mlp.0\");\n\n    nvinfer1::ILayer* mlp2 = convBn(network, weightMap, *mlp1->getOutput(0), dim * 2, 1, 1, lname + \".mlp.1\");\n\n    nvinfer1::IElementWiseLayer* sum2 =\n            network->addElementWise(*sum->getOutput(0), *mlp2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n\n    return sum2;\n}\n\nnvinfer1::ILayer* AAttn(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int dim, int num_heads, float mlp_ratio, int area,\n                        std::string lname) {\n    int head_dim = (int)(dim / num_heads);\n    int all_head_dim = head_dim * num_heads;\n    // TODO: scale is hard-coded to 0.176777 (about 1/sqrt(32)); derive it from the head dimension instead\n    float scale = 0.176777f;\n    auto dims = input.getDimensions();\n    int B = dims.d[0];\n    int C = dims.d[1];\n    int H = dims.d[2];\n    int W = dims.d[3];\n    int N = H * W;\n\n    auto* qkv = convBn(network, weightMap, input, all_head_dim * 3 * 2, 1, 1, lname + \".qkv\");\n\n    auto* reshape = network->addShuffle(*qkv->getOutput(0));\n    reshape->setReshapeDimensions(nvinfer1::Dims3{B, -1, N});\n    reshape->setSecondTranspose(nvinfer1::Permutation{0, 2, 1});\n\n    if (area > 1) {\n        B = B * area;\n        N = 
(H * W) / area;\n    }\n\n    auto* reshape1 = network->addShuffle(*reshape->getOutput(0));\n    reshape1->setReshapeDimensions(nvinfer1::Dims4{B, N, num_heads, head_dim * 3 * 2});\n    reshape1->setSecondTranspose(nvinfer1::Permutation{0, 2, 3, 1});\n\n    nvinfer1::ISliceLayer* q = network->addSlice(\n            *reshape1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n            nvinfer1::Dims4{reshape1->getOutput(0)->getDimensions().d[0], reshape1->getOutput(0)->getDimensions().d[1],\n                            reshape1->getOutput(0)->getDimensions().d[2] / 3,\n                            reshape1->getOutput(0)->getDimensions().d[3]},\n            nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* k = network->addSlice(\n            *reshape1->getOutput(0), nvinfer1::Dims4{0, 0, reshape1->getOutput(0)->getDimensions().d[2] / 3, 0},\n            nvinfer1::Dims4{reshape1->getOutput(0)->getDimensions().d[0], reshape1->getOutput(0)->getDimensions().d[1],\n                            reshape1->getOutput(0)->getDimensions().d[2] / 3,\n                            reshape1->getOutput(0)->getDimensions().d[3]},\n            nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* v = network->addSlice(\n            *reshape1->getOutput(0), nvinfer1::Dims4{0, 0, 2 * reshape1->getOutput(0)->getDimensions().d[2] / 3, 0},\n            nvinfer1::Dims4{reshape1->getOutput(0)->getDimensions().d[0], reshape1->getOutput(0)->getDimensions().d[1],\n                            reshape1->getOutput(0)->getDimensions().d[2] / 3,\n                            reshape1->getOutput(0)->getDimensions().d[3]},\n            nvinfer1::Dims4{1, 1, 1, 1});\n\n    auto* qT = network->addShuffle(*q->getOutput(0));\n    qT->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n\n    auto matmul = network->addMatrixMultiply(*qT->getOutput(0), nvinfer1::MatrixOperation::kNONE, *k->getOutput(0),\n                                             nvinfer1::MatrixOperation::kNONE);\n\n    float* 
scale_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    scale_val[0] = scale;\n    nvinfer1::Weights s_w{nvinfer1::DataType::kFLOAT, scale_val, 1};\n    float* shift_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    shift_val[0] = 0;\n    nvinfer1::Weights sh_w{nvinfer1::DataType::kFLOAT, shift_val, 1};\n    float* power_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    power_val[0] = 1;\n    nvinfer1::Weights p_w{nvinfer1::DataType::kFLOAT, power_val, 1};\n    nvinfer1::IScaleLayer* mul =\n            network->addScale(*matmul->getOutput(0), nvinfer1::ScaleMode::kUNIFORM, sh_w, s_w, p_w);\n    auto* softmax = network->addSoftMax(*mul->getOutput(0));\n    softmax->setAxes(1 << 3);\n\n    auto transpose3 = network->addShuffle(*softmax->getOutput(0));\n    transpose3->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n\n    auto matmul1 = network->addMatrixMultiply(*v->getOutput(0), nvinfer1::MatrixOperation::kNONE,\n                                              *transpose3->getOutput(0), nvinfer1::MatrixOperation::kNONE);\n\n    auto transpose4 = network->addShuffle(*matmul1->getOutput(0));\n    transpose4->setFirstTranspose(nvinfer1::Permutation{0, 3, 1, 2});\n\n    if (area > 1) {\n        B = B / area;\n        N = N * area;\n    }\n\n    auto* reshape3 = network->addShuffle(*transpose4->getOutput(0));\n    reshape3->setReshapeDimensions(nvinfer1::Dims4{B, H, W, -1});\n\n    auto* transpose6 = network->addShuffle(*reshape3->getOutput(0));\n    transpose6->setFirstTranspose(nvinfer1::Permutation{0, 3, 1, 2});\n\n    auto transpose5 = network->addShuffle(*v->getOutput(0));\n    transpose5->setFirstTranspose(nvinfer1::Permutation{0, 3, 1, 2});\n\n    auto* reshape4 = network->addShuffle(*transpose5->getOutput(0));\n    reshape4->setReshapeDimensions(nvinfer1::Dims4{B, H, W, C});\n\n    //reshape4->setSecondTranspose(nvinfer1::Permutation{0, 3, 1, 2});\n    auto* transpose7 = 
network->addShuffle(*reshape4->getOutput(0));\n    transpose7->setFirstTranspose(nvinfer1::Permutation{0, 3, 1, 2});\n    auto* pe = convBn(network, weightMap, *transpose7->getOutput(0), all_head_dim * 2, 7, 1, lname + \".pe\",\n                      all_head_dim * 2);\n\n    auto* sum =\n            network->addElementWise(*pe->getOutput(0), *transpose6->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    auto* proj = convBn(network, weightMap, *sum->getOutput(0), all_head_dim * 2, 1, 1, lname + \".proj\");\n\n    return proj;\n}\n"
  },
  {
    "path": "yolov12/src/model.cpp",
    "content": "#include <math.h>\n#include <iostream>\n\n#include \"block.h\"\n//#include \"calibrator.h\"\n#include \"config.h\"\n#include \"model.h\"\n\nstatic int get_width(int x, float gw, int max_channels, int divisor = 8) {\n    auto channel = std::min(x, max_channels);\n    channel = int(ceil((channel * gw) / divisor)) * divisor;\n    return channel;\n}\n\nstatic int get_depth(int x, float gd) {\n    if (x == 1)\n        return 1;\n    int r = round(x * gd);\n    if (x * gd - int(x * gd) == 0.5 && (int(x * gd) % 2) == 0)\n        --r;\n    return std::max<int>(r, 1);\n}\n\nvoid calculateStrides(nvinfer1::IElementWiseLayer* conv_layers[], int size, int reference_size, int strides[]) {\n    for (int i = 0; i < size; ++i) {\n        nvinfer1::ILayer* layer = conv_layers[i];\n        nvinfer1::Dims dims = layer->getOutput(0)->getDimensions();\n        int feature_map_size = dims.d[2];\n        strides[i] = reference_size / feature_map_size;\n    }\n}\n\nnvinfer1::IHostMemory* buildEngineYolo12Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels, std::string& type)\n\n{\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    //\tnvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    /*******************************************************************************************************\n    ******************************************  YOLO12 INPUT  **********************************************\n    *******************************************************************************************************/\n\n    nvinfer1::ITensor* data = 
network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLO12 BACKBONE  ********************************************\n    *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, *conv0->getOutput(0),\n                                                    get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\");\n\n    bool c3k = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k = true;\n    }\n\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                 get_width(256, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.2\");\n\n    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                    get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\");\n\n    nvinfer1::IElementWiseLayer* conv4 =\n            C3K2(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                 get_width(512, gw, max_channels), get_depth(2, gd), c3k, true, 0.25, \"model.4\");\n\n    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0),\n                                                    get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n\n    nvinfer1::ILayer* conv6 = A2C2f(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                    get_width(512, gw, max_channels), 
4, true, 4, true, 2.0, 0.25, 1, true, \"model.6\");\n\n    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0),\n                                                    get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n\n    nvinfer1::ILayer* conv8 = A2C2f(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                                    get_width(1024, gw, max_channels), 4, true, 1, true, 2.0, 0.25, 1, true, \"model.8\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLO12 HEAD  ********************************************\n    *******************************************************************************************************/\n\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample9 = network->addResize(*conv8->getOutput(0));\n    assert(upsample9);\n    upsample9->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample9->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensors10[] = {upsample9->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat10 = network->addConcatenation(inputTensors10, 2);\n\n    nvinfer1::ILayer* conv11 =\n            A2C2f(network, weightMap, *cat10->getOutput(0), get_width(1024, gw, max_channels),\n                  get_width(512, gw, max_channels), 4, false, 1, true, 2.0, 0.25, 1, true, \"model.11\");\n\n    nvinfer1::IResizeLayer* upsample12 = network->addResize(*conv11->getOutput(0));\n    assert(upsample12);\n    upsample12->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample12->setScales(scale, 4);\n    nvinfer1::ITensor* inputTensors13[] = {upsample12->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat13 = network->addConcatenation(inputTensors13, 2);\n    nvinfer1::ILayer* conv14 =\n            A2C2f(network, weightMap, *cat13->getOutput(0), 
get_width(256, gw, max_channels),\n                  get_width(256, gw, max_channels), 4, false, 1, true, 2.0, 0.25, 1, true, \"model.14\");\n\n    nvinfer1::IElementWiseLayer* conv15 = convBnSiLU(network, weightMap, *conv14->getOutput(0),\n                                                     get_width(256, gw, max_channels), {3, 3}, 2, \"model.15\");\n    nvinfer1::ITensor* inputTensors16[] = {conv15->getOutput(0), conv11->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat16 = network->addConcatenation(inputTensors16, 2);\n\n    nvinfer1::ILayer* conv17 =\n            A2C2f(network, weightMap, *cat16->getOutput(0), get_width(512, gw, max_channels),\n                  get_width(512, gw, max_channels), 4, false, 1, true, 2.0, 0.25, 1, true, \"model.17\");\n\n    nvinfer1::IElementWiseLayer* conv18 = convBnSiLU(network, weightMap, *conv17->getOutput(0),\n                                                     get_width(512, gw, max_channels), {3, 3}, 2, \"model.18\");\n    nvinfer1::ITensor* inputTensors19[] = {conv18->getOutput(0), conv8->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat19 = network->addConcatenation(inputTensors19, 2);\n\n    nvinfer1::IElementWiseLayer* conv20 =\n            C3K2(network, weightMap, *cat19->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), get_depth(2, gd), true, true, 0.5, \"model.20\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLO12 OUTPUT  ******************************************\n    *******************************************************************************************************/\n\n    int c2 = std::max(std::max(16, get_width(256, gw, max_channels) / 4), 16 * 4);\n    int c3 = std::max(get_width(256, gw, max_channels), std::min(kNumClass, 100));\n\n    // output 0\n    nvinfer1::IElementWiseLayer* conv21_cv2_0_0 =\n            
convBnSiLU(network, weightMap, *conv14->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv2_0_0->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.0.1\");\n\n    nvinfer1::IConvolutionLayer* conv21_cv2_0_2 =\n            network->addConvolutionNd(*conv21_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv2.0.2.weight\"], weightMap[\"model.21.cv2.0.2.bias\"]);\n    conv21_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    auto* conv21_cv3_0_0_0 = DWConv(network, weightMap, *conv14->getOutput(0), get_width(256, gw, max_channels), {3, 3},\n                                    1, \"model.21.cv3.0.0.0\");\n    auto* conv21_cv3_0_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_0_0_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.0.0.1\");\n    auto* conv21_cv3_0_1_0 =\n            DWConv(network, weightMap, *conv21_cv3_0_0_1->getOutput(0), c3, {3, 3}, 1, \"model.21.cv3.0.1.0\");\n    auto* conv21_cv3_0_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_0_1_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.0.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv3_0_2 =\n            network->addConvolutionNd(*conv21_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv3.0.2.weight\"], weightMap[\"model.21.cv3.0.2.bias\"]);\n    conv21_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    nvinfer1::ITensor* inputTensor21_0[] = {conv21_cv2_0_2->getOutput(0), conv21_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21_0 = network->addConcatenation(inputTensor21_0, 2);\n\n    //output 1\n    nvinfer1::IElementWiseLayer* conv21_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv17->getOutput(0), 
c2, {3, 3}, 1, \"model.21.cv2.1.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv2_1_0->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv2_1_2 =\n            network->addConvolutionNd(*conv21_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv2.1.2.weight\"], weightMap[\"model.21.cv2.1.2.bias\"]);\n    conv21_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv21_cv3_1_0_0 = DWConv(network, weightMap, *conv17->getOutput(0), get_width(512, gw, max_channels), {3, 3},\n                                    1, \"model.21.cv3.1.0.0\");\n    auto* conv21_cv3_1_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_1_0_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.1.0.1\");\n    auto* conv21_cv3_1_1_0 =\n            DWConv(network, weightMap, *conv21_cv3_1_0_1->getOutput(0), c3, {3, 3}, 1, \"model.21.cv3.1.1.0\");\n    auto* conv21_cv3_1_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_1_1_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.1.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv3_1_2 =\n            network->addConvolutionNd(*conv21_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv3.1.2.weight\"], weightMap[\"model.21.cv3.1.2.bias\"]);\n    conv21_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor21_1[] = {conv21_cv2_1_2->getOutput(0), conv21_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21_1 = network->addConcatenation(inputTensor21_1, 2);\n\n    //output 2\n    nvinfer1::IElementWiseLayer* conv21_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv20->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.2.0\");\n    
nvinfer1::IElementWiseLayer* conv21_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv21_cv2_2_0->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv2_2_2 =\n            network->addConvolutionNd(*conv21_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv2.2.2.weight\"], weightMap[\"model.21.cv2.2.2.bias\"]);\n    conv21_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto* conv21_cv3_2_0_0 = DWConv(network, weightMap, *conv20->getOutput(0), get_width(1024, gw, max_channels),\n                                    {3, 3}, 1, \"model.21.cv3.2.0.0\");\n    auto* conv21_cv3_2_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_2_0_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.2.0.1\");\n    auto* conv21_cv3_2_1_0 =\n            DWConv(network, weightMap, *conv21_cv3_2_0_1->getOutput(0), c3, {3, 3}, 1, \"model.21.cv3.2.1.0\");\n    auto* conv21_cv3_2_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_2_1_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.2.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv3_2_2 =\n            network->addConvolutionNd(*conv21_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv3.2.2.weight\"], weightMap[\"model.21.cv3.2.2.bias\"]);\n    conv21_cv3_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv3_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor21_2[] = {conv21_cv2_2_2->getOutput(0), conv21_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21_2 = network->addConcatenation(inputTensor21_2, 2);\n\n    /*******************************************************************************************************\n    *********************************************  YOLO12 DETECT  ******************************************\n    
*******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle21_0 = network->addShuffle(*cat21_0->getOutput(0));\n    shuffle21_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split21_0_0 = network->addSlice(\n            *shuffle21_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split21_0_1 =\n            network->addSlice(*shuffle21_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl21_0 =\n            DFL(network, weightMap, *split21_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.21.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_0[] = {dfl21_0->getOutput(0), split21_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_0 = network->addConcatenation(inputTensor22_dfl_0, 2);\n    cat22_dfl_0->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle21_1 = network->addShuffle(*cat21_1->getOutput(0));\n    shuffle21_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split21_1_0 = network->addSlice(\n            *shuffle21_1->getOutput(0), 
nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split21_1_1 =\n            network->addSlice(*shuffle21_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl21_1 =\n            DFL(network, weightMap, *split21_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.21.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_1[] = {dfl21_1->getOutput(0), split21_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_1 = network->addConcatenation(inputTensor22_dfl_1, 2);\n    cat22_dfl_1->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle21_2 = network->addShuffle(*cat21_2->getOutput(0));\n    shuffle21_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split21_2_0 = network->addSlice(\n            *shuffle21_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split21_2_1 =\n            network->addSlice(*shuffle21_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl21_2 =\n            DFL(network, weightMap, *split21_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.21.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_2[] = {dfl21_2->getOutput(0), 
split21_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_2 = network->addConcatenation(inputTensor22_dfl_2, 2);\n    cat22_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat22_dfl_0, cat22_dfl_1, cat22_dfl_2},\n                         strides, stridesLength, false, false, false);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    // buildSerializedNetwork returns nullptr on failure, so only report success when a model was produced\n    if (serialized_model == nullptr) {\n        std::cerr << \"Build engine failed!\" << std::endl;\n    } else {\n        std::cout << \"Build engine successfully!\" << std::endl;\n    }\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n"
  },
  {
    "path": "yolov12/src/postprocess.cpp",
    "content": "#include \"postprocess.h\"\n#include \"utils.h\"\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\ncv::Rect get_rect_obb(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kObbInputW / (img.cols * 1.0);\n    float r_h = kObbInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kObbInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kObbInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kObbInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kObbInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - 
int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\ncv::Rect get_rect_adapt_landmark(cv::Mat& img, float bbox[4], float lmk[kNumberOfPoints * 3]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] / r_w;\n        r = bbox[2] / r_w;\n        t = (bbox[1] - (kInputH - r_w * img.rows) / 2) / r_w;\n        b = (bbox[3] - (kInputH - r_w * img.rows) / 2) / r_w;\n        for (int i = 0; i < kNumberOfPoints * 3; i += 3) {\n            lmk[i] /= r_w;\n            lmk[i + 1] = (lmk[i + 1] - (kInputH - r_w * img.rows) / 2) / r_w;\n            // lmk[i + 2]\n        }\n    } else {\n        l = (bbox[0] - (kInputW - r_h * img.cols) / 2) / r_h;\n        r = (bbox[2] - (kInputW - r_h * img.cols) / 2) / r_h;\n        t = bbox[1] / r_h;\n        b = bbox[3] / r_h;\n        for (int i = 0; i < kNumberOfPoints * 3; i += 3) {\n            lmk[i] = (lmk[i] - (kInputW - r_h * img.cols) / 2) / r_h;\n            lmk[i + 1] /= r_h;\n            // lmk[i + 2]\n        }\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\nstatic float iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n            (std::max)(lbox[0], rbox[0]),\n            (std::min)(lbox[2], rbox[2]),\n            (std::max)(lbox[1], rbox[1]),\n            (std::min)(lbox[3], rbox[3]),\n    };\n\n    if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS = (interBox[1] - interBox[0]) * (interBox[3] - interBox[2]);\n    float unionBoxS = (lbox[2] - lbox[0]) * (lbox[3] - lbox[1]) + (rbox[2] - rbox[0]) * (rbox[3] - rbox[1]) - interBoxS;\n    return interBoxS / 
unionBoxS;\n}\n\nstatic bool cmp(const Detection& a, const Detection& b) {\n    if (a.conf == b.conf) {\n        return a.bbox[0] < b.bbox[0];\n    }\n    return a.conf > b.conf;\n}\n\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n        if (output[1 + det_size * i + 4] <= conf_thresh || isnan(output[1 + det_size * i + 4]))\n            continue;\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        if (m.count(det.class_id) == 0)\n            m.emplace(det.class_id, std::vector<Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\nvoid batch_nms(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        nms(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n    }\n}\n\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count) {\n    Detection det;\n    for (int i = 0; i < count; i++) {\n        int basic_pos = 1 + i * bbox_element;\n        int keep_flag = decode_ptr_host[basic_pos + 6];\n        if 
(keep_flag == 1) {\n            det.bbox[0] = decode_ptr_host[basic_pos + 0];\n            det.bbox[1] = decode_ptr_host[basic_pos + 1];\n            det.bbox[2] = decode_ptr_host[basic_pos + 2];\n            det.bbox[3] = decode_ptr_host[basic_pos + 3];\n            det.conf = decode_ptr_host[basic_pos + 4];\n            det.class_id = decode_ptr_host[basic_pos + 5];\n            res.push_back(det);\n        }\n    }\n}\n\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch) {\n    res_batch.resize(batch_size);\n    int count = static_cast<int>(*decode_ptr_host);\n    count = std::min(count, kMaxNumOutputBbox);\n    for (int i = 0; i < batch_size; i++) {\n        auto& img = const_cast<cv::Mat&>(img_batch[i]);\n        process_decode_ptr_host(res_batch[i], &decode_ptr_host[i * count], bbox_element, img, count);\n    }\n}\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect(img, res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n    }\n}\n\nvoid draw_bbox_keypoints_line(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    const std::vector<std::pair<int, int>> skeleton_pairs = {\n            {0, 1}, {0, 2},  {0, 5}, {0, 6},  {1, 2},   {1, 3},   {2, 4},   {5, 6},   {5, 7},  {5, 11},\n            {6, 8}, {6, 12}, {7, 9}, {8, 10}, {11, 12}, {11, 13}, {12, 14}, {13, 15}, {14, 16}};\n\n    for (size_t i = 0; i < 
img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect_adapt_landmark(img, res[j].bbox, res[j].keypoints);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n\n            for (int k = 0; k < kNumberOfPoints * 3; k += 3) {\n                if (res[j].keypoints[k + 2] > 0.5) {\n                    cv::circle(img, cv::Point((int)res[j].keypoints[k], (int)res[j].keypoints[k + 1]), 3,\n                               cv::Scalar(0, 0x27, 0xC1), -1);\n                }\n            }\n\n            for (const auto& bone : skeleton_pairs) {\n                int kp1_idx = bone.first * 3;\n                int kp2_idx = bone.second * 3;\n                if (res[j].keypoints[kp1_idx + 2] > 0.5 && res[j].keypoints[kp2_idx + 2] > 0.5) {\n                    cv::Point p1((int)res[j].keypoints[kp1_idx], (int)res[j].keypoints[kp1_idx + 1]);\n                    cv::Point p2((int)res[j].keypoints[kp2_idx], (int)res[j].keypoints[kp2_idx + 1]);\n                    cv::line(img, p1, p2, cv::Scalar(0, 0x27, 0xC1), 2);\n                }\n            }\n        }\n    }\n}\n\ncv::Mat scale_mask(cv::Mat mask, cv::Mat img) {\n    int x, y, w, h;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = kInputW;\n        h = r_w * img.rows;\n        x = 0;\n        y = (kInputH - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = kInputH;\n        x = (kInputW - w) / 2;\n        y = 0;\n    }\n    cv::Rect r(x, y, w, h);\n    cv::Mat res;\n    cv::resize(mask(r), res, img.size());\n    return res;\n}\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& 
masks,\n                    std::unordered_map<int, std::string>& labels_map) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A, 0x92CC17,\n                                           0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < dets.size(); i++) {\n        cv::Mat img_mask = scale_mask(masks[i], img);\n        auto color = colors[(int)dets[i].class_id % colors.size()];\n        auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n\n        cv::Rect r = get_rect(img, dets[i].bbox);\n        for (int x = r.x; x < r.x + r.width; x++) {\n            for (int y = r.y; y < r.y + r.height; y++) {\n                float val = img_mask.at<float>(y, x);\n                if (val <= 0.5)\n                    continue;\n                img.at<cv::Vec3b>(y, x)[0] = img.at<cv::Vec3b>(y, x)[0] / 2 + bgr[0] / 2;\n                img.at<cv::Vec3b>(y, x)[1] = img.at<cv::Vec3b>(y, x)[1] / 2 + bgr[1] / 2;\n                img.at<cv::Vec3b>(y, x)[2] = img.at<cv::Vec3b>(y, x)[2] / 2 + bgr[2] / 2;\n            }\n        }\n\n        cv::rectangle(img, r, bgr, 2);\n\n        // Get the size of the text\n        cv::Size textSize =\n                cv::getTextSize(labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                                cv::FONT_HERSHEY_PLAIN, 1.2, 2, NULL);\n        // Set the top left corner of the rectangle\n        cv::Point topLeft(r.x, r.y - textSize.height);\n\n        // Set the bottom right corner of the rectangle\n        cv::Point bottomRight(r.x + textSize.width, r.y + textSize.height);\n\n        // Set the thickness of the rectangle lines\n        int lineThickness = 2;\n\n        // Draw the rectangle on the image\n        cv::rectangle(img, topLeft, bottomRight, bgr, -1);\n\n        
cv::putText(img, labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                    cv::Point(r.x, r.y + 4), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar::all(0xFF), 2);\n    }\n}\n\nvoid process_decode_ptr_host_obb(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element,\n                                 cv::Mat& img, int count) {\n    Detection det;\n    for (int i = 0; i < count; i++) {\n        int basic_pos = 1 + i * bbox_element;\n        int keep_flag = decode_ptr_host[basic_pos + 6];\n        if (keep_flag == 1) {\n            det.bbox[0] = decode_ptr_host[basic_pos + 0];\n            det.bbox[1] = decode_ptr_host[basic_pos + 1];\n            det.bbox[2] = decode_ptr_host[basic_pos + 2];\n            det.bbox[3] = decode_ptr_host[basic_pos + 3];\n            det.conf = decode_ptr_host[basic_pos + 4];\n            det.class_id = decode_ptr_host[basic_pos + 5];\n            det.angle = decode_ptr_host[basic_pos + 7];\n            res.push_back(det);\n        }\n    }\n}\n\nvoid batch_process_obb(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                       int bbox_element, const std::vector<cv::Mat>& img_batch) {\n    res_batch.resize(batch_size);\n    int count = static_cast<int>(*decode_ptr_host);\n    count = std::min(count, kMaxNumOutputBbox);\n    for (int i = 0; i < batch_size; i++) {\n        auto& img = const_cast<cv::Mat&>(img_batch[i]);\n        process_decode_ptr_host_obb(res_batch[i], &decode_ptr_host[i * count], bbox_element, img, count);\n    }\n}\n\nstd::tuple<float, float, float> convariance_matrix(Detection res) {\n    float w = res.bbox[2];\n    float h = res.bbox[3];\n\n    float a = w * w / 12.0;\n    float b = h * h / 12.0;\n    float c = res.angle;\n\n    float cos_r = std::cos(c);\n    float sin_r = std::sin(c);\n\n    float cos_r2 = cos_r * cos_r;\n    float sin_r2 = sin_r * sin_r;\n\n    float a_val = a * cos_r2 + b * 
sin_r2;\n    float b_val = a * sin_r2 + b * cos_r2;\n    float c_val = (a - b) * cos_r * sin_r;\n\n    return std::make_tuple(a_val, b_val, c_val);\n}\n\nstatic float probiou(const Detection& res1, const Detection& res2, float eps = 1e-7) {\n    // Calculate the prob iou between oriented bounding boxes, https://arxiv.org/pdf/2106.06072v1.pdf.\n    float a1, b1, c1, a2, b2, c2;\n    // Unpack the covariance terms (a, b, c) of each box.\n    std::tie(a1, b1, c1) = convariance_matrix(res1);\n    std::tie(a2, b2, c2) = convariance_matrix(res2);\n\n    float x1 = res1.bbox[0], y1 = res1.bbox[1];\n    float x2 = res2.bbox[0], y2 = res2.bbox[1];\n\n    float t1 = ((a1 + a2) * std::pow(y1 - y2, 2) + (b1 + b2) * std::pow(x1 - x2, 2)) /\n               ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2) + eps);\n    float t2 = ((c1 + c2) * (x2 - x1) * (y1 - y2)) / ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2) + eps);\n    float t3 = std::log(\n            ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2)) /\n                    (4 * std::sqrt(std::max(a1 * b1 - c1 * c1, 0.0f)) * std::sqrt(std::max(a2 * b2 - c2 * c2, 0.0f)) +\n                     eps) +\n            eps);\n\n    float bd = 0.25f * t1 + 0.5f * t2 + 0.5f * t3;\n    bd = std::max(std::min(bd, 100.0f), eps);\n    float hd = std::sqrt(1.0 - std::exp(-bd) + eps);\n\n    return 1 - hd;\n}\n\nvoid nms_obb(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n\n        if (output[1 + det_size * i + 4] <= conf_thresh)\n            continue;\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * 
sizeof(float));\n        if (m.count(det.class_id) == 0)\n            m.emplace(det.class_id, std::vector<Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (probiou(item, dets[n]) >= nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\nvoid batch_nms_obb(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n                   float conf_thresh, float nms_thresh) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        nms_obb(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n    }\n}\n\nstatic std::vector<cv::Point> get_corner(cv::Mat& img, const Detection& box) {\n    float cos_value, sin_value;\n\n    // Calculate center point and width/height\n    float x1 = box.bbox[0];\n    float y1 = box.bbox[1];\n    float w = box.bbox[2];\n    float h = box.bbox[3];\n    float angle = box.angle * 180.0f / CV_PI;  // Convert radians to degrees\n\n    // Print original angle\n    std::cout << \"Original angle: \" << angle << std::endl;\n\n    // Swap width and height if height is greater than or equal to width\n    if (h >= w) {\n        std::swap(w, h);\n        angle = fmod(angle + 90.0f, 180.0f);  // Adjust angle to be within [0, 180)\n    }\n\n    // Ensure the angle is between 0 and 180 degrees\n    if (angle < 0) {\n        angle += 360.0f;  // Convert to positive value\n    }\n    if (angle > 180.0f) {\n        angle -= 180.0f;  // Subtract 180 from angles greater than 180\n    }\n\n    // Print adjusted angle\n    std::cout << 
\"Adjusted angle: \" << angle << std::endl;\n\n    // Convert to normal angle value\n    float normal_angle = fmod(angle, 180.0f);\n    if (normal_angle < 0) {\n        normal_angle += 180.0f;  // Ensure it's a positive value\n    }\n\n    // Print normal angle value\n    std::cout << \"Normal angle: \" << normal_angle << std::endl;\n\n    cos_value = std::cos(angle * CV_PI / 180.0f);  // Convert to radians\n    sin_value = std::sin(angle * CV_PI / 180.0f);\n\n    // Calculate each corner point\n    float l = x1 - w / 2;  // Left boundary\n    float r = x1 + w / 2;  // Right boundary\n    float t = y1 - h / 2;  // Top boundary\n    float b = y1 + h / 2;  // Bottom boundary\n\n    // Use get_rect function to scale the coordinates\n    float bbox[4] = {l, t, r, b};\n    cv::Rect rect = get_rect_obb(img, bbox);\n\n    float x_ = (rect.x + rect.x + rect.width) / 2;   // Center x\n    float y_ = (rect.y + rect.y + rect.height) / 2;  // Center y\n    float width = rect.width;                        // Width\n    float height = rect.height;                      // Height\n\n    // Calculate each corner point\n    std::vector<cv::Point> corner_points(4);\n    float vec1x = width / 2 * cos_value;\n    float vec1y = width / 2 * sin_value;\n    float vec2x = -height / 2 * sin_value;\n    float vec2y = height / 2 * cos_value;\n\n    corner_points[0] = cv::Point(int(round(x_ + vec1x + vec2x)), int(round(y_ + vec1y + vec2y)));  // Top-left corner\n    corner_points[1] = cv::Point(int(round(x_ + vec1x - vec2x)), int(round(y_ + vec1y - vec2y)));  // Top-right corner\n    corner_points[2] =\n            cv::Point(int(round(x_ - vec1x - vec2x)), int(round(y_ - vec1y - vec2y)));  // Bottom-right corner\n    corner_points[3] = cv::Point(int(round(x_ - vec1x + vec2x)), int(round(y_ - vec1y + vec2y)));  // Bottom-left corner\n\n    // Check and adjust corner points to ensure the rectangle is parallel to image boundaries\n    for (auto& point : corner_points) {\n        point.x = 
std::max(0, std::min(point.x, img.cols - 1));\n        point.y = std::max(0, std::min(point.y, img.rows - 1));\n    }\n\n    return corner_points;\n}\n\nvoid draw_bbox_obb(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A, 0x92CC17,\n                                           0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        auto& img = img_batch[i];\n        for (auto& obj : res) {\n            auto color = colors[(int)obj.class_id % colors.size()];\n            auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n            auto corner_points = get_corner(img, obj);\n            cv::polylines(img, std::vector<std::vector<cv::Point>>{corner_points}, true, bgr, 1);\n\n            auto text = (std::to_string((int)(obj.class_id)) + \":\" + to_string_with_precision(obj.conf));\n            cv::Size textsize = cv::getTextSize(text, 0, 0.3, 1, nullptr);\n\n            int width = textsize.width;\n            int height = textsize.height;\n            bool outside = corner_points[0].y - height >= 3;\n            cv::Point p1(corner_points[0].x, corner_points[0].y), p2;\n            p2.x = corner_points[0].x + width;\n            if (outside) {\n                p2.y = corner_points[0].y - height - 3;\n            } else {\n                p2.y = corner_points[0].y + height + 3;\n            }\n            cv::rectangle(img, p1, p2, bgr, -1, cv::LINE_AA);\n            cv::putText(\n                    img, text,\n                    cv::Point(corner_points[0].x, (outside ? 
corner_points[0].y - 2 : corner_points[0].y + height + 2)),\n                    0, 0.3, cv::Scalar::all(255), 1, cv::LINE_AA);\n        }\n    }\n}\n"
  },
  {
    "path": "yolov12/src/postprocess.cu",
    "content": "//\n// Created by lindsay on 23-7-17.\n//\n#include \"postprocess.h\"\n#include \"types.h\"\n\nstatic __global__ void decode_kernel_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray,\n                                         int max_objects) {\n    float count = predict[0];\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    if (position >= count)\n        return;\n\n    float* pitem = predict + 1 + position * (sizeof(Detection) / sizeof(float));\n    int index = atomicAdd(parray, 1);\n    if (index >= max_objects)\n        return;\n\n    float confidence = pitem[4];\n\n    if (confidence < confidence_threshold)\n        return;\n    //[center_x center_y w h conf class_id  mask[32] keypoints[51] angle]\n    float cx = pitem[0];\n    float cy = pitem[1];\n    float width = pitem[2];\n    float height = pitem[3];\n    float label = pitem[5];\n    float angle = pitem[89];\n\n    float* pout_item = parray + 1 + index * bbox_element;\n    *pout_item++ = cx;\n    *pout_item++ = cy;\n    *pout_item++ = width;\n    *pout_item++ = height;\n    *pout_item++ = confidence;\n    *pout_item++ = label;\n    *pout_item++ = 1;  // 1 = keep, 0 = ignore\n    *pout_item++ = angle;\n}\n\nstatic __global__ void decode_kernel(float* predict, int num_bboxes, float confidence_threshold, float* parray,\n                                     int max_objects) {\n    float count = predict[0];\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    if (position >= count)\n        return;\n\n    float* pitem = predict + 1 + position * (sizeof(Detection) / sizeof(float));\n    int index = atomicAdd(parray, 1);\n    if (index >= max_objects)\n        return;\n\n    float confidence = pitem[4];\n    if (confidence < confidence_threshold)\n        return;\n\n    float left = pitem[0];\n    float top = pitem[1];\n    float right = pitem[2];\n    float bottom = pitem[3];\n    float label = pitem[5];\n\n    float* pout_item = parray + 1 
+ index * bbox_element;\n    *pout_item++ = left;\n    *pout_item++ = top;\n    *pout_item++ = right;\n    *pout_item++ = bottom;\n    *pout_item++ = confidence;\n    *pout_item++ = label;\n    *pout_item++ = 1;  // 1 = keep, 0 = ignore\n}\n\nstatic __device__ float box_iou(float aleft, float atop, float aright, float abottom, float bleft, float btop,\n                                float bright, float bbottom) {\n    float cleft = max(aleft, bleft);\n    float ctop = max(atop, btop);\n    float cright = min(aright, bright);\n    float cbottom = min(abottom, bbottom);\n    float c_area = max(cright - cleft, 0.0f) * max(cbottom - ctop, 0.0f);\n    if (c_area == 0.0f)\n        return 0.0f;\n\n    float a_area = max(0.0f, aright - aleft) * max(0.0f, abottom - atop);\n    float b_area = max(0.0f, bright - bleft) * max(0.0f, bbottom - btop);\n    return c_area / (a_area + b_area - c_area);\n}\n\nstatic __global__ void nms_kernel(float* bboxes, int max_objects, float threshold) {\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    int count = min(static_cast<int>(bboxes[0]), max_objects);\n    if (position >= count)\n        return;\n\n    float* pcurrent = bboxes + 1 + position * bbox_element;\n    for (int i = 0; i < count; ++i) {\n        float* pitem = bboxes + 1 + i * bbox_element;\n        if (i == position || pcurrent[5] != pitem[5])\n            continue;\n        if (pitem[4] >= pcurrent[4]) {\n            if (pitem[4] == pcurrent[4] && i < position)\n                continue;\n            float iou =\n                    box_iou(pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3], pitem[0], pitem[1], pitem[2], pitem[3]);\n            if (iou > threshold) {\n                pcurrent[6] = 0;\n                return;\n            }\n        }\n    }\n}\n\nstatic __device__ void convariance_matrix(float w, float h, float r, float& a, float& b, float& c) {\n    float a_val = w * w / 12.0f;\n    float b_val = h * h / 12.0f;\n    float cos_r = cosf(r);\n 
   float sin_r = sinf(r);\n\n    a = a_val * cos_r * cos_r + b_val * sin_r * sin_r;\n    b = a_val * sin_r * sin_r + b_val * cos_r * cos_r;\n    c = (a_val - b_val) * sin_r * cos_r;\n}\n\nstatic __device__ float box_probiou(float cx1, float cy1, float w1, float h1, float r1, float cx2, float cy2, float w2,\n                                    float h2, float r2, float eps = 1e-7) {\n\n    // Calculate the prob iou between oriented bounding boxes, https://arxiv.org/pdf/2106.06072v1.pdf.\n    float a1, b1, c1, a2, b2, c2;\n    convariance_matrix(w1, h1, r1, a1, b1, c1);\n    convariance_matrix(w2, h2, r2, a2, b2, c2);\n\n    float t1 = ((a1 + a2) * powf(cy1 - cy2, 2) + (b1 + b2) * powf(cx1 - cx2, 2)) /\n               ((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2) + eps);\n    float t2 = ((c1 + c2) * (cx2 - cx1) * (cy1 - cy2)) / ((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2) + eps);\n    float t3 = logf(((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2)) /\n                            (4 * sqrtf(fmaxf(a1 * b1 - c1 * c1, 0.0f)) * sqrtf(fmaxf(a2 * b2 - c2 * c2, 0.0f)) + eps) +\n                    eps);\n    float bd = 0.25f * t1 + 0.5f * t2 + 0.5f * t3;\n    bd = fmaxf(fminf(bd, 100.0f), eps);\n    float hd = sqrtf(1.0f - expf(-bd) + eps);\n    return 1 - hd;\n}\n\nstatic __global__ void nms_kernel_obb(float* bboxes, int max_objects, float threshold) {\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    int count = min(static_cast<int>(bboxes[0]), max_objects);\n    if (position >= count)\n        return;\n\n    float* pcurrent = bboxes + 1 + position * bbox_element;\n    for (int i = 0; i < count; ++i) {\n        float* pitem = bboxes + 1 + i * bbox_element;\n        if (i == position || pcurrent[5] != pitem[5])\n            continue;\n        if (pitem[4] >= pcurrent[4]) {\n            if (pitem[4] == pcurrent[4] && i < position)\n                continue;\n            float iou = box_probiou(pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3], pcurrent[7], pitem[0], 
pitem[1],\n                                    pitem[2], pitem[3], pitem[7]);\n            if (iou > threshold) {\n                pcurrent[6] = 0;\n                return;\n            }\n        }\n    }\n}\n\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 cudaStream_t stream) {\n    int block = 256;\n    int grid = ceil(num_bboxes / (float)block);\n    decode_kernel<<<grid, block, 0, stream>>>((float*)predict, num_bboxes, confidence_threshold, parray, max_objects);\n}\n\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream) {\n    int block = max_objects < 256 ? max_objects : 256;\n    int grid = ceil(max_objects / (float)block);\n    nms_kernel<<<grid, block, 0, stream>>>(parray, max_objects, nms_threshold);\n}\n\nvoid cuda_decode_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                     cudaStream_t stream) {\n    int block = 256;\n    int grid = ceil(num_bboxes / (float)block);\n    decode_kernel_obb<<<grid, block, 0, stream>>>((float*)predict, num_bboxes, confidence_threshold, parray,\n                                                  max_objects);\n}\n\nvoid cuda_nms_obb(float* parray, float nms_threshold, int max_objects, cudaStream_t stream) {\n    int block = max_objects < 256 ? max_objects : 256;\n    int grid = ceil(max_objects / (float)block);\n    nms_kernel_obb<<<grid, block, 0, stream>>>(parray, max_objects, nms_threshold);\n}\n"
  },
  {
    "path": "yolov12/src/preprocess.cu",
    "content": "#include \"cuda_utils.h\"\n#include \"preprocess.h\"\n\nstatic uint8_t* img_buffer_host = nullptr;\nstatic uint8_t* img_buffer_device = nullptr;\n\n__global__ void warpaffine_kernel(uint8_t* src, int src_line_size, int src_width, int src_height, float* dst,\n                                  int dst_width, int dst_height, uint8_t const_value_st, AffineMatrix d2s, int edge) {\n    int position = blockDim.x * blockIdx.x + threadIdx.x;\n    if (position >= edge)\n        return;\n\n    float m_x1 = d2s.value[0];\n    float m_y1 = d2s.value[1];\n    float m_z1 = d2s.value[2];\n    float m_x2 = d2s.value[3];\n    float m_y2 = d2s.value[4];\n    float m_z2 = d2s.value[5];\n\n    int dx = position % dst_width;\n    int dy = position / dst_width;\n    float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;\n    float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;\n    float c0, c1, c2;\n\n    if (src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height) {\n        // out of range\n        c0 = const_value_st;\n        c1 = const_value_st;\n        c2 = const_value_st;\n    } else {\n        int y_low = floorf(src_y);\n        int x_low = floorf(src_x);\n        int y_high = y_low + 1;\n        int x_high = x_low + 1;\n\n        uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};\n        float ly = src_y - y_low;\n        float lx = src_x - x_low;\n        float hy = 1 - ly;\n        float hx = 1 - lx;\n        float w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n        uint8_t* v1 = const_value;\n        uint8_t* v2 = const_value;\n        uint8_t* v3 = const_value;\n        uint8_t* v4 = const_value;\n\n        if (y_low >= 0) {\n            if (x_low >= 0)\n                v1 = src + y_low * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v2 = src + y_low * src_line_size + x_high * 3;\n        }\n\n        if (y_high < src_height) {\n            if (x_low >= 0)\n                
v3 = src + y_high * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v4 = src + y_high * src_line_size + x_high * 3;\n        }\n\n        c0 = w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0];\n        c1 = w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1];\n        c2 = w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2];\n    }\n\n    // bgr to rgb\n    float t = c2;\n    c2 = c0;\n    c0 = t;\n\n    // normalization\n    c0 = c0 / 255.0f;\n    c1 = c1 / 255.0f;\n    c2 = c2 / 255.0f;\n\n    // rgbrgbrgb to rrrgggbbb\n    int area = dst_width * dst_height;\n    float* pdst_c0 = dst + dy * dst_width + dx;\n    float* pdst_c1 = pdst_c0 + area;\n    float* pdst_c2 = pdst_c1 + area;\n    *pdst_c0 = c0;\n    *pdst_c1 = c1;\n    *pdst_c2 = c2;\n}\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream) {\n    int img_size = src_width * src_height * 3;\n    // copy data to pinned memory\n    memcpy(img_buffer_host, src, img_size);\n    // copy data to device memory\n    CUDA_CHECK(cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size, cudaMemcpyHostToDevice, stream));\n\n    AffineMatrix s2d, d2s;\n    float scale = std::min(dst_height / (float)src_height, dst_width / (float)src_width);\n\n    s2d.value[0] = scale;\n    s2d.value[1] = 0;\n    s2d.value[2] = -scale * src_width * 0.5 + dst_width * 0.5;\n    s2d.value[3] = 0;\n    s2d.value[4] = scale;\n    s2d.value[5] = -scale * src_height * 0.5 + dst_height * 0.5;\n    cv::Mat m2x3_s2d(2, 3, CV_32F, s2d.value);\n    cv::Mat m2x3_d2s(2, 3, CV_32F, d2s.value);\n    cv::invertAffineTransform(m2x3_s2d, m2x3_d2s);\n\n    memcpy(d2s.value, m2x3_d2s.ptr<float>(0), sizeof(d2s.value));\n\n    int jobs = dst_height * dst_width;\n    int threads = 256;\n    int blocks = ceil(jobs / (float)threads);\n    warpaffine_kernel<<<blocks, threads, 0, stream>>>(img_buffer_device, src_width * 3, 
src_width, src_height, dst,\n                                                      dst_width, dst_height, 128, d2s, jobs);\n}\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream) {\n    int dst_size = dst_width * dst_height * 3;\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        cuda_preprocess(img_batch[i].ptr(), img_batch[i].cols, img_batch[i].rows, &dst[dst_size * i], dst_width,\n                        dst_height, stream);\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n    }\n}\n\nvoid cuda_preprocess_init(int max_image_size) {\n    // prepare input data in pinned memory\n    CUDA_CHECK(cudaMallocHost((void**)&img_buffer_host, max_image_size * 3));\n    // prepare input data in device memory\n    CUDA_CHECK(cudaMalloc((void**)&img_buffer_device, max_image_size * 3));\n}\n\nvoid cuda_preprocess_destroy() {\n    CUDA_CHECK(cudaFree(img_buffer_device));\n    CUDA_CHECK(cudaFreeHost(img_buffer_host));\n}\n"
  },
  {
    "path": "yolov12/yolo12_det.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, float& gd, float& gw, int& max_channels,\n                      std::string& type) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    serialized_engine = buildEngineYolo12Det(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = 
(*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host, float** decode_ptr_host, float** decode_ptr_device,\n                    std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"Do not yet support GPU post processing for multiple batches\" << std::endl;\n            exit(0);\n        }\n        // Allocate memory for decode_ptr_host and copy to device\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           float* decode_ptr_host, float* decode_ptr_device, int model_bboxes, std::string cuda_post_process) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueueV2(buffers, 
stream, nullptr);\n    if (cuda_post_process == \"c\") {\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir, std::string& type,\n                std::string& cuda_post_process, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        auto sub_type = std::string(argv[4]);\n\n        if (sub_type[0] == 'n') {\n            gd = 0.50;\n            gw = 0.25;\n            max_channels = 1024;\n            type = \"n\";\n        } else if (sub_type[0] == 's') {\n       
     gd = 0.50;\n            gw = 0.50;\n            max_channels = 1024;\n            type = \"s\";\n        } else if (sub_type[0] == 'm') {\n            gd = 0.50;\n            gw = 1.00;\n            max_channels = 512;\n            type = \"m\";\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n            type = \"l\";\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.50;\n            max_channels = 512;\n            type = \"x\";\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    // yolo12_det -s ../models/yolo12n.wts ../models/yolo12n.fp32.trt n\n    // yolo12_det -d ../models/yolo12n.fp32.trt ../images c\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    std::string img_dir;\n    std::string cuda_post_process;\n    std::string type;\n    int model_bboxes;\n    float gd = 0, gw = 0;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, cuda_post_process, gd, gw, max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolo12_det -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr << \"./yolo12_det -d [.engine] ../images  [c/g]// deserialize plan file and run inference\"\n                  << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, gd, gw, max_channels, type);\n        return 0;\n  
  }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host, &decode_ptr_host,\n                   &decode_ptr_device, cuda_post_process);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize, decode_ptr_host,\n              decode_ptr_device, model_bboxes, cuda_post_process);\n        // Save the first 100 values of output_buffer_host, one per line\n        //        std::ofstream out(\"../models/output.txt\");\n        //        for (int j = 0; j < 100; j++) {\n        //            out << 
output_buffer_host[j] << std::endl;\n        //        }\n        //        out.close();\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n        } else if (cuda_post_process == \"g\") {\n            //Process gpu decode and nms results\n            batch_process(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\n        }\n        // Draw bounding boxes\n        draw_bbox(img_batch, res_batch);\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov12-tubro/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(yolov12)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\n# Set CUDA compiler - use find_package or environment variable\nif(NOT DEFINED CMAKE_CUDA_COMPILER)\n  find_program(\n    CMAKE_CUDA_COMPILER nvcc\n    HINTS ENV CUDA_HOME\n    PATH_SUFFIXES bin)\nendif()\nenable_language(CUDA)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\ninclude_directories(${PROJECT_SOURCE_DIR}/plugin)\n\n# include and link dirs of cuda and tensorrt\n# Use CUDA_TOOLKIT_ROOT_DIR or CUDA_HOME environment variable\nif(NOT DEFINED CUDA_TOOLKIT_ROOT_DIR)\n  if(DEFINED ENV{CUDA_HOME})\n    set(CUDA_TOOLKIT_ROOT_DIR $ENV{CUDA_HOME})\n  else()\n    set(CUDA_TOOLKIT_ROOT_DIR \"/usr/local/cuda\")\n  endif()\nendif()\n\n# Use TENSORRT_DIR environment variable or default path\nif(NOT DEFINED TENSORRT_DIR)\n  if(DEFINED ENV{TENSORRT_DIR})\n    set(TENSORRT_DIR $ENV{TENSORRT_DIR})\n  else()\n    set(TENSORRT_DIR \"/opt/TensorRT-8.6.1.6\")\n  endif()\nendif()\n\nif(CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n  message(\"embed_platform on\")\n  include_directories(\n    ${CUDA_TOOLKIT_ROOT_DIR}/targets/aarch64-linux/include)\n  link_directories(\n    ${CUDA_TOOLKIT_ROOT_DIR}/targets/aarch64-linux/lib)\nelse()\n  message(\"embed_platform off\")\n\n  # cuda\n  include_directories(${CUDA_TOOLKIT_ROOT_DIR}/include)\n  link_directories(${CUDA_TOOLKIT_ROOT_DIR}/lib64)\n\n  # tensorrt\n  include_directories(${TENSORRT_DIR}/include)\n  link_directories(${TENSORRT_DIR}/lib)\nendif()\n\nadd_library(\n  myplugins SHARED\n  ${PROJECT_SOURCE_DIR}/plugin/yololayer.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nfile(\n  GLOB_RECURSE SRCS\n  ${PROJECT_SOURCE_DIR}/src/*.cpp\n  ${PROJECT_SOURCE_DIR}/src/*.cu)\n\nadd_executable(\n  yolov12-det\n  ${PROJECT_SOURCE_DIR}/yolov12_det.cpp\n  
${SRCS})\ntarget_link_libraries(\n  yolov12-det nvinfer cudart myplugins ${OpenCV_LIBS})\n\nadd_executable(\n  yolov12-seg\n  ${PROJECT_SOURCE_DIR}/yolov12_seg.cpp\n  ${SRCS})\ntarget_link_libraries(\n  yolov12-seg nvinfer cudart myplugins ${OpenCV_LIBS})\n\nadd_executable(\n  yolov12-cls\n  ${PROJECT_SOURCE_DIR}/yolov12_cls.cpp\n  ${SRCS})\ntarget_link_libraries(\n  yolov12-cls nvinfer cudart myplugins ${OpenCV_LIBS})\n"
  },
  {
    "path": "yolov12-tubro/gen_wts.py",
    "content": "import sys  # noqa: F401\nimport argparse\nimport os\nimport struct\nimport torch\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\n    parser.add_argument('-w', '--weights', required=True,\n                        help='Input weights (.pt) file path (required)')\n    parser.add_argument(\n        '-o', '--output', help='Output (.wts) file path (optional)')\n    parser.add_argument(\n        '-t', '--type', type=str, default='detect', choices=['detect', 'cls', 'seg', 'pose', 'obb'],\n        help='Model task type: detect, cls, seg, pose, or obb (default: detect)')\n    args = parser.parse_args()\n    if not os.path.isfile(args.weights):\n        raise SystemExit('Invalid input file')\n    if not args.output:\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\n    elif os.path.isdir(args.output):\n        args.output = os.path.join(\n            args.output,\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\n    return args.weights, args.output, args.type\n\n\npt_file, wts_file, m_type = parse_args()\n\nprint(f'Generating .wts for {m_type} model')\n\n# Load model\nprint(f'Loading {pt_file}')\n\n# Initialize\ndevice = 'cpu'\n\n# Load model\nmodel = torch.load(pt_file, map_location=device, weights_only=False)['model'].float()  # load to FP32\n\nif m_type in ['detect', 'seg', 'pose', 'obb']:\n    anchor_grid = model.model[-1].anchors * model.model[-1].stride[..., None, None]\n\n    delattr(model.model[-1], 'anchors')\n\nmodel.to(device).eval()\n\nwith open(wts_file, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolov12-tubro/include/block.h",
    "content": "#pragma once\n\n#include <map>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n\nusing namespace std;\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file);\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps);\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, std::vector<int> k, int s, std::string lname, int p = 0, int g = 1,\n                                        int d = 1);\n\nnvinfer1::ILayer* Conv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                       nvinfer1::ITensor& input, int c_out, std::string lname, int k = 1, int s = 1, int padding = 0,\n                       int g = 1, bool act = true);\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname);\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> dets, const int* px_arry,\n                                       int px_arry_num, bool is_segmentation, bool is_pose, bool is_obb);\n\nnvinfer1::IElementWiseLayer* C3k(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c2,\n                                 std::string lname, int n = 1, bool shortcut = true, int g = 1, float e = 0.5,\n                 
                int k = 3);\n\nnvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c2,\n                                  int n, std::string lname, bool c3k = false, float e = 0.5, int g = 1,\n                                  bool shortcut = true);\n\nnvinfer1::ILayer* AAttn(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int dim, int num_heads, std::string lname, int area = 1);\n\nnvinfer1::ILayer* DWConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname);\n\nnvinfer1::IElementWiseLayer* ABlock(nvinfer1::INetworkDefinition* network,\n                                    std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int dim, int num_heads, std::string lname, float mlp_ratio = 1.2, int area = 1);\n\nnvinfer1::ILayer* A2C2f(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>,\n                        nvinfer1::ITensor& input, int c2, int n, std::string lname, bool a2 = true, int area = 1,\n                        bool residual = false, float mlp_ratio = 2.0, float e = 0.5, int g = 1, bool shortcut = true);\n\nvoid cout_dim(nvinfer1::ITensor& input);\n"
  },
  {
    "path": "yolov12-tubro/include/calibrator.h",
    "content": "#ifndef ENTROPY_CALIBRATOR_H\n#define ENTROPY_CALIBRATOR_H\n\n#include <NvInfer.h>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2 {\n   public:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name,\n                           const char* input_blob_name, bool read_cache = true);\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\n   private:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\n#endif  // ENTROPY_CALIBRATOR_H\n"
  },
  {
    "path": "yolov12-tubro/include/config.h",
    "content": "#define USE_FP16\n// #define USE_FP32\n// #define USE_INT8\n\nconst static char* kInputTensorName = \"images\";\nconst static char* kOutputTensorName = \"output\";\nconst static char* kProtoTensorName = \"proto\";\nconst static int kNumClass = 4;\nconst static int kPoseNumClass = 1;\nconst static int kNumberOfPoints = 17;  // number of keypoints total\n// obb model's number of classes\nconstexpr static int kObbNumClass = 15;\nconst static int kObbNe = 1;  // number of extra parameters\nconst static int kBatchSize = 1;\nconst static int kGpuId = 0;\nconst static int kInputH = 640;\nconst static int kInputW = 640;\nconst static int kObbInputH = 1024;\nconst static int kObbInputW = 1024;\nconst static float kNmsThresh = 0.45f;\nconst static float kConfThresh = 0.5f;\nconst static float kConfThreshKeypoints = 0.5f;  // keypoints confidence\nconst static int kMaxInputImageSize = 3000 * 3000;\nconst static int kMaxNumOutputBbox = 1000;\n//Quantization input image folder path\nconst static char* kInputQuantizationFolder = \"./coco_calib\";\n\n// Classification model's number of classes\nconstexpr static int kClsNumClass = 5;\n// Classification model's input shape\nconstexpr static int kClsInputH = 224;\nconstexpr static int kClsInputW = 224;\n"
  },
  {
    "path": "yolov12-tubro/include/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n#include <cassert>\n#include <iostream>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n"
  },
  {
    "path": "yolov12-tubro/include/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n 
   virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov12-tubro/include/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include \"NvInfer.h\"\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "yolov12-tubro/include/model.h",
    "content": "#pragma once\n\n#include <assert.h>\n#include <string>\n#include \"NvInfer.h\"\n\nnvinfer1::IHostMemory* buildEngineYolov12Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             int& max_channels, std::string& type);\n\nnvinfer1::IHostMemory* buildEngineYolov12Seg(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             int& max_channels, std::string& type);\n\nnvinfer1::IHostMemory* buildEngineYolov12Cls(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             std::string& type, int max_channels);\n"
  },
  {
    "path": "yolov12-tubro/include/postprocess.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\n// Preprocessing functions\ncv::Rect get_rect(cv::Mat& img, float bbox[4]);\n\n// Processing functions\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch);\n\nvoid batch_process_obb(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                       int bbox_element, const std::vector<cv::Mat>& img_batch);\n\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count);\n\nvoid process_decode_ptr_host_obb(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element,\n                                 cv::Mat& img, int count);\n\n// NMS functions\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh = 0.5);\n\nvoid batch_nms(std::vector<std::vector<Detection>>& batch_res, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh = 0.5);\n\nvoid nms_obb(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh = 0.5);\n\nvoid batch_nms_obb(std::vector<std::vector<Detection>>& batch_res, float* output, int batch_size, int output_size,\n                   float conf_thresh, float nms_thresh = 0.5);\n\n// CUDA-related functions\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 cudaStream_t stream);\n\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream);\n\nvoid cuda_decode_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                     cudaStream_t stream);\n\nvoid cuda_nms_obb(float* parray, float 
nms_threshold, int max_objects, cudaStream_t stream);\n\n// Drawing functions\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_bbox_obb(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_bbox_keypoints_line(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks,\n                    std::unordered_map<int, std::string>& labels_map);\n"
  },
  {
    "path": "yolov12-tubro/include/preprocess.h",
    "content": "#pragma once\n\n#include <map>\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\nvoid cuda_preprocess_init(int max_image_size);\n\nvoid cuda_preprocess_destroy();\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream);\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream);\n"
  },
  {
    "path": "yolov12-tubro/include/types.h",
    "content": "#pragma once\n#include \"config.h\"\n\nstruct alignas(float) Detection {\n    //center_x center_y w h\n    float bbox[4];\n    float conf;  // bbox_conf * cls_conf\n    float class_id;\n    float mask[32];\n    float keypoints[kNumberOfPoints * 3];  // 17*3 keypoints\n    float angle;                           // obb angle\n};\n\nstruct AffineMatrix {\n    float value[6];\n};\n\nconst int bbox_element =\n        sizeof(AffineMatrix) / sizeof(float) + 1;  // left, top, right, bottom, confidence, class, keepflag\n"
  },
  {
    "path": "yolov12-tubro/include/utils.h",
    "content": "#pragma once\n#include <dirent.h>\n#include <fstream>\n#include <opencv2/opencv.hpp>\n\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols * 1.0);\n    float r_h = input_h / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR* p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            //            std::cout << \"Found file: \" << cur_file_name << std::endl;\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n// Function to trim leading and trailing whitespace from a string\nstatic inline std::string trim_leading_whitespace(const std::string& str) {\n    size_t first = str.find_first_not_of(' ');\n    if (std::string::npos == first) {\n        return str;\n    }\n    size_t last = str.find_last_not_of(' ');\n    return str.substr(first, (last - first + 1));\n}\n\n// Src: https://stackoverflow.com/questions/16605967\nstatic inline std::string 
to_string_with_precision(const float a_value, const int n = 2) {\n    std::ostringstream out;\n    out.precision(n);\n    out << std::fixed << a_value;\n    return out.str();\n}\n\nstatic inline int read_labels(const std::string& labels_filename, std::unordered_map<int, std::string>& labels_map) {\n    std::ifstream file(labels_filename);\n    // Report failure instead of silently returning an empty map\n    if (!file.is_open()) {\n        return -1;\n    }\n    // Read each line of the file\n    std::string line;\n    int index = 0;\n    while (std::getline(file, line)) {\n        // Strip the line of any leading or trailing spaces\n        line = trim_leading_whitespace(line);\n\n        // Add the stripped line to the labels_map, using the line index as the key\n        labels_map[index] = line;\n        index++;\n    }\n    // Close the file\n    file.close();\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov12-tubro/plugin/yololayer.cu",
    "content": "#include <assert.h>\n#include <math.h>\n#include <iostream>\n#include <vector>\n#include \"cuda_utils.h\"\n#include \"types.h\"\n#include \"yololayer.h\"\n\nnamespace Tn {\ntemplate <typename T>\nvoid write(char*& buffer, const T& val) {\n    *reinterpret_cast<T*>(buffer) = val;\n    buffer += sizeof(T);\n}\n\ntemplate <typename T>\nvoid read(const char*& buffer, T& val) {\n    val = *reinterpret_cast<const T*>(buffer);\n    buffer += sizeof(T);\n}\n}  // namespace Tn\n\n__device__ float sigmoid(float x) {\n    return 1.0f / (1.0f + exp(-x));\n}\n\nnamespace nvinfer1 {\nYoloLayerPlugin::YoloLayerPlugin(int classCount, int numberofpoints, float confthreshkeypoints, int netWidth,\n                                 int netHeight, int maxOut, bool is_segmentation, bool is_pose, bool is_obb,\n                                 const int* strides, int stridesLength) {\n\n    mClassCount = classCount;\n    mNumberofpoints = numberofpoints;\n    mConfthreshkeypoints = confthreshkeypoints;\n    mYoloV8NetWidth = netWidth;\n    mYoloV8netHeight = netHeight;\n    mMaxOutObject = maxOut;\n    mStridesLength = stridesLength;\n    mStrides = new int[stridesLength];\n    memcpy(mStrides, strides, stridesLength * sizeof(int));\n    is_segmentation_ = is_segmentation;\n    is_pose_ = is_pose;\n    is_obb_ = is_obb;\n}\n\nYoloLayerPlugin::~YoloLayerPlugin() {\n    if (mStrides != nullptr) {\n        delete[] mStrides;\n        mStrides = nullptr;\n    }\n}\n\nYoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length) {\n    using namespace Tn;\n    const char *d = reinterpret_cast<const char*>(data), *a = d;\n    read(d, mClassCount);\n    read(d, mNumberofpoints);\n    read(d, mConfthreshkeypoints);\n    read(d, mThreadCount);\n    read(d, mYoloV8NetWidth);\n    read(d, mYoloV8netHeight);\n    read(d, mMaxOutObject);\n    read(d, mStridesLength);\n    mStrides = new int[mStridesLength];\n    for (int i = 0; i < mStridesLength; ++i) {\n        read(d, 
mStrides[i]);\n    }\n    read(d, is_segmentation_);\n    read(d, is_pose_);\n    read(d, is_obb_);\n\n    assert(d == a + length);\n}\n\nvoid YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT {\n\n    using namespace Tn;\n    char *d = static_cast<char*>(buffer), *a = d;\n    write(d, mClassCount);\n    write(d, mNumberofpoints);\n    write(d, mConfthreshkeypoints);\n    write(d, mThreadCount);\n    write(d, mYoloV8NetWidth);\n    write(d, mYoloV8netHeight);\n    write(d, mMaxOutObject);\n    write(d, mStridesLength);\n    for (int i = 0; i < mStridesLength; ++i) {\n        write(d, mStrides[i]);\n    }\n    write(d, is_segmentation_);\n    write(d, is_pose_);\n    write(d, is_obb_);\n\n    assert(d == a + getSerializationSize());\n}\n\nsize_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT {\n    return sizeof(mClassCount) + sizeof(mNumberofpoints) + sizeof(mConfthreshkeypoints) + sizeof(mThreadCount) +\n           sizeof(mYoloV8netHeight) + sizeof(mYoloV8NetWidth) + sizeof(mMaxOutObject) + sizeof(mStridesLength) +\n           sizeof(int) * mStridesLength + sizeof(is_segmentation_) + sizeof(is_pose_) + sizeof(is_obb_);\n}\n\nint YoloLayerPlugin::initialize() TRT_NOEXCEPT {\n    return 0;\n}\n\nnvinfer1::Dims YoloLayerPlugin::getOutputDimensions(int index, const nvinfer1::Dims* inputs,\n                                                    int nbInputDims) TRT_NOEXCEPT {\n    int total_size = mMaxOutObject * sizeof(Detection) / sizeof(float);\n    return nvinfer1::Dims3(total_size + 1, 1, 1);\n}\n\nvoid YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT {\n    mPluginNamespace = pluginNamespace;\n}\n\nconst char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT {\n    return mPluginNamespace;\n}\n\nnvinfer1::DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes,\n                                                      int nbInputs) const TRT_NOEXCEPT {\n    return 
nvinfer1::DataType::kFLOAT;\n}\n\nbool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                                   int nbInputs) const TRT_NOEXCEPT {\n\n    return false;\n}\n\nbool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT {\n\n    return false;\n}\n\nvoid YoloLayerPlugin::configurePlugin(nvinfer1::PluginTensorDesc const* in, int nbInput,\n                                      nvinfer1::PluginTensorDesc const* out, int nbOutput) TRT_NOEXCEPT {}\n\nvoid YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                                      IGpuAllocator* gpuAllocator) TRT_NOEXCEPT {}\n\nvoid YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\nconst char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT {\n\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nvoid YoloLayerPlugin::destroy() TRT_NOEXCEPT {\n    delete this;\n}\n\nnvinfer1::IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT {\n\n    YoloLayerPlugin* p =\n            new YoloLayerPlugin(mClassCount, mNumberofpoints, mConfthreshkeypoints, mYoloV8NetWidth, mYoloV8netHeight,\n                                mMaxOutObject, is_segmentation_, is_pose_, is_obb_, mStrides, mStridesLength);\n    p->setPluginNamespace(mPluginNamespace);\n    return p;\n}\n\n// The parameter qualifiers must match the declaration in yololayer.h for every TensorRT version\nint YoloLayerPlugin::enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs,\n                             void* workspace, cudaStream_t stream) TRT_NOEXCEPT {\n    forwardGpu((const float* const*)inputs, (float*)outputs[0], stream, mYoloV8netHeight, mYoloV8NetWidth, batchSize);\n    return 0;\n}\n\n__device__ float Logist(float data) {\n    return 1.0f / (1.0f + expf(-data));\n}\n\n__global__ void CalDetection(const float* input, float* output, int numElements, int maxoutobject, const int 
grid_h,\n                             int grid_w, const int stride, int classes, int nk, float confkeypoints, int outputElem,\n                             bool is_segmentation, bool is_pose, bool is_obb) {\n    int idx = threadIdx.x + blockDim.x * blockIdx.x;\n    if (idx >= numElements)\n        return;\n\n    const int N_kpts = nk;\n    int total_grid = grid_h * grid_w;\n    int info_len = 4 + classes + (is_segmentation ? 32 : 0) + (is_pose ? N_kpts * 3 : 0) + (is_obb ? 1 : 0);\n    int batchIdx = idx / total_grid;\n    int elemIdx = idx % total_grid;\n    const float* curInput = input + batchIdx * total_grid * info_len;\n    int outputIdx = batchIdx * outputElem;\n\n    int class_id = 0;\n    float max_cls_prob = 0.0;\n    for (int i = 4; i < 4 + classes; i++) {\n        float p = Logist(curInput[elemIdx + i * total_grid]);\n        if (p > max_cls_prob) {\n            max_cls_prob = p;\n            class_id = i - 4;\n        }\n    }\n\n    if (max_cls_prob < 0.1)\n        return;\n\n    int count = (int)atomicAdd(output + outputIdx, 1);\n    if (count >= maxoutobject)\n        return;\n    char* data = (char*)(output + outputIdx) + sizeof(float) + count * sizeof(Detection);\n    Detection* det = (Detection*)(data);\n\n    int row = elemIdx / grid_w;\n    int col = elemIdx % grid_w;\n\n    det->conf = max_cls_prob;\n    det->class_id = class_id;\n    det->bbox[0] = (col + 0.5f - curInput[elemIdx + 0 * total_grid]) * stride;\n    det->bbox[1] = (row + 0.5f - curInput[elemIdx + 1 * total_grid]) * stride;\n    det->bbox[2] = (col + 0.5f + curInput[elemIdx + 2 * total_grid]) * stride;\n    det->bbox[3] = (row + 0.5f + curInput[elemIdx + 3 * total_grid]) * stride;\n\n    if (is_segmentation) {\n        for (int k = 0; k < 32; ++k) {\n            det->mask[k] =\n                    curInput[elemIdx + (4 + classes + (is_pose ? N_kpts * 3 : 0) + (is_obb ? 
1 : 0) + k) * total_grid];\n        }\n    }\n\n    if (is_pose) {\n        for (int kpt = 0; kpt < N_kpts; kpt++) {\n            int kpt_x_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3) * total_grid;\n            int kpt_y_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3 + 1) * total_grid;\n            int kpt_conf_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3 + 2) * total_grid;\n\n            float kpt_confidence = sigmoid(curInput[elemIdx + kpt_conf_idx]);\n\n            float kpt_x = (curInput[elemIdx + kpt_x_idx] * 2.0 + col) * stride;\n            float kpt_y = (curInput[elemIdx + kpt_y_idx] * 2.0 + row) * stride;\n\n            bool is_within_bbox =\n                    kpt_x >= det->bbox[0] && kpt_x <= det->bbox[2] && kpt_y >= det->bbox[1] && kpt_y <= det->bbox[3];\n\n            if (kpt_confidence < confkeypoints || !is_within_bbox) {\n                det->keypoints[kpt * 3] = -1;\n                det->keypoints[kpt * 3 + 1] = -1;\n                det->keypoints[kpt * 3 + 2] = -1;\n            } else {\n                det->keypoints[kpt * 3] = kpt_x;\n                det->keypoints[kpt * 3 + 1] = kpt_y;\n                det->keypoints[kpt * 3 + 2] = kpt_confidence;\n            }\n        }\n    }\n\n    if (is_obb) {\n        double pi = CV_PI;\n        auto angle_inx = curInput[elemIdx + (4 + classes + (is_segmentation ? 32 : 0) + (is_pose ? 
N_kpts * 3 : 0) +\n                                             0) * total_grid];\n        auto angle = (sigmoid(angle_inx) - 0.25f) * pi;\n\n        auto cos1 = cos(angle);\n        auto sin1 = sin(angle);\n        auto xf = (curInput[elemIdx + 2 * total_grid] - curInput[elemIdx + 0 * total_grid]) / 2;\n        auto yf = (curInput[elemIdx + 3 * total_grid] - curInput[elemIdx + 1 * total_grid]) / 2;\n\n        auto x = xf * cos1 - yf * sin1;\n        auto y = xf * sin1 + yf * cos1;\n\n        float cx = (col + 0.5f + x) * stride;\n        float cy = (row + 0.5f + y) * stride;\n\n        float w1 = (curInput[elemIdx + 0 * total_grid] + curInput[elemIdx + 2 * total_grid]) * stride;\n        float h1 = (curInput[elemIdx + 1 * total_grid] + curInput[elemIdx + 3 * total_grid]) * stride;\n        det->bbox[0] = cx;\n        det->bbox[1] = cy;\n        det->bbox[2] = w1;\n        det->bbox[3] = h1;\n        det->angle = angle;\n    }\n}\n\nvoid YoloLayerPlugin::forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV8netHeight,\n                                 int mYoloV8NetWidth, int batchSize) {\n    int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n    cudaMemsetAsync(output, 0, sizeof(float), stream);\n    for (int idx = 0; idx < batchSize; ++idx) {\n        CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n    }\n    int numElem = 0;\n\n    //    const int maxGrids = mStridesLength;\n    //    int grids[maxGrids][2];\n    //    for (int i = 0; i < maxGrids; ++i) {\n    //        grids[i][0] = mYoloV8netHeight / mStrides[i];\n    //        grids[i][1] = mYoloV8NetWidth / mStrides[i];\n    //    }\n\n    int maxGrids = mStridesLength;\n    int flatGridsLen = 2 * maxGrids;\n    int* flatGrids = new int[flatGridsLen];\n\n    for (int i = 0; i < maxGrids; ++i) {\n        flatGrids[2 * i] = mYoloV8netHeight / mStrides[i];\n        flatGrids[2 * i + 1] = mYoloV8NetWidth / mStrides[i];\n    
}\n\n    for (int i = 0; i < maxGrids; i++) {\n        // Access the elements of the original 2D array from the flattened 1D array\n        int grid_h = flatGrids[2 * i];      // Corresponds to the access of grids[i][0]\n        int grid_w = flatGrids[2 * i + 1];  // Corresponds to the access of grids[i][1]\n        int stride = mStrides[i];\n        numElem = grid_h * grid_w * batchSize;  // Total number of grid cells across the batch\n        if (numElem <= 0) {\n            continue;  // Nothing to launch; also avoids a zero-sized block below\n        }\n        // Use a local thread count so that a small grid does not permanently shrink mThreadCount\n        int threadCount = numElem < mThreadCount ? numElem : mThreadCount;\n\n        CalDetection<<<(numElem + threadCount - 1) / threadCount, threadCount, 0, stream>>>(\n                inputs[i], output, numElem, mMaxOutObject, grid_h, grid_w, stride, mClassCount, mNumberofpoints,\n                mConfthreshkeypoints, outputElem, is_segmentation_, is_pose_, is_obb_);\n    }\n\n    delete[] flatGrids;\n}\n\nPluginFieldCollection YoloPluginCreator::mFC{};\nstd::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\nYoloPluginCreator::YoloPluginCreator() {\n    mPluginAttributes.clear();\n    mFC.nbFields = mPluginAttributes.size();\n    mFC.fields = mPluginAttributes.data();\n}\n\nconst char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT {\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nconst PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT {\n    return &mFC;\n}\n\nIPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT {\n    assert(fc->nbFields == 1);\n    assert(strcmp(fc->fields[0].name, \"combinedInfo\") == 0);\n    const int* combinedInfo = static_cast<const int*>(fc->fields[0].data);\n    int netinfo_count = 9;\n    int class_count = combinedInfo[0];\n    int numberofpoints = combinedInfo[1];\n    float confthreshkeypoints = combinedInfo[2];\n    
int input_w = combinedInfo[3];\n    int input_h = combinedInfo[4];\n    int max_output_object_count = combinedInfo[5];\n    bool is_segmentation = combinedInfo[6];\n    bool is_pose = combinedInfo[7];\n    bool is_obb = combinedInfo[8];\n    const int* px_arry = combinedInfo + netinfo_count;\n    int px_arry_length = fc->fields[0].length - netinfo_count;\n    YoloLayerPlugin* obj =\n            new YoloLayerPlugin(class_count, numberofpoints, confthreshkeypoints, input_w, input_h,\n                                max_output_object_count, is_segmentation, is_pose, is_obb, px_arry, px_arry_length);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\nIPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData,\n                                                     size_t serialLength) TRT_NOEXCEPT {\n    // This object will be deleted when the network is destroyed, which will\n    // call YoloLayerPlugin::destroy()\n    YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolov12-tubro/plugin/yololayer.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\nnamespace nvinfer1 {\nclass API YoloLayerPlugin : public IPluginV2IOExt {\n   public:\n    YoloLayerPlugin(int classCount, int numberofpoints, float confthreshkeypoints, int netWidth, int netHeight,\n                    int maxOut, bool is_segmentation, bool is_pose, bool is_obb, const int* strides, int stridesLength);\n\n    YoloLayerPlugin(const void* data, size_t length);\n\n    ~YoloLayerPlugin();\n\n    int getNbOutputs() const TRT_NOEXCEPT override { return 1; }\n\n    nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n    int initialize() TRT_NOEXCEPT override;\n\n    virtual void terminate() TRT_NOEXCEPT override {}\n\n    virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n    virtual int enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace,\n                        cudaStream_t stream) TRT_NOEXCEPT override;\n\n    virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n    virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n    bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs,\n                                   int nbOutputs) const TRT_NOEXCEPT override {\n        return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n    }\n\n    const char* getPluginType() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    void destroy() TRT_NOEXCEPT override;\n\n    IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n    nvinfer1::DataType 
getOutputDataType(int32_t index, nvinfer1::DataType const* inputTypes,\n                                         int32_t nbInputs) const TRT_NOEXCEPT;\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                      int nbInputs) const TRT_NOEXCEPT override;\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n    void attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                         IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n    void configurePlugin(PluginTensorDesc const* in, int32_t nbInput, PluginTensorDesc const* out,\n                         int32_t nbOutput) TRT_NOEXCEPT override;\n\n    void detachFromContext() TRT_NOEXCEPT override;\n\n   private:\n    void forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV8netHeight,\n                    int mYoloV8NetWidth, int batchSize);\n\n    int mThreadCount = 256;\n    const char* mPluginNamespace;\n    int mClassCount;\n    int mNumberofpoints;\n    float mConfthreshkeypoints;\n    int mYoloV8NetWidth;\n    int mYoloV8netHeight;\n    int mMaxOutObject;\n    bool is_segmentation_;\n    bool is_pose_;\n    bool is_obb_;\n    int* mStrides;\n    int mStridesLength;\n};\n\nclass API YoloPluginCreator : public IPluginCreator {\n   public:\n    YoloPluginCreator();\n\n    ~YoloPluginCreator() override = default;\n\n    const char* getPluginName() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    const nvinfer1::PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* createPlugin(const char* name,\n                                           const nvinfer1::PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData,\n                                                size_t 
serialLength) TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override { mNamespace = libNamespace; }\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override { return mNamespace.c_str(); }\n\n   private:\n    std::string mNamespace;\n    static PluginFieldCollection mFC;\n    static std::vector<PluginField> mPluginAttributes;\n};\n\nREGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolov12-tubro/readme.md",
    "content": "## Introduction\n\nYolov12 model supports TensorRT-8.\n\nDetection training code [link](https://github.com/sunsmarterjie/yolov12/releases/tag/turbo)\nSegment training code[link](https://github.com/sunsmarterjie/yolov12/releases/tag/seg)\nClassify training code[link](https://github.com/sunsmarterjie/yolov12/releases/tag/cls)\n\n## Environment\n\n* cuda 11.6\n* cudnn 8.9.1.23\n* tensorrt 8.6.1.6\n* opencv 4.8.0\n* ultralytics 8.3.63\n\n## Support\n\n* [x] YOLO12-det support FP32/FP16 and C++ API\n* [x] YOLO12-seg support FP32/FP16 and C++ API\n* [x] YOLO12-cls support FP32/FP16 and C++ API\n\n\n## Config\n\n* Choose the YOLO12 sub-model n/s/m/l/x from command line arguments.\n* Other configs please check [src/config.h](src/config.h)\n\n## Build and Run (Detection)\n\n1. generate .wts from pytorch with .pt, or download .wts from model zoo\n\n```shell\n# You are supposed to train your own models instead of using the pre-trained models\n# to download other models, replace 'yolov12n.pt' with 'yolov12s.pt', 'yolov12m.pt', 'yolov12l.pt' or 'yolov12x.pt'\n# Generate .wts\ncp [PATH-TO-TENSORRTX]/yolov12/gen_wts.py .\npython gen_wts.py -w yolov12n.pt -o yolov12n.wts -t detect\n# A file 'yolov12n.wts' will be generated.\n```\n\n2. build tensorrtx/yolov12 and run\n```shell\ncd [PATH-TO-TENSORRTX]/yolov12\nmkdir build\ncd build\ncmake ..\nmake\n```\n\n\n\n## Build and Run (Segment)\n\n1. generate .wts from pytorch with .pt, or download .wts from model zoo\n\n```shell\n# You are supposed to train your own models instead of using the pre-trained models\nto download other models, replace 'yolov12n-seg.pt' with 'yolov12s-seg.pt', 'yolov12m-seg.pt', 'yolov12l-seg.pt' or 'yolov12x-seg.pt'\n# Generate .wts\ncp [PATH-TO-TENSORRTX]/yolov12/gen_wts.py .\npython gen_wts.py -w yolov12n.pt -o yolov12n.wts -t seg\n# A file 'yolov12n.wts' will be generated.\n```\n\n2. 
build tensorrtx/yolov12 and run\n```shell\ncd [PATH-TO-TENSORRTX]/yolov12\nmkdir build\ncd build\ncmake ..\nmake\n```\n\n## Build and Run (Classify)\n\n1. generate .wts from pytorch with .pt, or download .wts from model zoo\n\n```shell\n# Download ultralytics\n\n# You are supposed to train your own models instead of using the pre-trained models\n# To download other models, replace 'yolov12n-cls.pt' with 'yolov12s-cls.pt', 'yolov12m-cls.pt', 'yolov12l-cls.pt' or 'yolov12x-cls.pt'\n# Generate .wts\ncp [PATH-TO-TENSORRTX]/yolov12/gen_wts.py .\npython gen_wts.py -w yolov12n-cls.pt -t cls -o yolov12n-cls.wts\n# A file 'yolov12n-cls.wts' will be generated.\n```\n\n2. build tensorrtx/yolov12 and run\n```shell\ncd [PATH-TO-TENSORRTX]/yolov12\nmkdir build\ncd build\ncmake ..\nmake\n```\n\n### Detection\n```shell\ncp [PATH-TO-ultralytics]/yolov12n.wts .\n# Build and serialize TensorRT engine\n./yolov12_det -s yolov12n.wts yolov12n.engine [n/s/m/l/x]\n# Run inference\n./yolov12_det -d yolov12n.engine ../images [c/g]\n# Results are saved in the build directory\n```\n\n### Segment\n```shell\ncp [PATH-TO-ultralytics]/yolov12n-seg.wts .\n# Build and serialize TensorRT engine\n./yolov12-seg -s yolov12n-seg.wts yolov12n-seg.engine [n/s/m/l/x]\n# Run inference\n./yolov12-seg -d yolov12n-seg.engine ../images\n# Results are saved in the build directory\n```\n\n\n\n### Classify\n```shell\ncp [PATH-TO-ultralytics]/yolov12n-cls.wts .\n# Build and serialize TensorRT engine\n./yolov12-cls -s yolov12n-cls.wts yolov12n-cls.engine [n/s/m/l/x]\n# Run inference\n./yolov12-cls -d yolov12n-cls.engine ../images\n# Results are saved in the build directory\n```\n\n## More Information\nSee the readme on the [home page](https://github.com/wang-xinyu/tensorrtx).\n"
  },
  {
    "path": "yolov12-tubro/src/block.cpp",
    "content": "#include \"block.h\"\n#include <assert.h>\n#include <math.h>\n#include <fstream>\n#include <iostream>\n#include \"config.h\"\n#include \"model.h\"\n#include \"yololayer.h\"\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, nvinfer1::Weights> WeightMap;\n\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = nvinfer1::DataType::kFLOAT;\n\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; x++) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        wt.count = size;\n        WeightMap[name] = wt;\n        // std::cout << \"===========name:              \" << name << std::endl;\n    }\n    return WeightMap;\n}\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps) {\n\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] 
= gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights scale{nvinfer1::DataType::kFLOAT, scval, len};\n\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, shval, len};\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, pval, len};\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    nvinfer1::IScaleLayer* output = network->addScale(input, nvinfer1::ScaleMode::kCHANNEL, shift, scale, power);\n    assert(output);\n    return output;\n}\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, std::vector<int> k, int s, std::string lname, int p, int g, int d) {\n\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n\n    // Check the layer before it is first used.\n    assert(conv);\n    conv->setNbGroups(g);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    // auto pad: half the kernel size in each dimension\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            
network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nstatic nvinfer1::ILayer* bottleneck(nvinfer1::INetworkDefinition* network,\n                                    std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int c1, int c2, bool shortcut, std::vector<int> k1, std::vector<int> k2, float e,\n                                    int g, std::string lname) {\n    int c_ = (int)((float)c2 * e);\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c_, k1, 1, lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *conv1->getOutput(0), c2, k2, 1, lname + \".cv2\", 0, g);\n\n    if (shortcut && c1 == c2) {\n        nvinfer1::IElementWiseLayer* ew =\n                network->addElementWise(input, *conv2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        return ew;\n    }\n    return conv2;\n}\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname) {\n\n    nvinfer1::IShuffleLayer* shuffle1 = network->addShuffle(input);\n    shuffle1->setReshapeDimensions(nvinfer1::Dims4{kBatchSize, 4, 16, grid});\n    shuffle1->setSecondTranspose(nvinfer1::Permutation{0, 2, 1, 3});\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*shuffle1->getOutput(0));\n    softmax->setAxes(1 << 1);\n\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(*softmax->getOutput(0), 1, nvinfer1::DimsHW{1, 1}, weightMap[lname], bias_empty);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n\n    nvinfer1::IShuffleLayer* shuffle2 
= network->addShuffle(*conv->getOutput(0));\n    shuffle2->setReshapeDimensions(nvinfer1::Dims3{kBatchSize, 4, grid});\n\n    return shuffle2;\n}\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> dets, const int* px_arry,\n                                       int px_arry_num, bool is_segmentation, bool is_pose, bool is_obb) {\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    const int netinfo_count = 9;  // The first 9 elements hold the netinfo values filled in below.\n    const int total_count = netinfo_count + px_arry_num;  // Total number of elements for netinfo and px_arry combined.\n\n    std::vector<int> combinedInfo(total_count);\n    int class_num = kNumClass;\n    if (is_pose)\n        class_num = kPoseNumClass;\n    else if (is_obb)\n        class_num = kObbNumClass;\n    int input_w = kInputW;\n    if (is_obb)\n        input_w = kObbInputW;\n    int input_h = kInputH;\n    if (is_obb)\n        input_h = kObbInputH;\n    // Fill in the 9 netinfo elements.\n    combinedInfo[0] = class_num;\n    combinedInfo[1] = kNumberOfPoints;\n    combinedInfo[2] = kConfThreshKeypoints;\n    combinedInfo[3] = input_w;\n    combinedInfo[4] = input_h;\n    combinedInfo[5] = kMaxNumOutputBbox;\n    combinedInfo[6] = is_segmentation;\n    combinedInfo[7] = is_pose;\n    combinedInfo[8] = is_obb;\n\n    // Copy the contents of px_arry into the combinedInfo vector after the netinfo elements.\n    std::copy(px_arry, px_arry + px_arry_num, combinedInfo.begin() + netinfo_count);\n\n    // Now let's create the PluginField object to hold this combined information.\n    nvinfer1::PluginField pluginField;\n    pluginField.name = \"combinedInfo\";  // This can be any name that the plugin will recognize\n    pluginField.data = combinedInfo.data();\n    pluginField.type = 
nvinfer1::PluginFieldType::kINT32;\n    pluginField.length = combinedInfo.size();\n\n    // Create the PluginFieldCollection to hold the PluginField object.\n    nvinfer1::PluginFieldCollection pluginFieldCollection;\n    pluginFieldCollection.nbFields = 1;  // We have just one field, but it's a combined array\n    pluginFieldCollection.fields = &pluginField;\n\n    // Create the plugin object using the PluginFieldCollection.\n    nvinfer1::IPluginV2* pluginObject = creator->createPlugin(\"yololayer\", &pluginFieldCollection);\n\n    // We assume that the plugin is to be added onto the network.\n    // Prepare input tensors for the YOLO Layer.\n    std::vector<nvinfer1::ITensor*> inputTensors;\n    for (auto det : dets) {\n        inputTensors.push_back(det->getOutput(0));  // Assuming each IConcatenationLayer has one output tensor.\n    }\n\n    // Add the plugin to the network using the prepared input tensors.\n    nvinfer1::IPluginV2Layer* yoloLayer = network->addPluginV2(inputTensors.data(), inputTensors.size(), *pluginObject);\n\n    return yoloLayer;  // Return the added YOLO layer.\n}\n\nnvinfer1::ILayer* Conv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                       nvinfer1::ITensor& input, int c_out, std::string lname, int k, int s, int padding, int g,\n                       bool act) {\n    nvinfer1::Weights emptywts{nvinfer1::DataType::kFLOAT, nullptr, 0};\n\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, c_out, nvinfer1::DimsHW{k, k},\n                                                                  weightMap[lname + \".conv.weight\"], emptywts);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    // auto pad\n    int p0 = k / 2;\n    int p1 = k / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n    conv->setNbGroups(g);\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n    if 
(act) {\n        nvinfer1::IActivationLayer* sigmoid =\n                network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n        nvinfer1::IElementWiseLayer* ew = network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0),\n                                                                  nvinfer1::ElementWiseOperation::kPROD);\n        assert(ew);\n        return ew;\n    } else\n        return bn;\n}\n\nnvinfer1::ILayer* DWConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setNbGroups(ch);\n    // auto pad\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nnvinfer1::IElementWiseLayer* C3k(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c2,\n                                 std::string lname, int n, bool shortcut, int g, float e, int k) {\n    int c_ = c2 * float(e);\n\n    nvinfer1::IElementWiseLayer* cv1 = 
convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* cv2 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv2\");\n    nvinfer1::ITensor* y = cv1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        nvinfer1::ILayer* b = bottleneck(network, weightMap, *y, c_, c_, shortcut, {k, k}, {k, k}, 1.0, g,\n                                         lname + \".m.\" + std::to_string(i));\n        y = b->getOutput(0);\n    }\n    nvinfer1::ITensor* inputTensor[] = {y, cv2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor, 2);\n    nvinfer1::IElementWiseLayer* cv3 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv3\");\n\n    return cv3;\n}\n\nnvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c2,\n                                  int n, std::string lname, bool c3k, float e, int g, bool shortcut) {\n    int c = int(c2 * float(e));\n    nvinfer1::ILayer* cv1 = Conv(network, weightMap, input, 2 * c, lname + \".cv1\", 1, 1);\n    nvinfer1::ISliceLayer* sl0 = network->addSlice(\n            *cv1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n            nvinfer1::Dims4{cv1->getOutput(0)->getDimensions().d[0], cv1->getOutput(0)->getDimensions().d[1] / 2,\n                            cv1->getOutput(0)->getDimensions().d[2], cv1->getOutput(0)->getDimensions().d[3]},\n            nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* sl1 = network->addSlice(\n            *cv1->getOutput(0), nvinfer1::Dims4{0, cv1->getOutput(0)->getDimensions().d[1] / 2, 0, 0},\n            nvinfer1::Dims4{cv1->getOutput(0)->getDimensions().d[0], cv1->getOutput(0)->getDimensions().d[1] / 2,\n                            cv1->getOutput(0)->getDimensions().d[2], cv1->getOutput(0)->getDimensions().d[3]},\n  
          nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ITensor* inputTensor0[] = {sl0->getOutput(0), sl1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor0, 2);\n    nvinfer1::ITensor* current = sl1->getOutput(0);\n\n    for (int i = 0; i < n; i++) {\n        nvinfer1::ILayer* b;\n        if (c3k) {\n            b = C3k(network, weightMap, *current, c, lname + \".m.\" + std::to_string(i), 2, shortcut, g);\n        } else {\n            b = bottleneck(network, weightMap, *current, c, c, shortcut, {3, 3}, {3, 3}, 0.5, g,\n                           lname + \".m.\" + std::to_string(i));\n        }\n        current = b->getOutput(0);\n        nvinfer1::ITensor* inputTensors[] = {cat->getOutput(0), b->getOutput(0)};\n        cat = network->addConcatenation(inputTensors, 2);\n    }\n    nvinfer1::IElementWiseLayer* cv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n    return cv2;\n}\n\nnvinfer1::ILayer* AAttn(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int dim, int num_heads, std::string lname, int area) {\n\n    nvinfer1::Dims d_input = input.getDimensions();\n    int B = d_input.d[0];\n    int C = d_input.d[1];\n    int H = d_input.d[2];\n    int W = d_input.d[3];\n    int N = W * H;\n    int head_dim = dim / num_heads;\n    int all_head_dim = head_dim * num_heads;\n\n    nvinfer1::ILayer* qk = Conv(network, weightMap, input, all_head_dim * 2, lname + \".qk\", 1, 1, 0, 1, false);\n    nvinfer1::IShuffleLayer* qk_flatten_t = network->addShuffle(*qk->getOutput(0));\n    qk_flatten_t->setReshapeDimensions(nvinfer1::Dims3{B, -1, N});\n    qk_flatten_t->setSecondTranspose(nvinfer1::Permutation{0, 2, 1});\n\n    nvinfer1::ILayer* v = Conv(network, weightMap, input, all_head_dim, lname + \".v\", 1, 1, 0, 1, false);\n    nvinfer1::IShuffleLayer* v_flatten_t = 
network->addShuffle(*v->getOutput(0));\n    v_flatten_t->setReshapeDimensions(nvinfer1::Dims3{B, -1, N});\n    v_flatten_t->setSecondTranspose(nvinfer1::Permutation{0, 2, 1});  // (1, 6400, 64)\n\n    nvinfer1::ILayer* pe = Conv(network, weightMap, *v->getOutput(0), dim, lname + \".pe\", 5, 1, 2, dim, false);\n\n    nvinfer1::ITensor* q_k = qk_flatten_t->getOutput(0);\n    nvinfer1::ITensor* v_ = v_flatten_t->getOutput(0);\n    if (area > 1) {\n        B = B * area;\n        N = N / area;\n\n        nvinfer1::IShuffleLayer* qk_reshape = network->addShuffle(*qk_flatten_t->getOutput(0));\n        qk_reshape->setReshapeDimensions(nvinfer1::Dims3{B, N, C * 2});\n        nvinfer1::IShuffleLayer* v_reshape = network->addShuffle(*v_flatten_t->getOutput(0));\n        v_reshape->setReshapeDimensions(nvinfer1::Dims3{B, N, C});\n\n        q_k = qk_reshape->getOutput(0);\n        v_ = v_reshape->getOutput(0);\n    }\n    nvinfer1::Dims q_k_dim = q_k->getDimensions();\n    nvinfer1::ISliceLayer* q =\n            network->addSlice(*q_k, nvinfer1::Dims3{0, 0, 0},\n                              nvinfer1::Dims3{q_k_dim.d[0], q_k_dim.d[1], q_k_dim.d[2] / 2}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* k =\n            network->addSlice(*q_k, nvinfer1::Dims3{0, 0, q_k_dim.d[2] / 2},\n                              nvinfer1::Dims3{q_k_dim.d[0], q_k_dim.d[1], q_k_dim.d[2] / 2}, nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* q_reshape = network->addShuffle(*q->getOutput(0));\n    q_reshape->setReshapeDimensions(nvinfer1::Dims4{B, N, num_heads, head_dim});\n    nvinfer1::IShuffleLayer* k_reshape = network->addShuffle(*k->getOutput(0));\n    k_reshape->setReshapeDimensions(nvinfer1::Dims4{B, N, num_heads, head_dim});\n    nvinfer1::IShuffleLayer* v_reshape = network->addShuffle(*v_);\n    v_reshape->setReshapeDimensions(nvinfer1::Dims4{B, N, num_heads, head_dim});\n\n    // (B, N, num_head, head_dim)--->(B, num_head, head_dim, N)\n    nvinfer1::IShuffleLayer* 
q_t_view = network->addShuffle(*q_reshape->getOutput(0));\n    q_t_view->setFirstTranspose(nvinfer1::Permutation{0, 2, 3, 1});\n\n    nvinfer1::IShuffleLayer* k_t_view = network->addShuffle(*k_reshape->getOutput(0));\n    k_t_view->setFirstTranspose(nvinfer1::Permutation{0, 2, 3, 1});\n    nvinfer1::IShuffleLayer* v_t_view = network->addShuffle(*v_reshape->getOutput(0));\n    v_t_view->setFirstTranspose(nvinfer1::Permutation{0, 2, 3, 1});\n\n    nvinfer1::IShuffleLayer* q_T = network->addShuffle(*q_t_view->getOutput(0));\n    q_T->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});  // (B, num_head, N, head_dim)\n    nvinfer1::IMatrixMultiplyLayer* q_mul_k =\n            network->addMatrixMultiply(*q_T->getOutput(0), nvinfer1::MatrixOperation::kNONE, *k_t_view->getOutput(0),\n                                       nvinfer1::MatrixOperation::kNONE);\n\n    float scale = 1.0 / sqrt(head_dim);\n    float* scale_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    scale_val[0] = scale;\n    nvinfer1::Weights s_w{nvinfer1::DataType::kFLOAT, scale_val, 1};  // scale\n    float* shift_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    shift_val[0] = 0;\n    nvinfer1::Weights sh_w{nvinfer1::DataType::kFLOAT, shift_val, 1};  // shift\n    float* power_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    power_val[0] = 1;\n    nvinfer1::Weights p_w{nvinfer1::DataType::kFLOAT, power_val, 1};  // power\n    nvinfer1::IScaleLayer* q_mul_k_scale =\n            network->addScale(*q_mul_k->getOutput(0), nvinfer1::ScaleMode::kUNIFORM, sh_w, s_w, p_w);\n\n    nvinfer1::IReduceLayer* attn_max =\n            network->addReduce(*q_mul_k_scale->getOutput(0), nvinfer1::ReduceOperation::kMAX, 1 << 3, true);\n\n    nvinfer1::IElementWiseLayer* attn_sub = network->addElementWise(\n            *q_mul_k_scale->getOutput(0), *attn_max->getOutput(0), nvinfer1::ElementWiseOperation::kSUB);\n    nvinfer1::IUnaryLayer* attn_exp = 
network->addUnary(*attn_sub->getOutput(0), nvinfer1::UnaryOperation::kEXP);\n    nvinfer1::IReduceLayer* attn_sum =\n            network->addReduce(*attn_exp->getOutput(0), nvinfer1::ReduceOperation::kSUM, 1 << 3, true);\n\n    nvinfer1::IElementWiseLayer* attn_div = network->addElementWise(*attn_exp->getOutput(0), *attn_sum->getOutput(0),\n                                                                    nvinfer1::ElementWiseOperation::kDIV);\n\n    nvinfer1::IShuffleLayer* attn_t = network->addShuffle(*attn_div->getOutput(0));\n    attn_t->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n\n    nvinfer1::IMatrixMultiplyLayer* attn_v =\n            network->addMatrixMultiply(*v_t_view->getOutput(0), nvinfer1::MatrixOperation::kNONE, *attn_t->getOutput(0),\n                                       nvinfer1::MatrixOperation::kNONE);\n\n    nvinfer1::IShuffleLayer* attn_v_t = network->addShuffle(*attn_v->getOutput(0));\n    attn_v_t->setFirstTranspose(nvinfer1::Permutation{0, 3, 1, 2});\n    nvinfer1::ITensor* attn_temp = attn_v_t->getOutput(0);\n    if (area > 1) {\n        B = B / area;\n        N = N * area;\n\n        nvinfer1::IShuffleLayer* attn_v_t_r = network->addShuffle(*attn_v_t->getOutput(0));\n        attn_v_t_r->setReshapeDimensions(nvinfer1::Dims3{B, N, C});\n        attn_temp = attn_v_t_r->getOutput(0);\n    }\n    nvinfer1::IShuffleLayer* attn_x = network->addShuffle(*attn_temp);\n    attn_x->setReshapeDimensions(nvinfer1::Dims4{B, H, W, C});\n    attn_x->setSecondTranspose(nvinfer1::Permutation{0, 3, 1, 2});\n    nvinfer1::IElementWiseLayer* x_add_pp =\n            network->addElementWise(*attn_x->getOutput(0), *pe->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    nvinfer1::ILayer* proj = Conv(network, weightMap, *x_add_pp->getOutput(0), dim, lname + \".proj\", 1, 1, 0, 1, false);\n\n    return proj;\n}\n\nnvinfer1::IElementWiseLayer* ABlock(nvinfer1::INetworkDefinition* network,\n                                    
std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int dim, int num_heads, std::string lname, float mlp_ratio, int area) {\n\n    nvinfer1::ILayer* attn = AAttn(network, weightMap, input, dim, num_heads, lname + \".attn\", area);\n    nvinfer1::IElementWiseLayer* add1 =  // x = x + self.attn(x)\n            network->addElementWise(input, *attn->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    int mlp_hidden_dim = int(dim * mlp_ratio);\n\n    nvinfer1::ILayer* mlp_0 =\n            Conv(network, weightMap, *add1->getOutput(0), mlp_hidden_dim, lname + \".mlp.0\", 1, 1, 0, 1, true);\n    nvinfer1::ILayer* mlp_1 = Conv(network, weightMap, *mlp_0->getOutput(0), dim, lname + \".mlp.1\", 1, 1, 0, 1, false);\n\n    nvinfer1::IElementWiseLayer* result =\n            network->addElementWise(*add1->getOutput(0), *mlp_1->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    return result;\n}\n\nnvinfer1::ILayer* A2C2f(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int c2, int n, std::string lname, bool a2, int area, bool residual,\n                        float mlp_ratio, float e, int g, bool shortcut) {\n\n    int c_ = static_cast<int>(c2 * e);\n    assert(c_ % 32 == 0 && \"Dimension of ABlock must be a multiple of 32\");\n    int num_heads = c_ / 32;\n\n    nvinfer1::ILayer* cv1 = Conv(network, weightMap, input, c_, lname + \".cv1\", 1, 1);\n    std::vector<nvinfer1::ITensor*> y{cv1->getOutput(0)};\n    nvinfer1::ITensor* current = cv1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        if (a2) {\n            nvinfer1::ILayer* m_0 = ABlock(network, weightMap, *current, c_, num_heads,\n                                           lname + \".m.\" + std::to_string(i) + \".0\", mlp_ratio, area);\n            nvinfer1::ILayer* m_1 = ABlock(network, weightMap, *m_0->getOutput(0), c_, num_heads,\n             
                              lname + \".m.\" + std::to_string(i) + \".1\", mlp_ratio, area);\n            current = m_1->getOutput(0);\n        } else {\n            // C3k\n            nvinfer1::ILayer* m =\n                    C3k(network, weightMap, *current, c_, lname + \".m.\" + std::to_string(i), 2, shortcut, g);\n            current = m->getOutput(0);\n        }\n        y.push_back(current);\n    }\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(y.data(), static_cast<int>(y.size()));\n    cat->setAxis(1);\n    nvinfer1::ILayer* cv2 = Conv(network, weightMap, *cat->getOutput(0), c2, lname + \".cv2\", 1, 1);\n\n    if (a2 && residual) {\n        // std::cout << lname << \" applying residual connection with gamma\" << std::endl;\n\n        nvinfer1::Weights gamma = weightMap[lname + \".gamma\"];\n\n        nvinfer1::IConstantLayer* gamma_layer = network->addConstant(nvinfer1::Dims4{1, c2, 1, 1}, gamma);\n        nvinfer1::IElementWiseLayer* scaled_output = network->addElementWise(\n                *gamma_layer->getOutput(0), *cv2->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n        nvinfer1::IElementWiseLayer* result =\n                network->addElementWise(input, *scaled_output->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n\n        return result;\n    } else {\n\n        return cv2;\n    }\n}\n"
  },
  {
    "path": "yolov12-tubro/src/calibrator.cpp",
    "content": "#include \"calibrator.h\"\n#include <fstream>\n#include <iostream>\n#include <iterator>\n#include <opencv2/dnn/dnn.hpp>\n#include \"cuda_utils.h\"\n#include \"utils.h\"\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir,\n                                               const char* calib_table_name, const char* input_blob_name,\n                                               bool read_cache)\n    : batchsize_(batchsize),\n      input_w_(input_w),\n      input_h_(input_h),\n      img_idx_(0),\n      img_dir_(img_dir),\n      calib_table_name_(calib_table_name),\n      input_blob_name_(input_blob_name),\n      read_cache_(read_cache) {\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2() {\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT {\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT {\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + \"/\" + img_files_[i]);\n        if (temp.empty()) {\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(pr_img);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0),\n                                           true, false);\n    
CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT {\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good()) {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT {\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n"
  },
  {
    "path": "yolov12-tubro/src/model.cpp",
    "content": "#include <math.h>\n#include <iostream>\n\n#include \"block.h\"\n//#include \"calibrator.h\"\n#include \"config.h\"\n#include \"model.h\"\n\nstatic int get_width(int x, float gw, int max_channels, int divisor = 8) {\n    auto channel = std::min(x, max_channels);\n    channel = int(ceil((channel * gw) / divisor)) * divisor;\n    return channel;\n}\n\nstatic int get_depth(int x, float gd) {\n    if (x == 1)\n        return 1;\n    int r = round(x * gd);\n    if (x * gd - int(x * gd) == 0.5 && (int(x * gd) % 2) == 0)\n        --r;\n    return std::max<int>(r, 1);\n}\nstatic nvinfer1::IElementWiseLayer* convBnSiLUProto(nvinfer1::INetworkDefinition* network,\n                                                    std::map<std::string, nvinfer1::Weights> weightMap,\n                                                    nvinfer1::ITensor& input, int ch, int k, int s, int p,\n                                                    std::string lname) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"], bias_empty);\n\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n    conv->setName((lname + \".conv\").c_str());\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n    bn->setName((lname + \".bn\").c_str());\n    // This concat operator is not used for calculation, in order to prevent the operator fusion unrealized error when int8 is quantized.\n    // Error Code 10: Internal Error (Could not find any implementation for node\n    // model.22.proto.cv3.conv + model.22.proto.cv3.sigmoid + PWN(PWN((Unnamed Layer* 353) [Activation]), PWN(model.22.proto.cv3.silu)).)\n\n#if defined(USE_INT8)\n    nvinfer1::ITensor* inputTensors[] = {bn->getOutput(0)};\n    auto concat = 
network->addConcatenation(inputTensors, 1);\n    nvinfer1::IActivationLayer* sigmoid =\n            network->addActivation(*concat->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    assert(sigmoid);\n    sigmoid->setName((lname + \".sigmoid\").c_str());\n    nvinfer1::IElementWiseLayer* ew = network->addElementWise(*concat->getOutput(0), *sigmoid->getOutput(0),\n                                                              nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    ew->setName((lname + \".silu\").c_str());\n#else\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    assert(sigmoid);\n    sigmoid->setName((lname + \".sigmoid\").c_str());\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    ew->setName((lname + \".silu\").c_str());\n\n#endif\n    return ew;\n}\n\nstatic nvinfer1::IElementWiseLayer* Proto(nvinfer1::INetworkDefinition* network,\n                                          std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input,\n                                          std::string lname, float gw, int max_channels) {\n    int mid_channel = get_width(256, gw, max_channels);\n    auto cv1 = convBnSiLU(network, weightMap, input, mid_channel, {3, 3}, 1, lname + \".cv1\");\n    //    float *convTranpsose_bais = (float *) weightMap[\"model.23.proto.upsample.bias\"].values;\n    //    int convTranpsose_bais_len = weightMap[\"model.23.proto.upsample.bias\"].count;\n    //    nvinfer1::Weights bias{nvinfer1::DataType::kFLOAT, convTranpsose_bais, convTranpsose_bais_len};\n    auto convTranpsose =\n            network->addDeconvolutionNd(*cv1->getOutput(0), mid_channel, nvinfer1::DimsHW{2, 2},\n                                        weightMap[lname + \".upsample.weight\"], weightMap[lname + \".upsample.bias\"]);\n\n    
assert(convTranpsose);\n    convTranpsose->setStrideNd(nvinfer1::DimsHW{2, 2});\n    convTranpsose->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    auto cv2 = convBnSiLU(network, weightMap, *convTranpsose->getOutput(0), mid_channel, {3, 3}, 1, lname + \".cv2\");\n    auto cv3 = convBnSiLUProto(network, weightMap, *cv2->getOutput(0), 32, 1, 1, 0, lname + \".cv3\");\n    assert(cv3);\n    return cv3;\n}\n\nstatic nvinfer1::IShuffleLayer* cv4_conv_combined(nvinfer1::INetworkDefinition* network,\n                                                  std::map<std::string, nvinfer1::Weights>& weightMap,\n                                                  nvinfer1::ITensor& input, std::string lname, int grid_shape, float gw,\n                                                  const std::string& algo_type, int max_channels) {\n    int nm_nk = 0;\n    int c4 = 0;\n\n    if (algo_type == \"seg\") {\n        nm_nk = 32;\n        c4 = std::max(get_width(256, gw, max_channels) / 4, nm_nk);\n    } else if (algo_type == \"pose\") {\n        nm_nk = kNumberOfPoints * 3;\n        c4 = std::max(get_width(256, gw, max_channels) / 4, kNumberOfPoints * 3);\n    }\n\n    auto cv0 = convBnSiLU(network, weightMap, input, c4, {3, 3}, 1, lname + \".0\");\n    auto cv1 = convBnSiLU(network, weightMap, *cv0->getOutput(0), c4, {3, 3}, 1, lname + \".1\");\n    float* cv2_bais_value = (float*)weightMap[lname + \".2\" + \".bias\"].values;\n    int cv2_bais_len = weightMap[lname + \".2\" + \".bias\"].count;\n    nvinfer1::Weights cv2_bais{nvinfer1::DataType::kFLOAT, cv2_bais_value, cv2_bais_len};\n    auto cv2 = network->addConvolutionNd(*cv1->getOutput(0), nm_nk, nvinfer1::DimsHW{1, 1},\n                                         weightMap[lname + \".2\" + \".weight\"], cv2_bais);\n    cv2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    nvinfer1::IShuffleLayer* cv2_shuffle = network->addShuffle(*cv2->getOutput(0));\n    cv2_shuffle->setReshapeDimensions(nvinfer1::Dims3{kBatchSize, nm_nk, grid_shape});\n\n    return 
cv2_shuffle;\n}\n\nvoid calculateStrides(nvinfer1::IElementWiseLayer* conv_layers[], int size, int reference_size, int strides[]) {\n    for (int i = 0; i < size; ++i) {\n        nvinfer1::ILayer* layer = conv_layers[i];\n        nvinfer1::Dims dims = layer->getOutput(0)->getDimensions();\n        int feature_map_size = dims.d[2];\n        strides[i] = reference_size / feature_map_size;\n    }\n}\n\nvoid calculateStrides(nvinfer1::ILayer* conv_layers[], int size, int reference_size, int strides[]) {\n    for (int i = 0; i < size; ++i) {\n        nvinfer1::ILayer* layer = conv_layers[i];\n        nvinfer1::Dims dims = layer->getOutput(0)->getDimensions();\n        int feature_map_size = dims.d[2];\n        strides[i] = reference_size / feature_map_size;\n    }\n}\n\nnvinfer1::IHostMemory* buildEngineYolov12Cls(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             std::string& type, int max_channels) {\n\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    nvinfer1::ITensor* data =\n            network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kClsInputH, kClsInputW});\n    assert(data);\n\n    nvinfer1::ILayer* conv0 = Conv(network, weightMap, *data, get_width(64, gw, max_channels), \"model.0\", 3, 2);\n    nvinfer1::ILayer* conv1 =\n            Conv(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), \"model.1\", 3, 2, 1, 2);\n\n    bool c3k2 = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k2 = true;\n    }\n    float mlp_ratio = 2.0;\n    bool residual = true;\n    if (type == \"l\" || type == \"x\") {\n        
// mlp_ratio = 1.5;  // use 1.5 with the official pretrained model\n        mlp_ratio = 1;  // use 1 with a model you trained yourself\n        // residual = true;\n    }\n\n    nvinfer1::ILayer* conv2 = C3K2(network, weightMap, *conv1->getOutput(0), get_width(256, gw, max_channels),\n                                   get_depth(2, gd), \"model.2\", c3k2, 0.25);\n    nvinfer1::ILayer* conv3 =\n            Conv(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), \"model.3\", 3, 2, 1, 4);\n    nvinfer1::ILayer* conv4 = C3K2(network, weightMap, *conv3->getOutput(0), get_width(512, gw, max_channels),\n                                   get_depth(2, gd), \"model.4\", c3k2, 0.25);\n    nvinfer1::ILayer* conv5 =\n            Conv(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), \"model.5\", 3, 2);\n    nvinfer1::ILayer* conv6 = A2C2f(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                    get_depth(4, gd), \"model.6\", true, 1, residual, mlp_ratio);\n    nvinfer1::ILayer* conv7 =\n            Conv(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), \"model.7\", 3, 2);\n    nvinfer1::ILayer* conv8 = A2C2f(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                                    get_depth(4, gd), \"model.8\", true, 1, residual, mlp_ratio);\n\n    nvinfer1::ILayer* conv_class = Conv(network, weightMap, *conv8->getOutput(0), 1280, \"model.9.conv\");\n    nvinfer1::Dims dim = conv_class->getOutput(0)->getDimensions();\n    assert(dim.nbDims == 4);\n    nvinfer1::IPoolingLayer* pool2 = network->addPoolingNd(*conv_class->getOutput(0), nvinfer1::PoolingType::kAVERAGE,\n                                                           nvinfer1::DimsHW{dim.d[2], dim.d[3]});\n\n    nvinfer1::IShuffleLayer* shuffle_0 = network->addShuffle(*pool2->getOutput(0));\n    
shuffle_0->setReshapeDimensions(nvinfer1::Dims2{kBatchSize, 1280});\n    auto linear_weight = weightMap[\"model.9.linear.weight\"];\n    auto constant_weight = network->addConstant(nvinfer1::Dims2{kClsNumClass, 1280}, linear_weight);\n    auto constant_bias =\n            network->addConstant(nvinfer1::Dims2{kBatchSize, kClsNumClass}, weightMap[\"model.9.linear.bias\"]);\n    auto linear_matrix_multipy =\n            network->addMatrixMultiply(*shuffle_0->getOutput(0), nvinfer1::MatrixOperation::kNONE,\n                                       *constant_weight->getOutput(0), nvinfer1::MatrixOperation::kTRANSPOSE);\n    auto yolo = network->addElementWise(*linear_matrix_multipy->getOutput(0), *constant_bias->getOutput(0),\n                                        nvinfer1::ElementWiseOperation::kSUM);\n    assert(yolo);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    // Set the maximum batch size and workspace size\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n    // Configuration according to the precision mode being used\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(kBatchSize, kClsInputW, kClsInputH, kInputQuantizationFolder,\n                                                  \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    // Begin building the engine; this may take a while\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Cleanup the network definition and allocated weights\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov12Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             int& max_channels, std::string& type) {\n\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    // =====================   input   ===================================================\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    // =====================   backbone   ===================================================\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 = 
convBnSiLU(network, weightMap, *conv0->getOutput(0),\n                                                    get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\", 1, 2);\n\n    bool c3k2 = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k2 = true;\n    }\n    float mlp_ratio = 2.0;\n    bool residual = false;\n    if (type == \"l\" || type == \"x\") {\n        mlp_ratio = 1.5;  // see the yolov12-seg/ultralytics/nn/tasks.py/parse_model()\n        residual = true;\n    }\n    /*   nvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition * network,\n                                      std::map<std::string, nvinfer1::Weights> & weightMap, nvinfer1::ITensor & input,\n                                      int c2, int n, std::string lname, bool c3k, float e, int g, bool shortcut)*/\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *conv1->getOutput(0), get_width(256, gw, max_channels), get_depth(2, gd),\n                 \"model.2\", c3k2, 0.25);\n\n    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                    get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\", 1, 4);\n    nvinfer1::IElementWiseLayer* conv4 =\n            C3K2(network, weightMap, *conv3->getOutput(0), get_width(512, gw, max_channels), get_depth(2, gd),\n                 \"model.4\", c3k2, 0.25);\n    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0),\n                                                    get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n\n    /*nvinfer1::ILayer* A2C2f(nvinfer1::INetworkDefinition * network, std::map<std::string, nvinfer1::Weights> weightMap,\n                            nvinfer1::ITensor & input, int c2, int n, std::string lname, bool a2, int area,\n                            bool residual, float mlp_ratio, float e, int g, bool shortcut)*/\n    nvinfer1::ILayer* 
conv6 = A2C2f(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                    get_depth(4, gd), \"model.6\", true, 4, residual, mlp_ratio);\n\n    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0),\n                                                    get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n    nvinfer1::ILayer* conv8 = A2C2f(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                                    get_depth(4, gd), \"model.8\", true, 1, residual, mlp_ratio);\n\n    // =========================  neck ====================================================================\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n\n    nvinfer1::IResizeLayer* upsample9 = network->addResize(*conv8->getOutput(0));\n    upsample9->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample9->setScales(scale, 4);\n    nvinfer1::ITensor* inputTensors10[] = {upsample9->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat10 = network->addConcatenation(inputTensors10, 2);\n    /*nvinfer1::ILayer* A2C2f(nvinfer1::INetworkDefinition * network, std::map<std::string, nvinfer1::Weights> weightMap,\n                            nvinfer1::ITensor & input, int c2, std::string lname, int n, bool a2, int area,\n                            bool residual, float mlp_ratio, float e, int g, bool shortcut)*/\n    nvinfer1::ILayer* conv11 = A2C2f(network, weightMap, *cat10->getOutput(0), get_width(512, gw, max_channels),\n                                     get_depth(2, gd), \"model.11\", false, -1, residual, mlp_ratio);\n\n    nvinfer1::IResizeLayer* upsample12 = network->addResize(*conv11->getOutput(0));\n    upsample12->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample12->setScales(scale, 4);\n    nvinfer1::ITensor* inputTensors13[] = {upsample12->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat13 = 
network->addConcatenation(inputTensors13, 2);\n    nvinfer1::ILayer* conv14 = A2C2f(network, weightMap, *cat13->getOutput(0), get_width(256, gw, max_channels),\n                                     get_depth(2, gd), \"model.14\", false, -1, residual, mlp_ratio);\n\n    nvinfer1::IElementWiseLayer* conv15 = convBnSiLU(network, weightMap, *conv14->getOutput(0),\n                                                     get_width(256, gw, max_channels), {3, 3}, 2, \"model.15\");\n    nvinfer1::ITensor* inputTensors16[] = {conv15->getOutput(0), conv11->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat16 = network->addConcatenation(inputTensors16, 2);\n    nvinfer1::ILayer* conv17 = A2C2f(network, weightMap, *cat16->getOutput(0), get_width(512, gw, max_channels),\n                                     get_depth(2, gd), \"model.17\", false, -1, residual, mlp_ratio);\n\n    nvinfer1::IElementWiseLayer* conv18 = convBnSiLU(network, weightMap, *conv17->getOutput(0),\n                                                     get_width(512, gw, max_channels), {3, 3}, 2, \"model.18\");\n    nvinfer1::ITensor* inputTensors19[] = {conv18->getOutput(0), conv8->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat19 = network->addConcatenation(inputTensors19, 2);\n    nvinfer1::IElementWiseLayer* conv20 = C3K2(network, weightMap, *cat19->getOutput(0),\n                                               get_width(1024, gw, max_channels), get_depth(2, gd), \"model.20\", true);\n\n    // =============================== output ===================================================================\n    int c2 = std::max(std::max(16, get_width(256, gw, max_channels) / 4), 16 * 4);\n    int c3 = std::max(get_width(256, gw, max_channels), std::min(kNumClass, 100));\n\n    // output0   location\n    nvinfer1::IElementWiseLayer* conv21_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv14->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv2_0_1 =\n   
         convBnSiLU(network, weightMap, *conv21_cv2_0_0->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv2_0_2 =\n            network->addConvolutionNd(*conv21_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv2.0.2.weight\"], weightMap[\"model.21.cv2.0.2.bias\"]);\n    conv21_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    // output0 classes\n    auto* conv21_cv3_0_0_0 = DWConv(network, weightMap, *conv14->getOutput(0), get_width(256, gw, max_channels), {3, 3},\n                                    1, \"model.21.cv3.0.0.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_0_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_0_0_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.0.0.1\");\n\n    auto* conv21_cv3_0_1_0 =\n            DWConv(network, weightMap, *conv21_cv3_0_0_1->getOutput(0), c3, {3, 3}, 1, \"model.21.cv3.0.1.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_0_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_0_1_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.0.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv3_0_1_2 =\n            network->addConvolutionNd(*conv21_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv3.0.2.weight\"], weightMap[\"model.21.cv3.0.2.bias\"]);\n    conv21_cv3_0_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv21_cv3_0_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n\n    nvinfer1::ITensor* inputTensors21_0[] = {conv21_cv2_0_2->getOutput(0), conv21_cv3_0_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21_0 = network->addConcatenation(inputTensors21_0, 2);\n\n    // out1 location\n    nvinfer1::IElementWiseLayer* conv21_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv17->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.1.0\");\n    
nvinfer1::IElementWiseLayer* conv21_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv2_1_0->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv2_1_2 =\n            network->addConvolutionNd(*conv21_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv2.1.2.weight\"], weightMap[\"model.21.cv2.1.2.bias\"]);\n    conv21_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    // out1 classes\n    auto* conv21_cv3_1_0_0 = DWConv(network, weightMap, *conv17->getOutput(0), get_width(512, gw, max_channels), {3, 3},\n                                    1, \"model.21.cv3.1.0.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_1_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_1_0_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.1.0.1\");\n    auto* conv21_cv3_1_1_0 =\n            DWConv(network, weightMap, *conv21_cv3_1_0_1->getOutput(0), c3, {3, 3}, 1, \"model.21.cv3.1.1.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_1_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_1_1_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.1.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv3_1_1_2 =\n            network->addConvolutionNd(*conv21_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv3.1.2.weight\"], weightMap[\"model.21.cv3.1.2.bias\"]);\n    conv21_cv3_1_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv21_cv3_1_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n\n    nvinfer1::ITensor* inputTensors21_1[] = {conv21_cv2_1_2->getOutput(0), conv21_cv3_1_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21_1 = network->addConcatenation(inputTensors21_1, 2);\n\n    // out2 location\n    nvinfer1::IElementWiseLayer* conv21_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv20->getOutput(0), c2, {3, 3}, 1, 
\"model.21.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv21_cv2_2_0->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv2_2_2 =\n            network->addConvolutionNd(*conv21_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv2.2.2.weight\"], weightMap[\"model.21.cv2.2.2.bias\"]);\n    conv21_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    // out2 classes\n    auto* conv21_cv3_2_0_0 = DWConv(network, weightMap, *conv20->getOutput(0), get_width(1024, gw, max_channels),\n                                    {3, 3}, 1, \"model.21.cv3.2.0.0\");\n    // Feed the depthwise conv output forward, matching the out0 and out1 class branches.\n    nvinfer1::IElementWiseLayer* conv21_cv3_2_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_2_0_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.2.0.1\");\n    auto* conv21_cv3_2_1_0 =\n            DWConv(network, weightMap, *conv21_cv3_2_0_1->getOutput(0), c3, {3, 3}, 1, \"model.21.cv3.2.1.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_2_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_2_1_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.2.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv3_2_1_2 =\n            network->addConvolutionNd(*conv21_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv3.2.2.weight\"], weightMap[\"model.21.cv3.2.2.bias\"]);\n    conv21_cv3_2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv3_2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    nvinfer1::ITensor* inputTensor21_2[] = {conv21_cv2_2_2->getOutput(0), conv21_cv3_2_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21_2 = network->addConcatenation(inputTensor21_2, 2);\n\n    // ============================================ yolov12  detect =========================================\n    
nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle21_0 = network->addShuffle(*cat21_0->getOutput(0));\n    shuffle21_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split21_0_0 = network->addSlice(\n            *shuffle21_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split21_0_1 =\n            network->addSlice(*shuffle21_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl21_0 =\n            DFL(network, weightMap, *split21_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.21.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_0[] = {dfl21_0->getOutput(0), split21_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_0 = network->addConcatenation(inputTensor22_dfl_0, 2);\n    cat22_dfl_0->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle21_1 = network->addShuffle(*cat21_1->getOutput(0));\n    shuffle21_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split21_1_0 = network->addSlice(\n            *shuffle21_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, 
nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split21_1_1 =\n            network->addSlice(*shuffle21_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl21_1 =\n            DFL(network, weightMap, *split21_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.21.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_1[] = {dfl21_1->getOutput(0), split21_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_1 = network->addConcatenation(inputTensor22_dfl_1, 2);\n    cat22_dfl_1->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle21_2 = network->addShuffle(*cat21_2->getOutput(0));\n    shuffle21_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split21_2_0 = network->addSlice(\n            *shuffle21_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split21_2_1 =\n            network->addSlice(*shuffle21_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl21_2 =\n            DFL(network, weightMap, *split21_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.21.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_2[] = {dfl21_2->getOutput(0), split21_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_2 = network->addConcatenation(inputTensor22_dfl_2, 2);\n    
cat22_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat22_dfl_0, cat22_dfl_1, cat22_dfl_2},\n                         strides, stridesLength, true, false, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 64 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(kBatchSize, kInputW, kInputH, kInputQuantizationFolder,\n                                                  \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov12Seg(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             int& max_channels, std::string& type) {\n\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    // 
=====================   input   ===================================================\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    // =====================   backbone   ===================================================\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), {3, 3}, 2, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, *conv0->getOutput(0),\n                                                    get_width(128, gw, max_channels), {3, 3}, 2, \"model.1\", 1, 2);\n\n    bool c3k2 = false;\n    if (type == \"m\" || type == \"l\" || type == \"x\") {\n        c3k2 = true;\n    }\n    float mlp_ratio = 2.0;\n    bool residual = true;\n    if (type == \"l\" || type == \"x\") {\n        mlp_ratio = 1;  // see the yolov12-seg/ultralytics/nn/tasks.py/parse_model()\n        // residual = true;\n    }\n    nvinfer1::IElementWiseLayer* conv2 =\n            C3K2(network, weightMap, *conv1->getOutput(0), get_width(256, gw, max_channels), get_depth(2, gd),\n                 \"model.2\", c3k2, 0.25);\n\n    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                    get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\", 1, 4);\n    nvinfer1::IElementWiseLayer* conv4 =\n            C3K2(network, weightMap, *conv3->getOutput(0), get_width(512, gw, max_channels), get_depth(2, gd),\n                 \"model.4\", c3k2, 0.25);\n    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0),\n                                                    get_width(512, gw, max_channels), {3, 3}, 2, \"model.5\");\n    nvinfer1::ILayer* conv6 = A2C2f(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                    
get_depth(4, gd), \"model.6\", true, 4, residual, mlp_ratio);\n    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0),\n                                                    get_width(1024, gw, max_channels), {3, 3}, 2, \"model.7\");\n    nvinfer1::ILayer* conv8 = A2C2f(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                                    get_depth(4, gd), \"model.8\", true, 1, residual, mlp_ratio);\n\n    // =========================  neck ====================================================================\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n\n    nvinfer1::IResizeLayer* upsample9 = network->addResize(*conv8->getOutput(0));\n    upsample9->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample9->setScales(scale, 4);\n    nvinfer1::ITensor* inputTensors10[] = {upsample9->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat10 = network->addConcatenation(inputTensors10, 2);\n    nvinfer1::ILayer* conv11 = A2C2f(network, weightMap, *cat10->getOutput(0), get_width(512, gw, max_channels),\n                                     get_depth(2, gd), \"model.11\", false, -1, residual, mlp_ratio);\n\n    nvinfer1::IResizeLayer* upsample12 = network->addResize(*conv11->getOutput(0));\n    upsample12->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample12->setScales(scale, 4);\n    nvinfer1::ITensor* inputTensors13[] = {upsample12->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat13 = network->addConcatenation(inputTensors13, 2);\n    nvinfer1::ILayer* conv14 = A2C2f(network, weightMap, *cat13->getOutput(0), get_width(256, gw, max_channels),\n                                     get_depth(2, gd), \"model.14\", false, -1, residual, mlp_ratio);\n\n    nvinfer1::IElementWiseLayer* conv15 = convBnSiLU(network, weightMap, *conv14->getOutput(0),\n                                                     get_width(256, gw, max_channels), {3, 
3}, 2, \"model.15\");\n    nvinfer1::ITensor* inputTensors16[] = {conv15->getOutput(0), conv11->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat16 = network->addConcatenation(inputTensors16, 2);\n    nvinfer1::ILayer* conv17 = A2C2f(network, weightMap, *cat16->getOutput(0), get_width(512, gw, max_channels),\n                                     get_depth(2, gd), \"model.17\", false, -1, residual, mlp_ratio);\n\n    nvinfer1::IElementWiseLayer* conv18 = convBnSiLU(network, weightMap, *conv17->getOutput(0),\n                                                     get_width(512, gw, max_channels), {3, 3}, 2, \"model.18\");\n    nvinfer1::ITensor* inputTensors19[] = {conv18->getOutput(0), conv8->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat19 = network->addConcatenation(inputTensors19, 2);\n    nvinfer1::IElementWiseLayer* conv20 = C3K2(network, weightMap, *cat19->getOutput(0),\n                                               get_width(1024, gw, max_channels), get_depth(2, gd), \"model.20\", true);\n\n    // =============================== output ===================================================================\n    int c2 = std::max(std::max(16, get_width(256, gw, max_channels) / 4), 16 * 4);\n    int c3 = std::max(get_width(256, gw, max_channels), std::min(kNumClass, 100));\n\n    // output0   location\n    nvinfer1::IElementWiseLayer* conv21_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv14->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv2_0_0->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv2_0_2 =\n            network->addConvolutionNd(*conv21_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv2.0.2.weight\"], weightMap[\"model.21.cv2.0.2.bias\"]);\n    conv21_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    
conv21_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    // output0 classes\n    auto* conv21_cv3_0_0_0 = DWConv(network, weightMap, *conv14->getOutput(0), get_width(256, gw, max_channels), {3, 3},\n                                    1, \"model.21.cv3.0.0.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_0_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_0_0_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.0.0.1\");\n\n    auto* conv21_cv3_0_1_0 =\n            DWConv(network, weightMap, *conv21_cv3_0_0_1->getOutput(0), c3, {3, 3}, 1, \"model.21.cv3.0.1.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_0_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_0_1_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.0.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv3_0_1_2 =\n            network->addConvolutionNd(*conv21_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv3.0.2.weight\"], weightMap[\"model.21.cv3.0.2.bias\"]);\n    conv21_cv3_0_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv21_cv3_0_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n\n    nvinfer1::ITensor* inputTensors21_0[] = {conv21_cv2_0_2->getOutput(0), conv21_cv3_0_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21_0 = network->addConcatenation(inputTensors21_0, 2);\n\n    // out1 location\n    nvinfer1::IElementWiseLayer* conv21_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv17->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.1.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv2_1_0->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv2_1_2 =\n            network->addConvolutionNd(*conv21_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv2.1.2.weight\"], weightMap[\"model.21.cv2.1.2.bias\"]);\n    
conv21_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    // out1 classes\n    auto* conv21_cv3_1_0_0 = DWConv(network, weightMap, *conv17->getOutput(0), get_width(512, gw, max_channels), {3, 3},\n                                    1, \"model.21.cv3.1.0.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_1_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_1_0_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.1.0.1\");\n    auto* conv21_cv3_1_1_0 =\n            DWConv(network, weightMap, *conv21_cv3_1_0_1->getOutput(0), c3, {3, 3}, 1, \"model.21.cv3.1.1.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_1_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_1_1_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.1.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv3_1_1_2 =\n            network->addConvolutionNd(*conv21_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv3.1.2.weight\"], weightMap[\"model.21.cv3.1.2.bias\"]);\n    conv21_cv3_1_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv21_cv3_1_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n\n    nvinfer1::ITensor* inputTensors21_1[] = {conv21_cv2_1_2->getOutput(0), conv21_cv3_1_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21_1 = network->addConcatenation(inputTensors21_1, 2);\n\n    // out2 location\n    nvinfer1::IElementWiseLayer* conv21_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv20->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv21_cv2_2_0->getOutput(0), c2, {3, 3}, 1, \"model.21.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv2_2_2 =\n            network->addConvolutionNd(*conv21_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv2.2.2.weight\"], 
weightMap[\"model.21.cv2.2.2.bias\"]);\n    conv21_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    // out2 classes\n    auto* conv21_cv3_2_0_0 = DWConv(network, weightMap, *conv20->getOutput(0), get_width(1024, gw, max_channels),\n                                    {3, 3}, 1, \"model.21.cv3.2.0.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_2_0_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_2_0_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.2.0.1\");\n    auto* conv21_cv3_2_1_0 =\n            DWConv(network, weightMap, *conv21_cv3_2_0_1->getOutput(0), c3, {3, 3}, 1, \"model.21.cv3.2.1.0\");\n    nvinfer1::IElementWiseLayer* conv21_cv3_2_1_1 =\n            convBnSiLU(network, weightMap, *conv21_cv3_2_1_0->getOutput(0), c3, {1, 1}, 1, \"model.21.cv3.2.1.1\");\n    nvinfer1::IConvolutionLayer* conv21_cv3_2_1_2 =\n            network->addConvolutionNd(*conv21_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.21.cv3.2.2.weight\"], weightMap[\"model.21.cv3.2.2.bias\"]);\n    conv21_cv3_2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv21_cv3_2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    nvinfer1::ITensor* inputTensor21_2[] = {conv21_cv2_2_2->getOutput(0), conv21_cv3_2_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat21_2 = network->addConcatenation(inputTensor21_2, 2);\n\n    // ============================================ yolov12  detect =========================================\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle21_0 = network->addShuffle(*cat21_0->getOutput(0));\n    shuffle21_0->setReshapeDimensions(\n          
  nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split21_0_0 = network->addSlice(\n            *shuffle21_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split21_0_1 =\n            network->addSlice(*shuffle21_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl21_0 =\n            DFL(network, weightMap, *split21_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.21.dfl.conv.weight\");\n    auto proto_coef_0 = cv4_conv_combined(network, weightMap, *conv14->getOutput(0), \"model.21.cv4.0\",\n                                          (kInputH / strides[0]) * (kInputW / strides[0]), gw, \"seg\", max_channels);\n    nvinfer1::ITensor* inputTensor22_dfl_0[] = {dfl21_0->getOutput(0), split21_0_1->getOutput(0),\n                                                proto_coef_0->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_0 = network->addConcatenation(inputTensor22_dfl_0, 3);\n    cat22_dfl_0->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle21_1 = network->addShuffle(*cat21_1->getOutput(0));\n    shuffle21_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split21_1_0 = network->addSlice(\n            *shuffle21_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split21_1_1 =\n            network->addSlice(*shuffle21_1->getOutput(0), 
nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl21_1 =\n            DFL(network, weightMap, *split21_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.21.dfl.conv.weight\");\n    auto proto_coef_1 = cv4_conv_combined(network, weightMap, *conv17->getOutput(0), \"model.21.cv4.1\",\n                                          (kInputH / strides[1]) * (kInputW / strides[1]), gw, \"seg\", max_channels);\n    nvinfer1::ITensor* inputTensor22_dfl_1[] = {dfl21_1->getOutput(0), split21_1_1->getOutput(0),\n                                                proto_coef_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_1 = network->addConcatenation(inputTensor22_dfl_1, 3);\n    cat22_dfl_1->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle21_2 = network->addShuffle(*cat21_2->getOutput(0));\n    shuffle21_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split21_2_0 = network->addSlice(\n            *shuffle21_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split21_2_1 =\n            network->addSlice(*shuffle21_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl21_2 =\n            DFL(network, weightMap, *split21_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.21.dfl.conv.weight\");\n    auto proto_coef_2 = 
cv4_conv_combined(network, weightMap, *conv20->getOutput(0), \"model.21.cv4.2\",\n                                          (kInputH / strides[2]) * (kInputW / strides[2]), gw, \"seg\", max_channels);\n    nvinfer1::ITensor* inputTensor22_dfl_2[] = {dfl21_2->getOutput(0), split21_2_1->getOutput(0),\n                                                proto_coef_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_2 = network->addConcatenation(inputTensor22_dfl_2, 3);\n    cat22_dfl_2->setAxis(1);\n\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat22_dfl_0, cat22_dfl_1, cat22_dfl_2},\n                         strides, stridesLength, true, false, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    auto proto = Proto(network, weightMap, *conv14->getOutput(0), \"model.21.proto\", gw, max_channels);\n    proto->getOutput(0)->setName(kProtoTensorName);\n    network->markOutput(*proto->getOutput(0));\n\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(kBatchSize, kInputW, kInputH, kInputQuantizationFolder,\n                                                  \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n"
  },
  {
    "path": "yolov12-tubro/src/postprocess.cpp",
    "content": "#include \"postprocess.h\"\n#include \"utils.h\"\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\ncv::Rect get_rect_obb(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kObbInputW / (img.cols * 1.0);\n    float r_h = kObbInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kObbInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kObbInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kObbInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kObbInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - 
int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\ncv::Rect get_rect_adapt_landmark(cv::Mat& img, float bbox[4], float lmk[kNumberOfPoints * 3]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] / r_w;\n        r = bbox[2] / r_w;\n        t = (bbox[1] - (kInputH - r_w * img.rows) / 2) / r_w;\n        b = (bbox[3] - (kInputH - r_w * img.rows) / 2) / r_w;\n        for (int i = 0; i < kNumberOfPoints * 3; i += 3) {\n            lmk[i] /= r_w;\n            lmk[i + 1] = (lmk[i + 1] - (kInputH - r_w * img.rows) / 2) / r_w;\n            // lmk[i + 2]\n        }\n    } else {\n        l = (bbox[0] - (kInputW - r_h * img.cols) / 2) / r_h;\n        r = (bbox[2] - (kInputW - r_h * img.cols) / 2) / r_h;\n        t = bbox[1] / r_h;\n        b = bbox[3] / r_h;\n        for (int i = 0; i < kNumberOfPoints * 3; i += 3) {\n            lmk[i] = (lmk[i] - (kInputW - r_h * img.cols) / 2) / r_h;\n            lmk[i + 1] /= r_h;\n            // lmk[i + 2]\n        }\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\nstatic float iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n            (std::max)(lbox[0], rbox[0]),\n            (std::min)(lbox[2], rbox[2]),\n            (std::max)(lbox[1], rbox[1]),\n            (std::min)(lbox[3], rbox[3]),\n    };\n\n    if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS = (interBox[1] - interBox[0]) * (interBox[3] - interBox[2]);\n    float unionBoxS = (lbox[2] - lbox[0]) * (lbox[3] - lbox[1]) + (rbox[2] - rbox[0]) * (rbox[3] - rbox[1]) - interBoxS;\n    return interBoxS / 
unionBoxS;\n}\n\nstatic bool cmp(const Detection& a, const Detection& b) {\n    if (a.conf == b.conf) {\n        return a.bbox[0] < b.bbox[0];\n    }\n    return a.conf > b.conf;\n}\n\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n        if (output[1 + det_size * i + 4] <= conf_thresh || isnan(output[1 + det_size * i + 4]))\n            continue;\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        if (m.count(det.class_id) == 0)\n            m.emplace(det.class_id, std::vector<Detection>());\n        m[det.class_id].push_back(det);\n    }\n    // Greedy per-class NMS: keep the best remaining box, then drop lower-ranked boxes that overlap it.\n    for (auto it = m.begin(); it != m.end(); it++) {\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t i = 0; i < dets.size(); ++i) {\n            auto& item = dets[i];\n            res.push_back(item);\n            for (size_t j = i + 1; j < dets.size(); ++j) {\n                if (iou(item.bbox, dets[j].bbox) > nms_thresh) {\n                    dets.erase(dets.begin() + j);\n                    --j;\n                }\n            }\n        }\n    }\n}\n\nvoid batch_nms(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        nms(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n    }\n}\n\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count) {\n    Detection det;\n    for (int i = 0; i < count; i++) {\n        int basic_pos = 1 + i * bbox_element;\n        int keep_flag = decode_ptr_host[basic_pos + 6];\n        if 
(keep_flag == 1) {\n            det.bbox[0] = decode_ptr_host[basic_pos + 0];\n            det.bbox[1] = decode_ptr_host[basic_pos + 1];\n            det.bbox[2] = decode_ptr_host[basic_pos + 2];\n            det.bbox[3] = decode_ptr_host[basic_pos + 3];\n            det.conf = decode_ptr_host[basic_pos + 4];\n            det.class_id = decode_ptr_host[basic_pos + 5];\n            res.push_back(det);\n        }\n    }\n}\n\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch) {\n    res_batch.resize(batch_size);\n    int count = static_cast<int>(*decode_ptr_host);\n    count = std::min(count, kMaxNumOutputBbox);\n    for (int i = 0; i < batch_size; i++) {\n        auto& img = const_cast<cv::Mat&>(img_batch[i]);\n        process_decode_ptr_host(res_batch[i], &decode_ptr_host[i * count], bbox_element, img, count);\n    }\n}\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect(img, res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n    }\n}\n\nvoid draw_bbox_keypoints_line(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    const std::vector<std::pair<int, int>> skeleton_pairs = {\n            {0, 1}, {0, 2},  {0, 5}, {0, 6},  {1, 2},   {1, 3},   {2, 4},   {5, 6},   {5, 7},  {5, 11},\n            {6, 8}, {6, 12}, {7, 9}, {8, 10}, {11, 12}, {11, 13}, {12, 14}, {13, 15}, {14, 16}};\n\n    for (size_t i = 0; i < 
img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect_adapt_landmark(img, res[j].bbox, res[j].keypoints);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n\n            for (int k = 0; k < kNumberOfPoints * 3; k += 3) {\n                if (res[j].keypoints[k + 2] > 0.5) {\n                    cv::circle(img, cv::Point((int)res[j].keypoints[k], (int)res[j].keypoints[k + 1]), 3,\n                               cv::Scalar(0, 0x27, 0xC1), -1);\n                }\n            }\n\n            for (const auto& bone : skeleton_pairs) {\n                int kp1_idx = bone.first * 3;\n                int kp2_idx = bone.second * 3;\n                if (res[j].keypoints[kp1_idx + 2] > 0.5 && res[j].keypoints[kp2_idx + 2] > 0.5) {\n                    cv::Point p1((int)res[j].keypoints[kp1_idx], (int)res[j].keypoints[kp1_idx + 1]);\n                    cv::Point p2((int)res[j].keypoints[kp2_idx], (int)res[j].keypoints[kp2_idx + 1]);\n                    cv::line(img, p1, p2, cv::Scalar(0, 0x27, 0xC1), 2);\n                }\n            }\n        }\n    }\n}\n\ncv::Mat scale_mask(cv::Mat mask, cv::Mat img) {\n    int x, y, w, h;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = kInputW;\n        h = r_w * img.rows;\n        x = 0;\n        y = (kInputH - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = kInputH;\n        x = (kInputW - w) / 2;\n        y = 0;\n    }\n    cv::Rect r(x, y, w, h);\n    cv::Mat res;\n    cv::resize(mask(r), res, img.size());\n    return res;\n}\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& 
masks,\n                    std::unordered_map<int, std::string>& labels_map) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A, 0x92CC17,\n                                           0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < dets.size(); i++) {\n        cv::Mat img_mask = scale_mask(masks[i], img);\n        auto color = colors[(int)dets[i].class_id % colors.size()];\n        auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n\n        cv::Rect r = get_rect(img, dets[i].bbox);\n        for (int x = r.x; x < r.x + r.width; x++) {\n            for (int y = r.y; y < r.y + r.height; y++) {\n                float val = img_mask.at<float>(y, x);\n                if (val <= 0.5)\n                    continue;\n                img.at<cv::Vec3b>(y, x)[0] = img.at<cv::Vec3b>(y, x)[0] / 2 + bgr[0] / 2;\n                img.at<cv::Vec3b>(y, x)[1] = img.at<cv::Vec3b>(y, x)[1] / 2 + bgr[1] / 2;\n                img.at<cv::Vec3b>(y, x)[2] = img.at<cv::Vec3b>(y, x)[2] / 2 + bgr[2] / 2;\n            }\n        }\n\n        cv::rectangle(img, r, bgr, 2);\n\n        // Get the size of the text\n        cv::Size textSize =\n                cv::getTextSize(labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                                cv::FONT_HERSHEY_PLAIN, 1.2, 2, NULL);\n        // Set the top left corner of the rectangle\n        cv::Point topLeft(r.x, r.y - textSize.height);\n\n        // Set the bottom right corner of the rectangle\n        cv::Point bottomRight(r.x + textSize.width, r.y + textSize.height);\n\n        // Set the thickness of the rectangle lines\n        int lineThickness = 2;\n\n        // Draw the rectangle on the image\n        cv::rectangle(img, topLeft, bottomRight, bgr, -1);\n\n        
cv::putText(img, labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                    cv::Point(r.x, r.y + 4), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar::all(0xFF), 2);\n    }\n}\n\nvoid process_decode_ptr_host_obb(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element,\n                                 cv::Mat& img, int count) {\n    Detection det;\n    for (int i = 0; i < count; i++) {\n        int basic_pos = 1 + i * bbox_element;\n        int keep_flag = decode_ptr_host[basic_pos + 6];\n        if (keep_flag == 1) {\n            det.bbox[0] = decode_ptr_host[basic_pos + 0];\n            det.bbox[1] = decode_ptr_host[basic_pos + 1];\n            det.bbox[2] = decode_ptr_host[basic_pos + 2];\n            det.bbox[3] = decode_ptr_host[basic_pos + 3];\n            det.conf = decode_ptr_host[basic_pos + 4];\n            det.class_id = decode_ptr_host[basic_pos + 5];\n            det.angle = decode_ptr_host[basic_pos + 7];\n            res.push_back(det);\n        }\n    }\n}\n\nvoid batch_process_obb(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                       int bbox_element, const std::vector<cv::Mat>& img_batch) {\n    res_batch.resize(batch_size);\n    int count = static_cast<int>(*decode_ptr_host);\n    count = std::min(count, kMaxNumOutputBbox);\n    for (int i = 0; i < batch_size; i++) {\n        auto& img = const_cast<cv::Mat&>(img_batch[i]);\n        process_decode_ptr_host_obb(res_batch[i], &decode_ptr_host[i * count], bbox_element, img, count);\n    }\n}\n\nstd::tuple<float, float, float> convariance_matrix(Detection res) {\n    float w = res.bbox[2];\n    float h = res.bbox[3];\n\n    float a = w * w / 12.0;\n    float b = h * h / 12.0;\n    float c = res.angle;\n\n    float cos_r = std::cos(c);\n    float sin_r = std::sin(c);\n\n    float cos_r2 = cos_r * cos_r;\n    float sin_r2 = sin_r * sin_r;\n\n    float a_val = a * cos_r2 + b * 
sin_r2;\n    float b_val = a * sin_r2 + b * cos_r2;\n    float c_val = (a - b) * cos_r * sin_r;\n\n    return std::make_tuple(a_val, b_val, c_val);\n}\n\nstatic float probiou(const Detection& res1, const Detection& res2, float eps = 1e-7) {\n    // Calculate the ProbIoU between oriented bounding boxes, https://arxiv.org/pdf/2106.06072v1.pdf.\n    // Initialize each tuple directly from convariance_matrix() instead of from uninitialized floats.\n    std::tuple<float, float, float> matrix1 = convariance_matrix(res1);\n    std::tuple<float, float, float> matrix2 = convariance_matrix(res2);\n    float a1 = std::get<0>(matrix1);\n    float b1 = std::get<1>(matrix1);\n    float c1 = std::get<2>(matrix1);\n    float a2 = std::get<0>(matrix2);\n    float b2 = std::get<1>(matrix2);\n    float c2 = std::get<2>(matrix2);\n\n    float x1 = res1.bbox[0], y1 = res1.bbox[1];\n    float x2 = res2.bbox[0], y2 = res2.bbox[1];\n\n    float t1 = ((a1 + a2) * std::pow(y1 - y2, 2) + (b1 + b2) * std::pow(x1 - x2, 2)) /\n               ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2) + eps);\n    float t2 = ((c1 + c2) * (x2 - x1) * (y1 - y2)) / ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2) + eps);\n    float t3 = std::log(\n            ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2)) /\n                    (4 * std::sqrt(std::max(a1 * b1 - c1 * c1, 0.0f)) * std::sqrt(std::max(a2 * b2 - c2 * c2, 0.0f)) +\n                     eps) +\n            eps);\n\n    float bd = 0.25f * t1 + 0.5f * t2 + 0.5f * t3;\n    bd = std::max(std::min(bd, 100.0f), eps);\n    float hd = std::sqrt(1.0 - std::exp(-bd) + eps);\n\n    return 1 - hd;\n}\n\nvoid nms_obb(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n\n        if (output[1 + det_size * i + 4] <= conf_thresh)\n            continue;\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * 
sizeof(float));\n        if (m.count(det.class_id) == 0)\n            m.emplace(det.class_id, std::vector<Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (probiou(item, dets[n]) >= nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\nvoid batch_nms_obb(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n                   float conf_thresh, float nms_thresh) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        nms_obb(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n    }\n}\n\nstatic std::vector<cv::Point> get_corner(cv::Mat& img, const Detection& box) {\n    float cos_value, sin_value;\n\n    // Calculate center point and width/height\n    float x1 = box.bbox[0];\n    float y1 = box.bbox[1];\n    float w = box.bbox[2];\n    float h = box.bbox[3];\n    float angle = box.angle * 180.0f / CV_PI;  // Convert radians to degrees\n\n    // Print original angle\n    std::cout << \"Original angle: \" << angle << std::endl;\n\n    // Swap width and height if height is greater than or equal to width\n    if (h >= w) {\n        std::swap(w, h);\n        angle = fmod(angle + 90.0f, 180.0f);  // Adjust angle to be within [0, 180)\n    }\n\n    // Ensure the angle is between 0 and 180 degrees\n    if (angle < 0) {\n        angle += 360.0f;  // Convert to positive value\n    }\n    if (angle > 180.0f) {\n        angle -= 180.0f;  // Subtract 180 from angles greater than 180\n    }\n\n    // Print adjusted angle\n    std::cout << 
\"Adjusted angle: \" << angle << std::endl;\n\n    // Normalize the angle into [0, 180); this value is only printed below\n    float normal_angle = fmod(angle, 180.0f);\n    if (normal_angle < 0) {\n        normal_angle += 180.0f;  // Ensure it's a positive value\n    }\n\n    // Print normal angle value\n    std::cout << \"Normal angle: \" << normal_angle << std::endl;\n\n    cos_value = std::cos(angle * CV_PI / 180.0f);  // Convert to radians\n    sin_value = std::sin(angle * CV_PI / 180.0f);\n\n    // Calculate the axis-aligned box boundaries around the center\n    float l = x1 - w / 2;  // Left boundary\n    float r = x1 + w / 2;  // Right boundary\n    float t = y1 - h / 2;  // Top boundary\n    float b = y1 + h / 2;  // Bottom boundary\n\n    // Use get_rect_obb to scale the coordinates\n    float bbox[4] = {l, t, r, b};\n    cv::Rect rect = get_rect_obb(img, bbox);\n\n    float x_ = (rect.x + rect.x + rect.width) / 2;   // Center x\n    float y_ = (rect.y + rect.y + rect.height) / 2;  // Center y\n    float width = rect.width;                        // Width\n    float height = rect.height;                      // Height\n\n    // Calculate each corner point from the two rotated half-axis vectors\n    std::vector<cv::Point> corner_points(4);\n    float vec1x = width / 2 * cos_value;\n    float vec1y = width / 2 * sin_value;\n    float vec2x = -height / 2 * sin_value;\n    float vec2y = height / 2 * cos_value;\n\n    corner_points[0] = cv::Point(int(round(x_ + vec1x + vec2x)), int(round(y_ + vec1y + vec2y)));  // Top-left corner\n    corner_points[1] = cv::Point(int(round(x_ + vec1x - vec2x)), int(round(y_ + vec1y - vec2y)));  // Top-right corner\n    corner_points[2] =\n            cv::Point(int(round(x_ - vec1x - vec2x)), int(round(y_ - vec1y - vec2y)));  // Bottom-right corner\n    corner_points[3] = cv::Point(int(round(x_ - vec1x + vec2x)), int(round(y_ - vec1y + vec2y)));  // Bottom-left corner\n\n    // Clamp corner points so they stay within the image boundaries\n    for (auto& point : corner_points) {\n        point.x = 
std::max(0, std::min(point.x, img.cols - 1));\n        point.y = std::max(0, std::min(point.y, img.rows - 1));\n    }\n\n    return corner_points;\n}\n\nvoid draw_bbox_obb(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A, 0x92CC17,\n                                           0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        auto& img = img_batch[i];\n        for (auto& obj : res) {\n            auto color = colors[(int)obj.class_id % colors.size()];\n            auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n            auto corner_points = get_corner(img, obj);\n            cv::polylines(img, std::vector<std::vector<cv::Point>>{corner_points}, true, bgr, 1);\n\n            auto text = (std::to_string((int)(obj.class_id)) + \":\" + to_string_with_precision(obj.conf));\n            cv::Size textsize = cv::getTextSize(text, 0, 0.3, 1, nullptr);\n\n            int width = textsize.width;\n            int height = textsize.height;\n            bool outside = (corner_points[0].y - height >= 3) ? true : false;\n            cv::Point p1(corner_points[0].x, corner_points[0].y), p2;\n            p2.x = corner_points[0].x + width;\n            if (outside) {\n                p2.y = corner_points[0].y - height - 3;\n            } else {\n                p2.y = corner_points[0].y + height + 3;\n            }\n            cv::rectangle(img, p1, p2, bgr, -1, cv::LINE_AA);\n            cv::putText(\n                    img, text,\n                    cv::Point(corner_points[0].x, (outside ? 
corner_points[0].y - 2 : corner_points[0].y + height + 2)),\n                    0, 0.3, cv::Scalar::all(255), 1, cv::LINE_AA);\n        }\n    }\n}\n"
  },
  {
    "path": "yolov12-tubro/src/postprocess.cu",
    "content": "//\n// Created by lindsay on 23-7-17.\n//\n#include \"postprocess.h\"\n#include \"types.h\"\n\nstatic __global__ void decode_kernel_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray,\n                                         int max_objects) {\n    float count = predict[0];\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    if (position >= count)\n        return;\n\n    float* pitem = predict + 1 + position * (sizeof(Detection) / sizeof(float));\n    int index = atomicAdd(parray, 1);\n    if (index >= max_objects)\n        return;\n\n    float confidence = pitem[4];\n\n    if (confidence < confidence_threshold)\n        return;\n    //[center_x center_y w h conf class_id  mask[32] keypoints[51] angle]\n    float cx = pitem[0];\n    float cy = pitem[1];\n    float width = pitem[2];\n    float height = pitem[3];\n    float label = pitem[5];\n    float angle = pitem[89];\n\n    float* pout_item = parray + 1 + index * bbox_element;\n    *pout_item++ = cx;\n    *pout_item++ = cy;\n    *pout_item++ = width;\n    *pout_item++ = height;\n    *pout_item++ = confidence;\n    *pout_item++ = label;\n    *pout_item++ = 1;  // 1 = keep, 0 = ignore\n    *pout_item++ = angle;\n}\n\nstatic __global__ void decode_kernel(float* predict, int num_bboxes, float confidence_threshold, float* parray,\n                                     int max_objects) {\n    float count = predict[0];\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    if (position >= count)\n        return;\n\n    float* pitem = predict + 1 + position * (sizeof(Detection) / sizeof(float));\n    int index = atomicAdd(parray, 1);\n    if (index >= max_objects)\n        return;\n\n    float confidence = pitem[4];\n    if (confidence < confidence_threshold)\n        return;\n\n    float left = pitem[0];\n    float top = pitem[1];\n    float right = pitem[2];\n    float bottom = pitem[3];\n    float label = pitem[5];\n\n    float* pout_item = parray + 1 
+ index * bbox_element;\n    *pout_item++ = left;\n    *pout_item++ = top;\n    *pout_item++ = right;\n    *pout_item++ = bottom;\n    *pout_item++ = confidence;\n    *pout_item++ = label;\n    *pout_item++ = 1;  // 1 = keep, 0 = ignore\n}\n\nstatic __device__ float box_iou(float aleft, float atop, float aright, float abottom, float bleft, float btop,\n                                float bright, float bbottom) {\n    float cleft = max(aleft, bleft);\n    float ctop = max(atop, btop);\n    float cright = min(aright, bright);\n    float cbottom = min(abottom, bbottom);\n    float c_area = max(cright - cleft, 0.0f) * max(cbottom - ctop, 0.0f);\n    if (c_area == 0.0f)\n        return 0.0f;\n\n    float a_area = max(0.0f, aright - aleft) * max(0.0f, abottom - atop);\n    float b_area = max(0.0f, bright - bleft) * max(0.0f, bbottom - btop);\n    return c_area / (a_area + b_area - c_area);\n}\n\nstatic __global__ void nms_kernel(float* bboxes, int max_objects, float threshold) {\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    int count = min(static_cast<int>(bboxes[0]), max_objects);\n    if (position >= count)\n        return;\n\n    float* pcurrent = bboxes + 1 + position * bbox_element;\n    for (int i = 0; i < count; ++i) {\n        float* pitem = bboxes + 1 + i * bbox_element;\n        if (i == position || pcurrent[5] != pitem[5])\n            continue;\n        if (pitem[4] >= pcurrent[4]) {\n            if (pitem[4] == pcurrent[4] && i < position)\n                continue;\n            float iou =\n                    box_iou(pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3], pitem[0], pitem[1], pitem[2], pitem[3]);\n            if (iou > threshold) {\n                pcurrent[6] = 0;\n                return;\n            }\n        }\n    }\n}\n\nstatic __device__ void convariance_matrix(float w, float h, float r, float& a, float& b, float& c) {\n    float a_val = w * w / 12.0f;\n    float b_val = h * h / 12.0f;\n    float cos_r = cosf(r);\n 
   float sin_r = sinf(r);\n\n    a = a_val * cos_r * cos_r + b_val * sin_r * sin_r;\n    b = a_val * sin_r * sin_r + b_val * cos_r * cos_r;\n    c = (a_val - b_val) * sin_r * cos_r;\n}\n\nstatic __device__ float box_probiou(float cx1, float cy1, float w1, float h1, float r1, float cx2, float cy2, float w2,\n                                    float h2, float r2, float eps = 1e-7) {\n\n    // Calculate the prob iou between oriented bounding boxes, https://arxiv.org/pdf/2106.06072v1.pdf.\n    float a1, b1, c1, a2, b2, c2;\n    convariance_matrix(w1, h1, r1, a1, b1, c1);\n    convariance_matrix(w2, h2, r2, a2, b2, c2);\n\n    float t1 = ((a1 + a2) * powf(cy1 - cy2, 2) + (b1 + b2) * powf(cx1 - cx2, 2)) /\n               ((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2) + eps);\n    float t2 = ((c1 + c2) * (cx2 - cx1) * (cy1 - cy2)) / ((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2) + eps);\n    float t3 = logf(((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2)) /\n                            (4 * sqrtf(fmaxf(a1 * b1 - c1 * c1, 0.0f)) * sqrtf(fmaxf(a2 * b2 - c2 * c2, 0.0f)) + eps) +\n                    eps);\n    float bd = 0.25f * t1 + 0.5f * t2 + 0.5f * t3;\n    bd = fmaxf(fminf(bd, 100.0f), eps);\n    float hd = sqrtf(1.0f - expf(-bd) + eps);\n    return 1 - hd;\n}\n\nstatic __global__ void nms_kernel_obb(float* bboxes, int max_objects, float threshold) {\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    int count = min(static_cast<int>(bboxes[0]), max_objects);\n    if (position >= count)\n        return;\n\n    float* pcurrent = bboxes + 1 + position * bbox_element;\n    for (int i = 0; i < count; ++i) {\n        float* pitem = bboxes + 1 + i * bbox_element;\n        if (i == position || pcurrent[5] != pitem[5])\n            continue;\n        if (pitem[4] >= pcurrent[4]) {\n            if (pitem[4] == pcurrent[4] && i < position)\n                continue;\n            float iou = box_probiou(pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3], pcurrent[7], pitem[0], 
pitem[1],\n                                    pitem[2], pitem[3], pitem[7]);\n            if (iou > threshold) {\n                pcurrent[6] = 0;\n                return;\n            }\n        }\n    }\n}\n\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 cudaStream_t stream) {\n    int block = 256;\n    int grid = ceil(num_bboxes / (float)block);\n    decode_kernel<<<grid, block, 0, stream>>>((float*)predict, num_bboxes, confidence_threshold, parray, max_objects);\n}\n\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream) {\n    int block = max_objects < 256 ? max_objects : 256;\n    int grid = ceil(max_objects / (float)block);\n    nms_kernel<<<grid, block, 0, stream>>>(parray, max_objects, nms_threshold);\n}\n\nvoid cuda_decode_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                     cudaStream_t stream) {\n    int block = 256;\n    int grid = ceil(num_bboxes / (float)block);\n    decode_kernel_obb<<<grid, block, 0, stream>>>((float*)predict, num_bboxes, confidence_threshold, parray,\n                                                  max_objects);\n}\n\nvoid cuda_nms_obb(float* parray, float nms_threshold, int max_objects, cudaStream_t stream) {\n    int block = max_objects < 256 ? max_objects : 256;\n    int grid = ceil(max_objects / (float)block);\n    nms_kernel_obb<<<grid, block, 0, stream>>>(parray, max_objects, nms_threshold);\n}\n"
  },
  {
    "path": "yolov12-tubro/src/preprocess.cu",
    "content": "#include \"cuda_utils.h\"\n#include \"preprocess.h\"\n\nstatic uint8_t* img_buffer_host = nullptr;\nstatic uint8_t* img_buffer_device = nullptr;\n\n__global__ void warpaffine_kernel(uint8_t* src, int src_line_size, int src_width, int src_height, float* dst,\n                                  int dst_width, int dst_height, uint8_t const_value_st, AffineMatrix d2s, int edge) {\n    int position = blockDim.x * blockIdx.x + threadIdx.x;\n    if (position >= edge)\n        return;\n\n    float m_x1 = d2s.value[0];\n    float m_y1 = d2s.value[1];\n    float m_z1 = d2s.value[2];\n    float m_x2 = d2s.value[3];\n    float m_y2 = d2s.value[4];\n    float m_z2 = d2s.value[5];\n\n    int dx = position % dst_width;\n    int dy = position / dst_width;\n    float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;\n    float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;\n    float c0, c1, c2;\n\n    if (src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height) {\n        // out of range\n        c0 = const_value_st;\n        c1 = const_value_st;\n        c2 = const_value_st;\n    } else {\n        int y_low = floorf(src_y);\n        int x_low = floorf(src_x);\n        int y_high = y_low + 1;\n        int x_high = x_low + 1;\n\n        uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};\n        float ly = src_y - y_low;\n        float lx = src_x - x_low;\n        float hy = 1 - ly;\n        float hx = 1 - lx;\n        float w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n        uint8_t* v1 = const_value;\n        uint8_t* v2 = const_value;\n        uint8_t* v3 = const_value;\n        uint8_t* v4 = const_value;\n\n        if (y_low >= 0) {\n            if (x_low >= 0)\n                v1 = src + y_low * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v2 = src + y_low * src_line_size + x_high * 3;\n        }\n\n        if (y_high < src_height) {\n            if (x_low >= 0)\n                
v3 = src + y_high * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v4 = src + y_high * src_line_size + x_high * 3;\n        }\n\n        c0 = w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0];\n        c1 = w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1];\n        c2 = w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2];\n    }\n\n    // bgr to rgb\n    float t = c2;\n    c2 = c0;\n    c0 = t;\n\n    // normalization\n    c0 = c0 / 255.0f;\n    c1 = c1 / 255.0f;\n    c2 = c2 / 255.0f;\n\n    // rgbrgbrgb to rrrgggbbb\n    int area = dst_width * dst_height;\n    float* pdst_c0 = dst + dy * dst_width + dx;\n    float* pdst_c1 = pdst_c0 + area;\n    float* pdst_c2 = pdst_c1 + area;\n    *pdst_c0 = c0;\n    *pdst_c1 = c1;\n    *pdst_c2 = c2;\n}\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream) {\n    int img_size = src_width * src_height * 3;\n    // copy data to pinned memory\n    memcpy(img_buffer_host, src, img_size);\n    // copy data to device memory\n    CUDA_CHECK(cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size, cudaMemcpyHostToDevice, stream));\n\n    AffineMatrix s2d, d2s;\n    float scale = std::min(dst_height / (float)src_height, dst_width / (float)src_width);\n\n    s2d.value[0] = scale;\n    s2d.value[1] = 0;\n    s2d.value[2] = -scale * src_width * 0.5 + dst_width * 0.5;\n    s2d.value[3] = 0;\n    s2d.value[4] = scale;\n    s2d.value[5] = -scale * src_height * 0.5 + dst_height * 0.5;\n    cv::Mat m2x3_s2d(2, 3, CV_32F, s2d.value);\n    cv::Mat m2x3_d2s(2, 3, CV_32F, d2s.value);\n    cv::invertAffineTransform(m2x3_s2d, m2x3_d2s);\n\n    memcpy(d2s.value, m2x3_d2s.ptr<float>(0), sizeof(d2s.value));\n\n    int jobs = dst_height * dst_width;\n    int threads = 256;\n    int blocks = ceil(jobs / (float)threads);\n    warpaffine_kernel<<<blocks, threads, 0, stream>>>(img_buffer_device, src_width * 3, 
src_width, src_height, dst,\n                                                      dst_width, dst_height, 128, d2s, jobs);\n}\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream) {\n    int dst_size = dst_width * dst_height * 3;\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        cuda_preprocess(img_batch[i].ptr(), img_batch[i].cols, img_batch[i].rows, &dst[dst_size * i], dst_width,\n                        dst_height, stream);\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n    }\n}\n\nvoid cuda_preprocess_init(int max_image_size) {\n    // prepare input data in pinned memory\n    CUDA_CHECK(cudaMallocHost((void**)&img_buffer_host, max_image_size * 3));\n    // prepare input data in device memory\n    CUDA_CHECK(cudaMalloc((void**)&img_buffer_device, max_image_size * 3));\n}\n\nvoid cuda_preprocess_destroy() {\n    CUDA_CHECK(cudaFree(img_buffer_device));\n    CUDA_CHECK(cudaFreeHost(img_buffer_host));\n}\n"
  },
  {
    "path": "yolov12-tubro/yolov12_cls.cpp",
    "content": "#include \"calibrator.h\"\r\n#include \"config.h\"\r\n#include \"cuda_utils.h\"\r\n#include \"logging.h\"\r\n#include \"model.h\"\r\n#include \"utils.h\"\r\n\r\n#include <chrono>\r\n#include <cmath>\r\n#include <iostream>\r\n#include <numeric>\r\n#include <opencv2/opencv.hpp>\r\n\r\nusing namespace nvinfer1;\r\n\r\nstatic Logger gLogger;\r\nconst static int kOutputSize = kClsNumClass;\r\n\r\nvoid batch_preprocess(std::vector<cv::Mat>& imgs, float* output, int dst_width = 224, int dst_height = 224) {\r\n    for (size_t b = 0; b < imgs.size(); b++) {\r\n        int h = imgs[b].rows;\r\n        int w = imgs[b].cols;\r\n        int m = std::min(h, w);\r\n        int top = (h - m) / 2;\r\n        int left = (w - m) / 2;\r\n        cv::Mat img = imgs[b](cv::Rect(left, top, m, m));\r\n        cv::resize(img, img, cv::Size(dst_width, dst_height), 0, 0, cv::INTER_LINEAR);\r\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\r\n        img.convertTo(img, CV_32F, 1 / 255.0);\r\n\r\n        std::vector<cv::Mat> channels(3);\r\n        cv::split(img, channels);\r\n\r\n        // CHW format\r\n        for (int c = 0; c < 3; ++c) {\r\n            int i = 0;\r\n            for (int row = 0; row < dst_height; ++row) {\r\n                for (int col = 0; col < dst_width; ++col) {\r\n                    output[b * 3 * dst_height * dst_width + c * dst_height * dst_width + i] =\r\n                            channels[c].at<float>(row, col);\r\n                    ++i;\r\n                }\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\nstd::vector<float> softmax(float* prob, int n) {\r\n    std::vector<float> res;\r\n    float sum = 0.0f;\r\n    float t;\r\n    for (int i = 0; i < n; i++) {\r\n        t = expf(prob[i]);\r\n        res.push_back(t);\r\n        sum += t;\r\n    }\r\n    for (int i = 0; i < n; i++) {\r\n        res[i] /= sum;\r\n    }\r\n    return res;\r\n}\r\n\r\nstd::vector<int> topk(const std::vector<float>& vec, int k) {\r\n    std::vector<int> 
topk_index;\r\n    std::vector<size_t> vec_index(vec.size());\r\n    std::iota(vec_index.begin(), vec_index.end(), 0);\r\n\r\n    std::sort(vec_index.begin(), vec_index.end(),\r\n              [&vec](size_t index_1, size_t index_2) { return vec[index_1] > vec[index_2]; });\r\n\r\n    int k_num = std::min<int>(vec.size(), k);\r\n\r\n    for (int i = 0; i < k_num; ++i) {\r\n        topk_index.push_back(vec_index[i]);\r\n    }\r\n\r\n    return topk_index;\r\n}\r\n\r\nstd::vector<std::string> read_classes(std::string file_name) {\r\n    std::vector<std::string> classes;\r\n    std::ifstream ifs(file_name, std::ios::in);\r\n    if (!ifs.is_open()) {\r\n        std::cerr << file_name << \" is not found, pls refer to README and download it.\" << std::endl;\r\n        assert(0);\r\n    }\r\n    std::string s;\r\n    while (std::getline(ifs, s)) {\r\n        classes.push_back(s);\r\n    }\r\n    ifs.close();\r\n    return classes;\r\n}\r\n\r\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, float& gd, float& gw,\r\n                std::string& img_dir, std::string& type, int& max_channels) {\r\n    if (argc < 4)\r\n        return false;\r\n    if (std::string(argv[1]) == \"-s\" && (argc == 5)) {\r\n        wts = std::string(argv[2]);\r\n        engine = std::string(argv[3]);\r\n        auto net = std::string(argv[4]);\r\n        if (net[0] == 'n') {\r\n            gd = 0.50;\r\n            gw = 0.25;\r\n            max_channels = 1024;\r\n            type = \"n\";\r\n        } else if (net[0] == 's') {\r\n            gd = 0.50;\r\n            gw = 0.50;\r\n            max_channels = 1024;\r\n            type = \"s\";\r\n        } else if (net[0] == 'm') {\r\n            gd = 0.50;\r\n            gw = 1.00;\r\n            max_channels = 512;\r\n            type = \"m\";\r\n        } else if (net[0] == 'l') {\r\n            gd = 1.0;\r\n            gw = 1.0;\r\n            max_channels = 512;\r\n            type = \"l\";\r\n        } else if 
(net[0] == 'x') {\r\n            gd = 1.0;\r\n            gw = 1.50;\r\n            max_channels = 512;\r\n            type = \"x\";\r\n        } else {\r\n            return false;\r\n        }\r\n    } else if (std::string(argv[1]) == \"-d\" && argc == 4) {\r\n        engine = std::string(argv[2]);\r\n        img_dir = std::string(argv[3]);\r\n    } else {\r\n        return false;\r\n    }\r\n    return true;\r\n}\r\n\r\nvoid prepare_buffers(ICudaEngine* engine, float** gpu_input_buffer, float** gpu_output_buffer, float** cpu_input_buffer,\r\n                     float** output_buffer_host) {\r\n    assert(engine->getNbBindings() == 2);\r\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\r\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\r\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\r\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\r\n    assert(inputIndex == 0);\r\n    assert(outputIndex == 1);\r\n    // Create GPU buffers on device\r\n    CUDA_CHECK(cudaMalloc((void**)gpu_input_buffer, kBatchSize * 3 * kClsInputH * kClsInputW * sizeof(float)));\r\n    CUDA_CHECK(cudaMalloc((void**)gpu_output_buffer, kBatchSize * kOutputSize * sizeof(float)));\r\n\r\n    *cpu_input_buffer = new float[kBatchSize * 3 * kClsInputH * kClsInputW];\r\n    *output_buffer_host = new float[kBatchSize * kOutputSize];\r\n}\r\n\r\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* input, float* output,\r\n           int batchSize) {\r\n    CUDA_CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 * kClsInputH * kClsInputW * sizeof(float),\r\n                               cudaMemcpyHostToDevice, stream));\r\n    context.enqueueV2(buffers, stream, nullptr);\r\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\r\n                               stream));\r\n    
cudaStreamSynchronize(stream);\r\n}\r\n\r\nvoid serialize_engine(float& gd, float& gw, std::string& wts_name, std::string& engine_name, std::string& type,\r\n                      int max_channels) {\r\n    // Create builder\r\n    IBuilder* builder = createInferBuilder(gLogger);\r\n    IBuilderConfig* config = builder->createBuilderConfig();\r\n    // Create model to populate the network, then set the outputs and create an engine\r\n    IHostMemory* serialized_engine = nullptr;\r\n    serialized_engine = buildEngineYolov12Cls(builder, config, DataType::kFLOAT, wts_name, gd, gw, type, max_channels);\r\n    assert(serialized_engine);\r\n    // Save engine to file\r\n    std::ofstream p(engine_name, std::ios::binary);\r\n    if (!p) {\r\n        std::cerr << \"Could not open plan output file\" << std::endl;\r\n        assert(false);\r\n    }\r\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\r\n\r\n    // Close everything down\r\n    delete serialized_engine;\r\n    delete config;\r\n    delete builder;\r\n}\r\n\r\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\r\n                        IExecutionContext** context) {\r\n    std::ifstream file(engine_name, std::ios::binary);\r\n    if (!file.good()) {\r\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\r\n        assert(false);\r\n    }\r\n    size_t size = 0;\r\n    file.seekg(0, file.end);\r\n    size = file.tellg();\r\n    file.seekg(0, file.beg);\r\n    char* serialized_engine = new char[size];\r\n    assert(serialized_engine);\r\n    file.read(serialized_engine, size);\r\n    file.close();\r\n\r\n    *runtime = createInferRuntime(gLogger);\r\n    assert(*runtime);\r\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\r\n    assert(*engine);\r\n    *context = (*engine)->createExecutionContext();\r\n    assert(*context);\r\n    delete[] serialized_engine;\r\n}\r\n\r\nint 
main(int argc, char** argv) {\r\n    // yolov12-cls -s ../models/yolov12n-cls.wts ../models/yolov12-cls.fp32.trt n\r\n    // yolov12-cls -d ../models/yolov12n-cls.fp32.trt ../images\r\n    cudaSetDevice(kGpuId);\r\n    std::string wts_name;\r\n    std::string engine_name;\r\n    float gd = 0.0f, gw = 0.0f;\r\n    std::string img_dir;\r\n    std::string type;\r\n    int max_channels = 0;\r\n\r\n    if (!parse_args(argc, argv, wts_name, engine_name, gd, gw, img_dir, type, max_channels)) {\r\n        std::cerr << \"arguments not right!\" << std::endl;\r\n        std::cerr << \"./yolov12-cls -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to plan file\" << std::endl;\r\n        std::cerr << \"./yolov12-cls -d [.engine] ../images  // deserialize plan file and run inference\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // Create a model using the API directly and serialize it to a file\r\n    if (!wts_name.empty()) {\r\n        serialize_engine(gd, gw, wts_name, engine_name, type, max_channels);\r\n        return 0;\r\n    }\r\n\r\n    // Deserialize the engine from file\r\n    IRuntime* runtime = nullptr;\r\n    ICudaEngine* engine = nullptr;\r\n    IExecutionContext* context = nullptr;\r\n    deserialize_engine(engine_name, &runtime, &engine, &context);\r\n    cudaStream_t stream;\r\n    CUDA_CHECK(cudaStreamCreate(&stream));\r\n\r\n    // Prepare cpu and gpu buffers\r\n    float* device_buffers[2];\r\n    float* cpu_input_buffer = nullptr;\r\n    float* output_buffer_host = nullptr;\r\n    prepare_buffers(engine, &device_buffers[0], &device_buffers[1], &cpu_input_buffer, &output_buffer_host);\r\n\r\n    // Read images from directory\r\n    std::vector<std::string> file_names;\r\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\r\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // Read imagenet labels\r\n    auto classes = read_classes(\"imagenet_classes.txt\");\r\n\r\n    // batch 
predict\r\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\r\n        // Get a batch of images\r\n        std::vector<cv::Mat> img_batch;\r\n        std::vector<std::string> img_name_batch;\r\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\r\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\r\n            img_batch.push_back(img);\r\n            img_name_batch.push_back(file_names[j]);\r\n        }\r\n\r\n        // Preprocess\r\n        batch_preprocess(img_batch, cpu_input_buffer);\r\n\r\n        // Run inference\r\n        auto start = std::chrono::system_clock::now();\r\n        infer(*context, stream, (void**)device_buffers, cpu_input_buffer, output_buffer_host, kBatchSize);\r\n        auto end = std::chrono::system_clock::now();\r\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\r\n                  << \"ms\" << std::endl;\r\n\r\n        // Postprocess and get top-k result\r\n        for (size_t b = 0; b < img_name_batch.size(); b++) {\r\n            float* p = &output_buffer_host[b * kOutputSize];\r\n            auto res = softmax(p, kOutputSize);\r\n            auto topk_idx = topk(res, 3);\r\n            std::cout << img_name_batch[b] << std::endl;\r\n            for (auto idx : topk_idx) {\r\n                std::cout << \"  \" << classes[idx] << \" \" << res[idx] << std::endl;\r\n            }\r\n        }\r\n    }\r\n\r\n    // Release stream and buffers\r\n    cudaStreamDestroy(stream);\r\n    CUDA_CHECK(cudaFree(device_buffers[0]));\r\n    CUDA_CHECK(cudaFree(device_buffers[1]));\r\n    delete[] cpu_input_buffer;\r\n    delete[] output_buffer_host;\r\n    // Destroy the engine\r\n    delete context;\r\n    delete engine;\r\n    delete runtime;\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "yolov12-tubro/yolov12_cls_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport os\nimport shutil\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport torch\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\nclass YoLov12TRT(object):\n    \"\"\"\n    description: A YOLOv12 class that warps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n        self.mean = (0.485, 0.456, 0.406)\n        self.std = (0.229, 0.224, 0.225)\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            self.batch_size = engine.get_binding_shape(binding)[0]\n            size = trt.volume(engine.get_binding_shape(\n                binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            
cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_input_image = np.empty(\n            shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            batch_image_raw.append(image_raw)\n            input_image = self.preprocess_cls_image(image_raw)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input 
data to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # host_outputs[0] holds the flattened output of the whole batch\n        output = host_outputs[0]\n        # Do postprocess once for the whole batch, then label each image that was read\n        classes_ls, predicted_conf_ls, category_id_ls = self.postprocess_cls(\n            output)\n        for i in range(len(batch_image_raw)):\n            cv2.putText(batch_image_raw[i], str(\n                classes_ls[i]), (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 1, cv2.LINE_AA)\n        print(classes_ls, predicted_conf_ls)\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read the images of a batch from their paths\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Prepare all-zero images for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_cls_image(self, raw_bgr_image, dst_width=224, dst_height=224):\n        \"\"\"\n            description: Convert BGR image to RGB,\n                         crop the center square frame,\n                         resize it to target size, normalize to [0,1],\n                         transform to NCHW format.\n            
param:\n                raw_bgr_image: numpy array, raw BGR image\n                dst_width: int, target image width\n                dst_height: int, target image height\n            return:\n                image:  the processed image\n                image_raw: the original image\n                h: original height\n                w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        # Crop the center square frame\n        m = min(h, w)\n        top = (h - m) // 2\n        left = (w - m) // 2\n        image = raw_bgr_image[top:top + m, left:left + m]\n\n        # Resize the image with target size while maintaining ratio\n        image = cv2.resize(image, (dst_width, dst_height), interpolation=cv2.INTER_LINEAR)\n\n        # Convert BGR to RGB\n        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)\n\n        # Normalize to [0,1]\n        image = image.astype(np.float32) / 255.0\n\n        # HWC to CHW format\n        image = image.transpose(2, 0, 1)\n\n        # CHW to NCHW format (add batch dimension)\n        image = np.expand_dims(image, axis=0)\n\n        # Convert the image to row-major order, also known as \"C order\"\n        image = np.ascontiguousarray(image)\n\n        batch_data = np.expand_dims(image, axis=0)\n\n        return batch_data\n\n    def postprocess_cls(self, output_data):\n        classes_ls = []\n        predicted_conf_ls = []\n        category_id_ls = []\n        output_data = output_data.reshape(self.batch_size, -1)\n        output_data = torch.Tensor(output_data)\n        p = torch.nn.functional.softmax(output_data, dim=1)\n        score, index = torch.topk(p, 3)\n        for ind in range(index.shape[0]):\n            input_category_id = index[ind][0].item()  # 716\n            category_id_ls.append(input_category_id)\n            predicted_confidence = score[ind][0].item()\n            predicted_conf_ls.append(predicted_confidence)\n            
classes_ls.append(classes[input_category_id])\n        return classes_ls, predicted_conf_ls, category_id_ls\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov12_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov12_wrapper = yolov12_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov12_wrapper.infer(\n            self.yolov12_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(\n            self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov12_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov12_wrapper = yolov12_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov12_wrapper.infer(\n            self.yolov12_wrapper.get_raw_image_zeros())\n        print(\n            'warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\n# with open(\"imagenet_classes.txt\") as f:\n#     classes = [line.strip() for line in f.readlines()]\n\nclasses = [\"daisy\", \"dandelion\", \"rose\", \"sunflower\", \"tulip\"]\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    engine_file_path = \"build/yolov12n-cls-5.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov12TRT instance\n    yolov12_wrapper = YoLov12TRT(engine_file_path)\n    try:\n        print('batch size is', yolov12_wrapper.batch_size)\n\n        image_dir = \"images\"\n        
image_path_batches = get_img_path_batches(\n            yolov12_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov12_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov12_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov12_wrapper.destroy()\n"
  },
  {
    "path": "yolov12-tubro/yolov12_det.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, float& gd, float& gw, int& max_channels,\n                      std::string& type) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    serialized_engine = buildEngineYolov12Det(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = 
(*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host, float** decode_ptr_host, float** decode_ptr_device,\n                    std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than ICudaEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"GPU post processing does not yet support batch sizes greater than 1\" << std::endl;\n            exit(1);\n        }\n        // Allocate host and device buffers for the decoded boxes\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           float* decode_ptr_host, float* decode_ptr_device, int model_bboxes, std::string cuda_post_process) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueueV2(buffers, 
stream, nullptr);\n    if (cuda_post_process == \"c\") {\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        // Wait for the asynchronous work to finish so the measured time covers the actual execution\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  // cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        // Wait for the asynchronous work to finish so the measured time covers the actual execution\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir, std::string& type,\n                std::string& cuda_post_process, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        auto sub_type = std::string(argv[4]);\n\n        if (sub_type[0] == 'n') {\n            gd = 0.50;\n            gw = 0.25;\n            max_channels = 1024;\n            type = \"n\";\n        } else if (sub_type[0] == 's') {\n       
     gd = 0.50;\n            gw = 0.50;\n            max_channels = 1024;\n            type = \"s\";\n        } else if (sub_type[0] == 'm') {\n            gd = 0.50;\n            gw = 1.00;\n            max_channels = 512;\n            type = \"m\";\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n            type = \"l\";\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.50;\n            max_channels = 512;\n            type = \"x\";\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    // yolov12_det -s ../models/yolov12n.wts ../models/yolov12n.fp32.trt n\n    // yolov12_det -d ../models/yolov12n.fp32.trt ../images c\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    std::string img_dir;\n    std::string cuda_post_process;\n    std::string type;\n    int model_bboxes = 0;\n    float gd = 0, gw = 0;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, cuda_post_process, gd, gw, max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolov12_det -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr << \"./yolov12_det -d [.engine] ../images [c/g]  // deserialize plan file and run inference\"\n                  << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, gd, gw, max_channels, type);\n        
return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host, &decode_ptr_host,\n                   &decode_ptr_device, cuda_post_process);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize, decode_ptr_host,\n              decode_ptr_device, model_bboxes, cuda_post_process);\n        // Save the first 100 values of output_buffer_host, one per line\n        //        std::ofstream out(\"../models/output.txt\");\n        //        for (int j = 0; j < 
100; j++) {\n        //            out << output_buffer_host[j] << std::endl;\n        //        }\n        //        out.close();\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n        } else if (cuda_post_process == \"g\") {\n            //Process gpu decode and nms results\n            batch_process(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\n        }\n        // Draw bounding boxes\n        draw_bbox(img_batch, res_batch);\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov12-tubro/yolov12_det_trt.py",
    "content": "\"\"\"\r\nAn example that uses TensorRT's Python api to make inferences.\r\n\"\"\"\r\nimport ctypes\r\nimport os\r\nimport shutil\r\nimport random\r\nimport sys\r\nimport threading\r\nimport time\r\nimport cv2\r\nimport numpy as np\r\nimport pycuda.autoinit  # noqa: F401\r\nimport pycuda.driver as cuda\r\nimport tensorrt as trt\r\n\r\nCONF_THRESH = 0.5\r\nIOU_THRESHOLD = 0.4\r\nPOSE_NUM = 17 * 3\r\nDET_NUM = 6\r\nSEG_NUM = 32\r\nOBB_NUM = 1\r\n\r\n\r\ndef get_img_path_batches(batch_size, img_dir):\r\n    ret = []\r\n    batch = []\r\n    for root, dirs, files in os.walk(img_dir):\r\n        for name in files:\r\n            if len(batch) == batch_size:\r\n                ret.append(batch)\r\n                batch = []\r\n            batch.append(os.path.join(root, name))\r\n    if len(batch) > 0:\r\n        ret.append(batch)\r\n    return ret\r\n\r\n\r\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\r\n    \"\"\"\r\n    description: Plots one bounding box on image img,\r\n                 this function comes from YoLo11 project.\r\n    param:\r\n        x:      a box likes [x1,y1,x2,y2]\r\n        img:    a opencv image object\r\n        color:  color to draw rectangle, such as (0,255,0)\r\n        label:  str\r\n        line_thickness: int\r\n    return:\r\n        no return\r\n\r\n    \"\"\"\r\n    tl = (\r\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\r\n    )  # line/font thickness\r\n    color = color or [random.randint(0, 255) for _ in range(3)]\r\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\r\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\r\n    if label:\r\n        tf = max(tl - 1, 1)  # font thickness\r\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\r\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\r\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\r\n        cv2.putText(\r\n   
         img,\r\n            label,\r\n            (c1[0], c1[1] - 2),\r\n            0,\r\n            tl / 3,\r\n            [225, 255, 255],\r\n            thickness=tf,\r\n            lineType=cv2.LINE_AA,\r\n        )\r\n\r\n\r\nclass YoLo12TRT(object):\r\n    \"\"\"\r\n    description: A YOLO11 class that warps TensorRT ops, preprocess and postprocess ops.\r\n    \"\"\"\r\n\r\n    def __init__(self, engine_file_path):\r\n        # Create a Context on this device,\r\n        self.ctx = cuda.Device(0).make_context()\r\n        stream = cuda.Stream()\r\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\r\n        runtime = trt.Runtime(TRT_LOGGER)\r\n\r\n        # Deserialize the engine from file\r\n        with open(engine_file_path, \"rb\") as f:\r\n            engine = runtime.deserialize_cuda_engine(f.read())\r\n        context = engine.create_execution_context()\r\n\r\n        host_inputs = []\r\n        cuda_inputs = []\r\n        host_outputs = []\r\n        cuda_outputs = []\r\n        bindings = []\r\n\r\n        for binding in engine:\r\n            print('bingding:', binding, engine.get_binding_shape(binding))\r\n            self.batch_size = engine.get_binding_shape(binding)[0]\r\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\r\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\r\n            # Allocate host and device buffers\r\n            host_mem = cuda.pagelocked_empty(size, dtype)\r\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\r\n            # Append the device buffer to device bindings.\r\n            bindings.append(int(cuda_mem))\r\n            # Append to the appropriate list.\r\n            if engine.binding_is_input(binding):\r\n                self.input_w = engine.get_binding_shape(binding)[-1]\r\n                self.input_h = engine.get_binding_shape(binding)[-2]\r\n                host_inputs.append(host_mem)\r\n                cuda_inputs.append(cuda_mem)\r\n        
    else:\r\n                host_outputs.append(host_mem)\r\n                cuda_outputs.append(cuda_mem)\r\n\r\n        # Store\r\n        self.stream = stream\r\n        self.context = context\r\n        self.engine = engine\r\n        self.host_inputs = host_inputs\r\n        self.cuda_inputs = cuda_inputs\r\n        self.host_outputs = host_outputs\r\n        self.cuda_outputs = cuda_outputs\r\n        self.bindings = bindings\r\n        self.det_output_length = host_outputs[0].shape[0]\r\n\r\n    def infer(self, raw_image_generator):\r\n        threading.Thread.__init__(self)\r\n        # Make self the active context, pushing it on top of the context stack.\r\n        self.ctx.push()\r\n        # Restore\r\n        stream = self.stream\r\n        context = self.context\r\n        host_inputs = self.host_inputs\r\n        cuda_inputs = self.cuda_inputs\r\n        host_outputs = self.host_outputs\r\n        cuda_outputs = self.cuda_outputs\r\n        bindings = self.bindings\r\n        # Do image preprocess\r\n        batch_image_raw = []\r\n        batch_origin_h = []\r\n        batch_origin_w = []\r\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\r\n        for i, image_raw in enumerate(raw_image_generator):\r\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\r\n            batch_image_raw.append(image_raw)\r\n            batch_origin_h.append(origin_h)\r\n            batch_origin_w.append(origin_w)\r\n            np.copyto(batch_input_image[i], input_image)\r\n        batch_input_image = np.ascontiguousarray(batch_input_image)\r\n\r\n        # Copy input image to host buffer\r\n        np.copyto(host_inputs[0], batch_input_image.ravel())\r\n        start = time.time()\r\n        # Transfer input data  to the GPU.\r\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\r\n        # Run inference.\r\n        context.execute_async_v2(bindings=bindings, 
stream_handle=stream.handle)\r\n        # Transfer predictions back from the GPU.\r\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\r\n        # Synchronize the stream\r\n        stream.synchronize()\r\n        end = time.time()\r\n        # Remove any context from the top of the context stack, deactivating it.\r\n        self.ctx.pop()\r\n        # host_outputs[0] is the flattened output buffer; each image's detections are sliced out below\r\n        output = host_outputs[0]\r\n        # Do postprocess for every image that was actually read (the last batch may be partial)\r\n        for i in range(len(batch_image_raw)):\r\n            result_boxes, result_scores, result_classid = self.post_process(\r\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\r\n                batch_origin_w[i]\r\n            )\r\n            # Draw rectangles and labels on the original image\r\n            for j in range(len(result_boxes)):\r\n                box = result_boxes[j]\r\n                plot_one_box(\r\n                    box,\r\n                    batch_image_raw[i],\r\n                    label=\"{}:{:.2f}\".format(\r\n                        categories[int(result_classid[j])], result_scores[j]\r\n                    ),\r\n                )\r\n        return batch_image_raw, end - start\r\n\r\n    def destroy(self):\r\n        # Remove any context from the top of the context stack, deactivating it.\r\n        self.ctx.pop()\r\n\r\n    def get_raw_image(self, image_path_batch):\r\n        \"\"\"\r\n        description: Read the images of a batch from their paths\r\n        \"\"\"\r\n        for img_path in image_path_batch:\r\n            yield cv2.imread(img_path)\r\n\r\n    def get_raw_image_zeros(self, image_path_batch=None):\r\n        \"\"\"\r\n        description: Prepare all-zero images for warmup\r\n        \"\"\"\r\n        for _ in range(self.batch_size):\r\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\r\n\r\n    def preprocess_image(self, raw_bgr_image):\r\n        \"\"\"\r\n        
description: Convert BGR image to RGB,\r\n                     resize and pad it to target size, normalize to [0,1],\r\n                     transform to NCHW format.\r\n        param:\r\n            raw_bgr_image: numpy array, raw BGR image\r\n        return:\r\n            image:  the processed image\r\n            image_raw: the original image\r\n            h: original height\r\n            w: original width\r\n        \"\"\"\r\n        image_raw = raw_bgr_image\r\n        h, w, c = image_raw.shape\r\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\r\n        # Calculate the resized width and height and the paddings\r\n        r_w = self.input_w / w\r\n        r_h = self.input_h / h\r\n        if r_h > r_w:\r\n            tw = self.input_w\r\n            th = int(r_w * h)\r\n            tx1 = tx2 = 0\r\n            ty1 = int((self.input_h - th) / 2)\r\n            ty2 = self.input_h - th - ty1\r\n        else:\r\n            tw = int(r_h * w)\r\n            th = self.input_h\r\n            tx1 = int((self.input_w - tw) / 2)\r\n            tx2 = self.input_w - tw - tx1\r\n            ty1 = ty2 = 0\r\n        # Resize the image along its long side while maintaining the aspect ratio\r\n        image = cv2.resize(image, (tw, th))\r\n        # Pad the short side with (128,128,128)\r\n        image = cv2.copyMakeBorder(\r\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\r\n        )\r\n        image = image.astype(np.float32)\r\n        # Normalize to [0,1]\r\n        image /= 255.0\r\n        # HWC to CHW format:\r\n        image = np.transpose(image, [2, 0, 1])\r\n        # CHW to NCHW format\r\n        image = np.expand_dims(image, axis=0)\r\n        # Convert the image to row-major order, also known as \"C order\":\r\n        image = np.ascontiguousarray(image)\r\n        return image, image_raw, h, w\r\n\r\n    def xywh2xyxy(self, origin_h, origin_w, x):\r\n        \"\"\"\r\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] 
where xy1=top-left, xy2=bottom-right\r\n        param:\r\n            origin_h:   height of original image\r\n            origin_w:   width of original image\r\n            x:          a numpy array of boxes, each row is a box [center_x, center_y, w, h]\r\n        return:\r\n            y:          a numpy array of boxes, each row is a box [x1, y1, x2, y2]\r\n        \"\"\"\r\n        y = np.zeros_like(x)\r\n        r_w = self.input_w / origin_w\r\n        r_h = self.input_h / origin_h\r\n        if r_h > r_w:\r\n            # Image was padded at top and bottom: remove the vertical offset, then undo the scale\r\n            y[:, 0] = x[:, 0]\r\n            y[:, 2] = x[:, 2]\r\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\r\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\r\n            y /= r_w\r\n        else:\r\n            # Image was padded at left and right: remove the horizontal offset, then undo the scale\r\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\r\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\r\n            y[:, 1] = x[:, 1]\r\n            y[:, 3] = x[:, 3]\r\n            y /= r_h\r\n\r\n        return y\r\n\r\n    def post_process(self, output, origin_h, origin_w):\r\n        \"\"\"\r\n        description: postprocess the prediction\r\n        param:\r\n            output:     a numpy array like [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]\r\n            origin_h:   height of original image\r\n            origin_w:   width of original image\r\n        return:\r\n            result_boxes: final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\r\n            result_scores: final scores, a numpy array, each element is the score corresponding to a box\r\n            result_classid: final class ids, a numpy array, each element is the class id corresponding to a box\r\n        \"\"\"\r\n        num_values_per_detection = DET_NUM + SEG_NUM + POSE_NUM + OBB_NUM\r\n        # Get the number of boxes detected\r\n        num = int(output[0])\r\n        print(\"There are {} detections in the picture!!!\".format(num))\r\n        # Reshape to a two-dimensional ndarray\r\n        # pred = 
np.reshape(output[1:], (-1, 38))[:num, :]\r\n        pred = np.reshape(output[1:], (-1, num_values_per_detection))[:num, :]\r\n        # Do nms\r\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\r\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\r\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\r\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\r\n        return result_boxes, result_scores, result_classid\r\n\r\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\r\n        \"\"\"\r\n        description: compute the IoU of two bounding boxes\r\n        param:\r\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\r\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\r\n            x1y1x2y2: select the coordinate format\r\n        return:\r\n            iou: computed iou\r\n        \"\"\"\r\n        if not x1y1x2y2:\r\n            # Transform from center and width to exact coordinates\r\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\r\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\r\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\r\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\r\n        else:\r\n            # Get the coordinates of bounding boxes\r\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\r\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\r\n\r\n        # Get the coordinates of the intersection rectangle\r\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\r\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\r\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\r\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\r\n        # Intersection area\r\n        
inter_area = (np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None)\r\n                      * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None))\r\n        # Areas of the two boxes, used to compute the union\r\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\r\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\r\n\r\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\r\n\r\n        return iou\r\n\r\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\r\n        \"\"\"\r\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\r\n        Non-Maximum Suppression to further filter detections.\r\n        param:\r\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\r\n            origin_h: original image height\r\n            origin_w: original image width\r\n            conf_thres: a confidence threshold to filter detections\r\n            nms_thres: an IoU threshold to filter detections\r\n        return:\r\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\r\n        \"\"\"\r\n        # Keep the boxes whose score is >= conf_thres\r\n        boxes = prediction[prediction[:, 4] >= conf_thres]\r\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\r\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\r\n        # Clip the coordinates to the original image\r\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\r\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\r\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\r\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\r\n        # Object confidence\r\n        confs = boxes[:, 4]\r\n        # Sort by confidence, highest first\r\n        boxes = boxes[np.argsort(-confs)]\r\n        # Perform non-maximum suppression\r\n        keep_boxes = []\r\n        while boxes.shape[0]:\r\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, 
:4], 0), boxes[:, :4]) > nms_thres\r\n            label_match = boxes[0, -1] == boxes[:, -1]\r\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\r\n            invalid = large_overlap & label_match\r\n            keep_boxes += [boxes[0]]\r\n            boxes = boxes[~invalid]\r\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\r\n        return boxes\r\n\r\n\r\nclass inferThread(threading.Thread):\r\n    def __init__(self, yolo11_wrapper, image_path_batch):\r\n        threading.Thread.__init__(self)\r\n        self.yolo11_wrapper = yolo11_wrapper\r\n        self.image_path_batch = image_path_batch\r\n\r\n    def run(self):\r\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image(self.image_path_batch))\r\n        for i, img_path in enumerate(self.image_path_batch):\r\n            parent, filename = os.path.split(img_path)\r\n            save_name = os.path.join('output', filename)\r\n            # Save image\r\n            cv2.imwrite(save_name, batch_image_raw[i])\r\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\r\n\r\n\r\nclass warmUpThread(threading.Thread):\r\n    def __init__(self, yolo11_wrapper):\r\n        threading.Thread.__init__(self)\r\n        self.yolo11_wrapper = yolo11_wrapper\r\n\r\n    def run(self):\r\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image_zeros())\r\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    # load custom plugin and engine\r\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\r\n    engine_file_path = \"build/yolov12n-det.engine\"\r\n\r\n    if len(sys.argv) > 1:\r\n        engine_file_path = sys.argv[1]\r\n    if len(sys.argv) > 2:\r\n        PLUGIN_LIBRARY = sys.argv[2]\r\n\r\n    ctypes.CDLL(PLUGIN_LIBRARY)\r\n\r\n   
 # load coco labels\r\n    categories = [\"object\"]\r\n\r\n    # categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\r\n    #               \"traffic light\",\r\n    #               \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\",\r\n    #               \"cow\", \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\r\n    #               \"frisbee\",\r\n    #               \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\r\n    #               \"surfboard\",\r\n    #               \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\r\n    #               \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\r\n    #               \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\r\n    #               \"cell phone\",\r\n    #               \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\r\n    #               \"teddy bear\",\r\n    #               \"hair drier\", \"toothbrush\"]\r\n\r\n    if os.path.exists('output/'):\r\n        shutil.rmtree('output/')\r\n    os.makedirs('output/')\r\n    # a YoLo11TRT instance\r\n    yolov12_wrapper = YoLo12TRT(engine_file_path)\r\n    try:\r\n        print('batch size is', yolov12_wrapper.batch_size)\r\n\r\n        image_dir = \"images\"\r\n        image_path_batches = get_img_path_batches(yolov12_wrapper.batch_size, image_dir)\r\n\r\n        for i in range(10):\r\n            # create a new thread to do warm_up\r\n            thread1 = warmUpThread(yolov12_wrapper)\r\n            thread1.start()\r\n            thread1.join()\r\n 
       for batch in image_path_batches:\r\n            # create a new thread to do inference\r\n            thread1 = inferThread(yolov12_wrapper, batch)\r\n            thread1.start()\r\n            thread1.join()\r\n    finally:\r\n        # destroy the instance\r\n        yolov12_wrapper.destroy()\r\n"
  },
  {
    "path": "yolov12-tubro/yolov12_seg.cpp",
    "content": "\r\n#include <fstream>\r\n#include <iostream>\r\n#include <opencv2/opencv.hpp>\r\n#include \"cuda_utils.h\"\r\n#include \"logging.h\"\r\n#include \"model.h\"\r\n#include \"postprocess.h\"\r\n#include \"preprocess.h\"\r\n#include \"utils.h\"\r\n\r\nLogger gLogger;\r\nusing namespace nvinfer1;\r\nconst int kOutputSize = kMaxNumOutputBbox * (sizeof(Detection) - sizeof(float) * 51) / sizeof(float) + 1;\r\nconst static int kOutputSegSize = 32 * (kInputH / 4) * (kInputW / 4);\r\n\r\nstatic cv::Rect get_downscale_rect(float bbox[4], float scale) {\r\n\r\n    float left = bbox[0];\r\n    float top = bbox[1];\r\n    float right = bbox[0] + bbox[2];\r\n    float bottom = bbox[1] + bbox[3];\r\n\r\n    left = left < 0 ? 0 : left;\r\n    top = top < 0 ? 0 : top;\r\n    right = right > kInputW ? kInputW : right;\r\n    bottom = bottom > kInputH ? kInputH : bottom;\r\n\r\n    left /= scale;\r\n    top /= scale;\r\n    right /= scale;\r\n    bottom /= scale;\r\n    return cv::Rect(int(left), int(top), int(right - left), int(bottom - top));\r\n}\r\n\r\nstd::vector<cv::Mat> process_mask(const float* proto, int proto_size, std::vector<Detection>& dets) {\r\n\r\n    std::vector<cv::Mat> masks;\r\n    for (size_t i = 0; i < dets.size(); i++) {\r\n\r\n        cv::Mat mask_mat = cv::Mat::zeros(kInputH / 4, kInputW / 4, CV_32FC1);\r\n        auto r = get_downscale_rect(dets[i].bbox, 4);\r\n\r\n        for (int x = r.x; x < r.x + r.width; x++) {\r\n            for (int y = r.y; y < r.y + r.height; y++) {\r\n                float e = 0.0f;\r\n                for (int j = 0; j < 32; j++) {\r\n                    e += dets[i].mask[j] * proto[j * proto_size / 32 + y * mask_mat.cols + x];\r\n                }\r\n                e = 1.0f / (1.0f + expf(-e));\r\n                mask_mat.at<float>(y, x) = e;\r\n            }\r\n        }\r\n        cv::resize(mask_mat, mask_mat, cv::Size(kInputW, kInputH));\r\n        masks.push_back(mask_mat);\r\n    }\r\n    return 
masks;\r\n}\r\n\r\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, std::string& type, float& gd, float& gw,\r\n                      int& max_channels) {\r\n    IBuilder* builder = createInferBuilder(gLogger);\r\n    IBuilderConfig* config = builder->createBuilderConfig();\r\n    IHostMemory* serialized_engine = nullptr;\r\n\r\n    serialized_engine = buildEngineYolov12Seg(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\r\n\r\n    assert(serialized_engine);\r\n    std::ofstream p(engine_name, std::ios::binary);\r\n    if (!p) {\r\n        std::cout << \"could not open plan output file\" << std::endl;\r\n        assert(false);\r\n    }\r\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\r\n\r\n    delete serialized_engine;\r\n    delete config;\r\n    delete builder;\r\n}\r\n\r\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\r\n                        IExecutionContext** context) {\r\n    std::ifstream file(engine_name, std::ios::binary);\r\n    if (!file.good()) {\r\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\r\n        assert(false);\r\n    }\r\n    size_t size = 0;\r\n    file.seekg(0, file.end);\r\n    size = file.tellg();\r\n    file.seekg(0, file.beg);\r\n    char* serialized_engine = new char[size];\r\n    assert(serialized_engine);\r\n    file.read(serialized_engine, size);\r\n    file.close();\r\n\r\n    *runtime = createInferRuntime(gLogger);\r\n    assert(*runtime);\r\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\r\n    assert(*engine);\r\n    *context = (*engine)->createExecutionContext();\r\n    assert(*context);\r\n    delete[] serialized_engine;\r\n}\r\n\r\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\r\n                    float** output_seg_buffer_device, float** output_buffer_host, float** 
output_seg_buffer_host,\r\n                    float** decode_ptr_host, float** decode_ptr_device, std::string cuda_post_process) {\r\n    assert(engine->getNbBindings() == 3);\r\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\r\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\r\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\r\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\r\n    const int outputIndex_seg = engine->getBindingIndex(\"proto\");\r\n\r\n    assert(inputIndex == 0);\r\n    assert(outputIndex == 1);\r\n    assert(outputIndex_seg == 2);\r\n    // Create GPU buffers on device\r\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\r\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\r\n    CUDA_CHECK(cudaMalloc((void**)output_seg_buffer_device, kBatchSize * kOutputSegSize * sizeof(float)));\r\n\r\n    if (cuda_post_process == \"c\") {\r\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\r\n        *output_seg_buffer_host = new float[kBatchSize * kOutputSegSize];\r\n    } else if (cuda_post_process == \"g\") {\r\n        if (kBatchSize > 1) {\r\n            std::cerr << \"Do not yet support GPU post processing for multiple batches\" << std::endl;\r\n            exit(0);\r\n        }\r\n        // Allocate memory for decode_ptr_host and copy to device\r\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\r\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\r\n    }\r\n}\r\n\r\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, float* output_seg,\r\n           int batchsize, float* decode_ptr_host, float* decode_ptr_device, int model_bboxes,\r\n           std::string 
cuda_post_process) {\r\n    // infer on the batch asynchronously, and DMA output back to host\r\n    auto start = std::chrono::system_clock::now();\r\n    context.enqueueV2(buffers, stream, nullptr);\r\n    if (cuda_post_process == \"c\") {\r\n\r\n        std::cout << \"kOutputSize:\" << kOutputSize << std::endl;\r\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\r\n                                   stream));\r\n        std::cout << \"kOutputSegSize:\" << kOutputSegSize << std::endl;\r\n        CUDA_CHECK(cudaMemcpyAsync(output_seg, buffers[2], batchsize * kOutputSegSize * sizeof(float),\r\n                                   cudaMemcpyDeviceToHost, stream));\r\n\r\n        auto end = std::chrono::system_clock::now();\r\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\r\n                  << \"ms\" << std::endl;\r\n    } else if (cuda_post_process == \"g\") {\r\n        CUDA_CHECK(\r\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\r\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\r\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\r\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\r\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\r\n                                   stream));\r\n        auto end = std::chrono::system_clock::now();\r\n        std::cout << \"inference and gpu postprocess time: \"\r\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\r\n    }\r\n\r\n    CUDA_CHECK(cudaStreamSynchronize(stream));\r\n}\r\n\r\nbool parse_args(int argc, char** argv, std::string& wts, std::string& 
engine, std::string& img_dir, std::string& type,\r\n                std::string& cuda_post_process, std::string& labels_filename, float& gd, float& gw, int& max_channels) {\r\n    if (argc < 4)\r\n        return false;\r\n    if (std::string(argv[1]) == \"-s\" && argc == 5) {\r\n        wts = std::string(argv[2]);\r\n        engine = std::string(argv[3]);\r\n        std::string sub_type = std::string(argv[4]);\r\n        if (sub_type[0] == 'n') {\r\n            gd = 0.50;\r\n            gw = 0.25;\r\n            max_channels = 1024;\r\n            type = \"n\";\r\n        } else if (sub_type[0] == 's') {\r\n            gd = 0.50;\r\n            gw = 0.50;\r\n            max_channels = 1024;\r\n            type = \"s\";\r\n        } else if (sub_type[0] == 'm') {\r\n            gd = 0.50;\r\n            gw = 1.00;\r\n            max_channels = 512;\r\n            type = \"m\";\r\n        } else if (sub_type[0] == 'l') {\r\n            gd = 1.0;\r\n            gw = 1.0;\r\n            max_channels = 512;\r\n            type = \"l\";\r\n        } else if (sub_type[0] == 'x') {\r\n            gd = 1.0;\r\n            gw = 1.50;\r\n            max_channels = 512;\r\n            type = \"x\";\r\n        } else {\r\n            return false;\r\n        }\r\n    } else if (std::string(argv[1]) == \"-d\" && argc == 6) {\r\n        engine = std::string(argv[2]);\r\n        img_dir = std::string(argv[3]);\r\n        cuda_post_process = std::string(argv[4]);\r\n        labels_filename = std::string(argv[5]);\r\n    } else {\r\n        return false;\r\n    }\r\n    return true;\r\n}\r\n\r\nint main(int argc, char** argv) {\r\n    // yolo11_seg -s ../models/yolo11n-seg.wts ../models/yolo11n-seg.fp32.trt n\r\n    // yolo11_seg -d ../models/yolo11n-seg.fp32.trt ../images c coco.txt\r\n    cudaSetDevice(kGpuId);\r\n    std::string wts_name;\r\n    std::string engine_name;\r\n    std::string img_dir;\r\n    std::string type;\r\n    std::string cuda_post_process;\r\n    std::string 
labels_filename = \"coco.txt\";\r\n    int model_bboxes = 0;\r\n    float gd = 0.0f, gw = 0.0f;\r\n    int max_channels = 0;\r\n\r\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, cuda_post_process, labels_filename, gd, gw,\r\n                    max_channels)) {\r\n        std::cerr << \"Invalid arguments!\" << std::endl;\r\n        std::cerr << \"./yolo11_seg -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to plan file\" << std::endl;\r\n        std::cerr << \"./yolo11_seg -d [.engine] ../images [c/g] coco_file  // deserialize plan file and run inference\"\r\n                  << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    // Create a model using the API directly and serialize it to a file\r\n    if (!wts_name.empty()) {\r\n        serialize_engine(wts_name, engine_name, type, gd, gw, max_channels);\r\n        return 0;\r\n    }\r\n\r\n    // Deserialize the engine from file\r\n    IRuntime* runtime = nullptr;\r\n    ICudaEngine* engine = nullptr;\r\n    IExecutionContext* context = nullptr;\r\n    deserialize_engine(engine_name, &runtime, &engine, &context);\r\n    cudaStream_t stream;\r\n    CUDA_CHECK(cudaStreamCreate(&stream));\r\n    cuda_preprocess_init(kMaxInputImageSize);\r\n    auto out_dims = engine->getBindingDimensions(1);\r\n    model_bboxes = out_dims.d[0];\r\n    // Prepare cpu and gpu buffers\r\n    float* device_buffers[3];\r\n    float* output_buffer_host = nullptr;\r\n    float* output_seg_buffer_host = nullptr;\r\n    float* decode_ptr_host = nullptr;\r\n    float* decode_ptr_device = nullptr;\r\n\r\n    // Read images from directory\r\n    std::vector<std::string> file_names;\r\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\r\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\r\n        return -1;\r\n    }\r\n\r\n    std::unordered_map<int, std::string> labels_map;\r\n    read_labels(labels_filename, labels_map);\r\n    assert(kNumClass == labels_map.size());\r\n\r\n    
prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &device_buffers[2], &output_buffer_host,\r\n                   &output_seg_buffer_host, &decode_ptr_host, &decode_ptr_device, cuda_post_process);\r\n\r\n    // // batch predict\r\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\r\n        // Get a batch of images\r\n        std::vector<cv::Mat> img_batch;\r\n        std::vector<std::string> img_name_batch;\r\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\r\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\r\n            img_batch.push_back(img);\r\n            img_name_batch.push_back(file_names[j]);\r\n        }\r\n        // Preprocess\r\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\r\n        // Run inference\r\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, output_seg_buffer_host, kBatchSize,\r\n              decode_ptr_host, decode_ptr_device, model_bboxes, cuda_post_process);\r\n        std::vector<std::vector<Detection>> res_batch;\r\n        if (cuda_post_process == \"c\") {\r\n            // NMS\r\n            batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\r\n            for (size_t b = 0; b < img_batch.size(); b++) {\r\n                auto& res = res_batch[b];\r\n                cv::Mat img = img_batch[b];\r\n                auto masks = process_mask(&output_seg_buffer_host[b * kOutputSegSize], kOutputSegSize, res);\r\n                draw_mask_bbox(img, res, masks, labels_map);\r\n                cv::imwrite(\"_\" + img_name_batch[b], img);\r\n            }\r\n        } else if (cuda_post_process == \"g\") {\r\n            // Process gpu decode and nms results\r\n            // batch_process(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\r\n            // todo seg in gpu\r\n            std::cerr << \"seg_postprocess is not 
supported on the GPU right now\" << std::endl;\r\n        }\r\n    }\r\n\r\n    // Release stream and buffers\r\n    CUDA_CHECK(cudaStreamDestroy(stream));\r\n    CUDA_CHECK(cudaFree(device_buffers[0]));\r\n    CUDA_CHECK(cudaFree(device_buffers[1]));\r\n    CUDA_CHECK(cudaFree(device_buffers[2]));\r\n    CUDA_CHECK(cudaFree(decode_ptr_device));\r\n    delete[] decode_ptr_host;\r\n    delete[] output_buffer_host;\r\n    delete[] output_seg_buffer_host;\r\n    cuda_preprocess_destroy();\r\n    // Destroy the context, engine and runtime\r\n    delete context;\r\n    delete engine;\r\n    delete runtime;\r\n\r\n    // Print histogram of the output distribution\r\n    // std::cout << \"\\nOutput:\\n\\n\";\r\n    // for (unsigned int i = 0; i < kOutputSize; i++)\r\n    //{\r\n    //    std::cout << prob[i] << \", \";\r\n    //    if (i % 10 == 0) std::cout << std::endl;\r\n    //}\r\n    // std::cout << std::endl;\r\n\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "yolov12-tubro/yolov12_seg_trt.py",
    "content": "\"\"\"\r\nAn example that uses TensorRT's Python api to make inferences.\r\n\"\"\"\r\nimport ctypes\r\nimport os\r\nimport shutil\r\nimport random\r\nimport sys\r\nimport threading\r\nimport time\r\nimport cv2\r\nimport numpy as np\r\nimport pycuda.autoinit  # noqa: F401\r\nimport pycuda.driver as cuda\r\nimport tensorrt as trt\r\n\r\nCONF_THRESH = 0.5\r\nIOU_THRESHOLD = 0.4\r\nPOSE_NUM = 17 * 3\r\nDET_NUM = 6\r\nSEG_NUM = 32\r\nOBB_NUM = 1\r\n\r\n\r\ndef get_img_path_batches(batch_size, img_dir):\r\n    ret = []\r\n    batch = []\r\n    for root, dirs, files in os.walk(img_dir):\r\n        for name in files:\r\n            if len(batch) == batch_size:\r\n                ret.append(batch)\r\n                batch = []\r\n            batch.append(os.path.join(root, name))\r\n    if len(batch) > 0:\r\n        ret.append(batch)\r\n    return ret\r\n\r\n\r\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\r\n    \"\"\"\r\n    description: Plots one bounding box on image img,\r\n                 this function comes from YoLo11 project.\r\n    param:\r\n        x:      a box likes [x1,y1,x2,y2]\r\n        img:    a opencv image object\r\n        color:  color to draw rectangle, such as (0,255,0)\r\n        label:  str\r\n        line_thickness: int\r\n    return:\r\n        no return\r\n\r\n    \"\"\"\r\n    tl = (\r\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\r\n    )  # line/font thickness\r\n    color = color or [random.randint(0, 255) for _ in range(3)]\r\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\r\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\r\n    if label:\r\n        tf = max(tl - 1, 1)  # font thickness\r\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\r\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\r\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\r\n        cv2.putText(\r\n   
         img,\r\n            label,\r\n            (c1[0], c1[1] - 2),\r\n            0,\r\n            tl / 3,\r\n            [225, 255, 255],\r\n            thickness=tf,\r\n            lineType=cv2.LINE_AA,\r\n        )\r\n\r\n\r\nclass YoLo12TRT(object):\r\n    \"\"\"\r\n    description: A YOLO11 class that wraps TensorRT ops, preprocess and postprocess ops.\r\n    \"\"\"\r\n\r\n    def __init__(self, engine_file_path):\r\n        # Create a context on this device.\r\n        self.ctx = cuda.Device(0).make_context()\r\n        stream = cuda.Stream()\r\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\r\n        runtime = trt.Runtime(TRT_LOGGER)\r\n\r\n        # Deserialize the engine from file\r\n        with open(engine_file_path, \"rb\") as f:\r\n            engine = runtime.deserialize_cuda_engine(f.read())\r\n        context = engine.create_execution_context()\r\n\r\n        host_inputs = []\r\n        cuda_inputs = []\r\n        host_outputs = []\r\n        cuda_outputs = []\r\n        bindings = []\r\n\r\n        for binding in engine:\r\n            print('binding:', binding, engine.get_binding_shape(binding))\r\n            self.batch_size = engine.get_binding_shape(binding)[0]\r\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\r\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\r\n            # Allocate host and device buffers\r\n            host_mem = cuda.pagelocked_empty(size, dtype)\r\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\r\n            # Append the device buffer to device bindings.\r\n            bindings.append(int(cuda_mem))\r\n            # Append to the appropriate list.\r\n            if engine.binding_is_input(binding):\r\n                self.input_w = engine.get_binding_shape(binding)[-1]\r\n                self.input_h = engine.get_binding_shape(binding)[-2]\r\n                host_inputs.append(host_mem)\r\n                cuda_inputs.append(cuda_mem)\r\n        
    else:\r\n                host_outputs.append(host_mem)\r\n                cuda_outputs.append(cuda_mem)\r\n\r\n        # Store\r\n        self.stream = stream\r\n        self.context = context\r\n        self.engine = engine\r\n        self.host_inputs = host_inputs\r\n        self.cuda_inputs = cuda_inputs\r\n        self.host_outputs = host_outputs\r\n        self.cuda_outputs = cuda_outputs\r\n        self.bindings = bindings\r\n\r\n        # Data length\r\n        self.det_output_length = host_outputs[0].shape[0]\r\n        self.seg_output_length = host_outputs[1].shape[0]\r\n        self.seg_w = int(self.input_w / 4)\r\n        self.seg_h = int(self.input_h / 4)\r\n        # Number of proto channels: total length divided by the proto map area (height * width)\r\n        self.seg_c = int(self.seg_output_length / (self.seg_h * self.seg_w))\r\n        self.det_row_output_length = self.seg_c + DET_NUM + POSE_NUM + OBB_NUM\r\n\r\n        # Draw mask\r\n        self.colors_obj = Colors()\r\n\r\n    def infer(self, raw_image_generator):\r\n        threading.Thread.__init__(self)\r\n        # Make self the active context, pushing it on top of the context stack.\r\n        self.ctx.push()\r\n        # Restore\r\n        stream = self.stream\r\n        context = self.context\r\n        host_inputs = self.host_inputs\r\n        cuda_inputs = self.cuda_inputs\r\n        host_outputs = self.host_outputs\r\n        cuda_outputs = self.cuda_outputs\r\n        bindings = self.bindings\r\n        # Do image preprocess\r\n        batch_image_raw = []\r\n        batch_origin_h = []\r\n        batch_origin_w = []\r\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\r\n        for i, image_raw in enumerate(raw_image_generator):\r\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\r\n            batch_image_raw.append(image_raw)\r\n            batch_origin_h.append(origin_h)\r\n            batch_origin_w.append(origin_w)\r\n            np.copyto(batch_input_image[i], input_image)\r\n        
batch_input_image = np.ascontiguousarray(batch_input_image)\r\n\r\n        # Copy input image to host buffer\r\n        np.copyto(host_inputs[0], batch_input_image.ravel())\r\n        start = time.time()\r\n        # Transfer input data  to the GPU.\r\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\r\n        # Run inference.\r\n        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\r\n        # Transfer predictions back from the GPU.\r\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\r\n        cuda.memcpy_dtoh_async(host_outputs[1], cuda_outputs[1], stream)\r\n\r\n        # Synchronize the stream\r\n        stream.synchronize()\r\n        end = time.time()\r\n        # Remove any context from the top of the context stack, deactivating it.\r\n        self.ctx.pop()\r\n        # Here we use the first row of output in that batch_size = 1\r\n        output = host_outputs[0]\r\n        output_proto_mask = host_outputs[1]\r\n        # Do postprocess\r\n        for i in range(self.batch_size):\r\n            result_boxes, result_scores, result_classid, result_proto_coef = self.post_process(\r\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\r\n                batch_origin_w[i]\r\n            )\r\n\r\n            if result_proto_coef.shape[0] == 0:\r\n                continue\r\n            result_masks = self.process_mask(output_proto_mask, result_proto_coef, result_boxes, batch_origin_h[i],\r\n                                             batch_origin_w[i])\r\n\r\n            self.draw_mask(result_masks, colors_=[self.colors_obj(x, True) for x in result_classid],\r\n                           im_src=batch_image_raw[i])\r\n\r\n            # Draw rectangles and labels on the original image\r\n            for j in range(len(result_boxes)):\r\n                box = result_boxes[j]\r\n                plot_one_box(\r\n                    box,\r\n 
                   batch_image_raw[i],\r\n                    label=\"{}:{:.2f}\".format(\r\n                        categories[int(result_classid[j])], result_scores[j]\r\n                    ),\r\n                )\r\n        return batch_image_raw, end - start\r\n\r\n    def destroy(self):\r\n        # Remove any context from the top of the context stack, deactivating it.\r\n        self.ctx.pop()\r\n\r\n    def get_raw_image(self, image_path_batch):\r\n        \"\"\"\r\n        description: Read an image from image path\r\n        \"\"\"\r\n        for img_path in image_path_batch:\r\n            yield cv2.imread(img_path)\r\n\r\n    def get_raw_image_zeros(self, image_path_batch=None):\r\n        \"\"\"\r\n        description: Ready data for warmup\r\n        \"\"\"\r\n        for _ in range(self.batch_size):\r\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\r\n\r\n    def preprocess_image(self, raw_bgr_image):\r\n        \"\"\"\r\n        description: Convert BGR image to RGB,\r\n                     resize and pad it to target size, normalize to [0,1],\r\n                     transform to NCHW format.\r\n        param:\r\n            input_image_path: str, image path\r\n        return:\r\n            image:  the processed image\r\n            image_raw: the original image\r\n            h: original height\r\n            w: original width\r\n        \"\"\"\r\n        image_raw = raw_bgr_image\r\n        h, w, c = image_raw.shape\r\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\r\n        # Calculate widht and height and paddings\r\n        r_w = self.input_w / w\r\n        r_h = self.input_h / h\r\n        if r_h > r_w:\r\n            tw = self.input_w\r\n            th = int(r_w * h)\r\n            tx1 = tx2 = 0\r\n            ty1 = int((self.input_h - th) / 2)\r\n            ty2 = self.input_h - th - ty1\r\n        else:\r\n            tw = int(r_h * w)\r\n            th = self.input_h\r\n            tx1 = 
int((self.input_w - tw) / 2)\r\n            tx2 = self.input_w - tw - tx1\r\n            ty1 = ty2 = 0\r\n        # Resize the image with long side while maintaining ratio\r\n        image = cv2.resize(image, (tw, th))\r\n        # Pad the short side with (128,128,128)\r\n        image = cv2.copyMakeBorder(\r\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\r\n        )\r\n        image = image.astype(np.float32)\r\n        # Normalize to [0,1]\r\n        image /= 255.0\r\n        # HWC to CHW format:\r\n        image = np.transpose(image, [2, 0, 1])\r\n        # CHW to NCHW format\r\n        image = np.expand_dims(image, axis=0)\r\n        # Convert the image to row-major order, also known as \"C order\":\r\n        image = np.ascontiguousarray(image)\r\n        return image, image_raw, h, w\r\n\r\n    def xywh2xyxy(self, origin_h, origin_w, x):\r\n        \"\"\"\r\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\r\n        param:\r\n            origin_h:   height of original image\r\n            origin_w:   width of original image\r\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\r\n        return:\r\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\r\n        \"\"\"\r\n        y = np.zeros_like(x)\r\n        r_w = self.input_w / origin_w\r\n        r_h = self.input_h / origin_h\r\n        if r_h > r_w:\r\n            y[:, 0] = x[:, 0]\r\n            y[:, 2] = x[:, 2]\r\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\r\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\r\n            y /= r_w\r\n        else:\r\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\r\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\r\n            y[:, 1] = x[:, 1]\r\n            y[:, 3] = x[:, 3]\r\n            y /= r_h\r\n\r\n        return 
y\r\n\r\n    def post_process(self, output, origin_h, origin_w):\r\n        \"\"\"\r\n        description: postprocess the prediction\r\n        param:\r\n            output:     A numpy array like [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]\r\n            origin_h:   height of original image\r\n            origin_w:   width of original image\r\n        return:\r\n            result_boxes: final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\r\n            result_scores: final scores, a numpy array, each element is the score corresponding to a box\r\n            result_classid: final class ids, a numpy array, each element is the class id corresponding to a box\r\n        \"\"\"\r\n        # Get the num of boxes detected\r\n        num = int(output[0])\r\n        print(\"There are {} detections \".format(num))\r\n        # Reshape to a two-dimensional ndarray\r\n        pred = np.reshape(output[1:], (-1, self.det_row_output_length))[:num, :]\r\n\r\n        # Do nms\r\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\r\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\r\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\r\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\r\n        result_proto_coef = boxes[:, DET_NUM:int(DET_NUM + SEG_NUM)] if len(boxes) else np.array([])\r\n        return result_boxes, result_scores, result_classid, result_proto_coef\r\n\r\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\r\n        \"\"\"\r\n        description: compute the IoU of two bounding boxes\r\n        param:\r\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\r\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\r\n            x1y1x2y2: select the coordinate format\r\n        return:\r\n            iou: computed iou\r\n        \"\"\"\r\n        if not x1y1x2y2:\r\n            # Transform 
from center and width to exact coordinates\r\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\r\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\r\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\r\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\r\n        else:\r\n            # Get the coordinates of bounding boxes\r\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\r\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\r\n\r\n        # Get the coordinates of the intersection rectangle\r\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\r\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\r\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\r\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\r\n        # Intersection area\r\n        inter_area = (np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None)\r\n                      * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None))\r\n        # Union Area\r\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\r\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\r\n\r\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\r\n\r\n        return iou\r\n\r\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\r\n        \"\"\"\r\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\r\n        Non-Maximum Suppression to further filter detections.\r\n        param:\r\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\r\n            origin_h: original image height\r\n            origin_w: original image width\r\n            conf_thres: a confidence threshold to filter detections\r\n            nms_thres: a iou threshold to filter detections\r\n        return:\r\n       
     boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\r\n        \"\"\"\r\n        # Get the boxes that score > CONF_THRESH\r\n        boxes = prediction[prediction[:, 4] >= conf_thres]\r\n        # Trandform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\r\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\r\n        # clip the coordinates\r\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\r\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\r\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\r\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\r\n        # Object confidence\r\n        confs = boxes[:, 4]\r\n        # Sort by the confs\r\n        boxes = boxes[np.argsort(-confs)]\r\n        # Perform non-maximum suppression\r\n        keep_boxes = []\r\n        while boxes.shape[0]:\r\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\r\n            label_match = boxes[0, 5] == boxes[:, 5]\r\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\r\n            invalid = large_overlap & label_match\r\n            keep_boxes += [boxes[0]]\r\n            boxes = boxes[~invalid]\r\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\r\n        return boxes\r\n\r\n    def sigmoid(self, x):\r\n        return 1 / (1 + np.exp(-x))\r\n\r\n    def scale_mask(self, mask, ih, iw):\r\n        mask = cv2.resize(mask, (self.input_w, self.input_h))\r\n        r_w = self.input_w / (iw * 1.0)\r\n        r_h = self.input_h / (ih * 1.0)\r\n        if r_h > r_w:\r\n            w = self.input_w\r\n            h = int(r_w * ih)\r\n            x = 0\r\n            y = int((self.input_h - h) / 2)\r\n        else:\r\n            w = int(r_h * iw)\r\n            h = self.input_h\r\n            x = int((self.input_w - w) / 2)\r\n            y = 0\r\n        crop = mask[y:y + h, x:x + 
w]\r\n        crop = cv2.resize(crop, (iw, ih))\r\n        return crop\r\n\r\n    def process_mask(self, output_proto_mask, result_proto_coef, result_boxes, ih, iw):\r\n        \"\"\"\r\n        description: Mask pred by yolo11 instance segmentation ,\r\n        param:\r\n            output_proto_mask: prototype mask e.g. (32, 160, 160) for 640x640 input\r\n            result_proto_coef: prototype mask coefficients (n, 32), n represents n results\r\n            result_boxes     :\r\n            ih: rows of original image\r\n            iw: cols of original image\r\n        return:\r\n            mask_result: (n, ih, iw)\r\n        \"\"\"\r\n        result_proto_masks = output_proto_mask.reshape(self.seg_c, self.seg_h, self.seg_w)\r\n        c, mh, mw = result_proto_masks.shape\r\n        print(result_proto_masks.shape)\r\n        print(result_proto_coef.shape)\r\n        masks = self.sigmoid((result_proto_coef @ result_proto_masks.astype(np.float32).reshape(c, -1))).reshape(-1, mh,\r\n                                                                                                                 mw)\r\n\r\n        mask_result = []\r\n        for mask, box in zip(masks, result_boxes):\r\n            mask_s = np.zeros((ih, iw))\r\n            crop_mask = self.scale_mask(mask, ih, iw)\r\n            x1 = int(box[0])\r\n            y1 = int(box[1])\r\n            x2 = int(box[2])\r\n            y2 = int(box[3])\r\n            crop = crop_mask[y1:y2, x1:x2]\r\n            crop = np.where(crop >= 0.5, 1, 0)\r\n            crop = crop.astype(np.uint8)\r\n            mask_s[y1:y2, x1:x2] = crop\r\n\r\n            mask_result.append(mask_s)\r\n        mask_result = np.array(mask_result)\r\n        return mask_result\r\n\r\n    def draw_mask(self, masks, colors_, im_src, alpha=0.5):\r\n        \"\"\"\r\n        description: Draw mask on image ,\r\n        param:\r\n            masks  : result_mask\r\n            colors_: color to draw mask\r\n            im_src : original 
image\r\n            alpha  : scale between original  image and mask\r\n        return:\r\n            no return\r\n        \"\"\"\r\n        if len(masks) == 0:\r\n            return\r\n        masks = np.asarray(masks, dtype=np.uint8)\r\n        masks = np.ascontiguousarray(masks.transpose(1, 2, 0))\r\n        masks = np.asarray(masks, dtype=np.float32)\r\n        colors_ = np.asarray(colors_, dtype=np.float32)\r\n        s = masks.sum(2, keepdims=True).clip(0, 1)\r\n        masks = (masks @ colors_).clip(0, 255)\r\n        im_src[:] = masks * alpha + im_src * (1 - s * alpha)\r\n\r\n\r\nclass inferThread(threading.Thread):\r\n    def __init__(self, yolo11_wrapper, image_path_batch):\r\n        threading.Thread.__init__(self)\r\n        self.yolo11_wrapper = yolo11_wrapper\r\n        self.image_path_batch = image_path_batch\r\n\r\n    def run(self):\r\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image(self.image_path_batch))\r\n        for i, img_path in enumerate(self.image_path_batch):\r\n            parent, filename = os.path.split(img_path)\r\n            save_name = os.path.join('output', filename)\r\n            # Save image\r\n            cv2.imwrite(save_name, batch_image_raw[i])\r\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\r\n\r\n\r\nclass warmUpThread(threading.Thread):\r\n    def __init__(self, yolo11_wrapper):\r\n        threading.Thread.__init__(self)\r\n        self.yolo11_wrapper = yolo11_wrapper\r\n\r\n    def run(self):\r\n        batch_image_raw, use_time = self.yolo11_wrapper.infer(self.yolo11_wrapper.get_raw_image_zeros())\r\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\r\n\r\n\r\nclass Colors:\r\n    def __init__(self):\r\n        hexs = ('FF95C8', 'FF3838', 'FF9D97', 'FF701F', 'FFB21D', 'CFD231', '48F90A',\r\n                '92CC17', '3DDB86', '1A9334', '00D4BB', '2C99A8', 
'00C2FF',\r\n                '344593', '6473FF', '0018EC', '8438FF', '520085', 'CB38FF',\r\n                'FF37C7')\r\n        self.palette = [self.hex2rgb(f'#{c}') for c in hexs]\r\n        self.n = len(self.palette)\r\n\r\n    def __call__(self, i, bgr=False):\r\n        c = self.palette[int(i) % self.n]\r\n        return (c[2], c[1], c[0]) if bgr else c\r\n\r\n    @staticmethod\r\n    def hex2rgb(h):  # rgb order (PIL)\r\n        return tuple(int(h[1 + i:1 + i + 2], 16) for i in (0, 2, 4))\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    # load custom plugin and engine\r\n    PLUGIN_LIBRARY = 'build/libmyplugins.so'\r\n    engine_file_path = \"build/yolov12n-seg-4.engine\"\r\n\r\n    if len(sys.argv) > 1:\r\n        engine_file_path = sys.argv[1]\r\n    if len(sys.argv) > 2:\r\n        PLUGIN_LIBRARY = sys.argv[2]\r\n\r\n    ctypes.CDLL(PLUGIN_LIBRARY)\r\n\r\n    # load coco labels\r\n    categories = [\"QT\", \"CT\", \"VT\", \"XT\"]\r\n\r\n    # categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\r\n    #               \"traffic light\",\r\n    #               \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\",\r\n    #               \"cow\", \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\r\n    #               \"frisbee\",\r\n    #               \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\r\n    #               \"surfboard\",\r\n    #               \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\r\n    #               \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\r\n    #               \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", 
\"mouse\", \"remote\", \"keyboard\",\r\n    #               \"cell phone\",\r\n    #               \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\r\n    #               \"teddy bear\",\r\n    #               \"hair drier\", \"toothbrush\"]\r\n\r\n    if os.path.exists('output/'):\r\n        shutil.rmtree('output/')\r\n    os.makedirs('output/')\r\n    # create a YoLo12TRT instance\r\n    yolov12_wrapper = YoLo12TRT(engine_file_path)\r\n    try:\r\n        print('batch size is', yolov12_wrapper.batch_size)\r\n\r\n        image_dir = \"images\"\r\n        image_path_batches = get_img_path_batches(yolov12_wrapper.batch_size, image_dir)\r\n\r\n        for i in range(10):\r\n            # create a new thread to do warm_up\r\n            thread1 = warmUpThread(yolov12_wrapper)\r\n            thread1.start()\r\n            thread1.join()\r\n        for batch in image_path_batches:\r\n            # create a new thread to do inference\r\n            thread1 = inferThread(yolov12_wrapper, batch)\r\n            thread1.start()\r\n            thread1.join()\r\n    finally:\r\n        # destroy the instance\r\n        yolov12_wrapper.destroy()\r\n"
  },
  {
    "path": "yolov13/CMakeLists.txt",
    "content": "\n\n\ncmake_minimum_required(VERSION 3.10)\n\nproject(yolov13)\n\n# Set up environment-based paths for CUDA and TensorRT\nif(DEFINED ENV{CUDA_HOME})\n  set(CUDA_TOOLKIT_ROOT_DIR $ENV{CUDA_HOME})\nelse()\n  set(CUDA_TOOLKIT_ROOT_DIR \"/usr/local/cuda\")\nendif()\n\nif(DEFINED ENV{TENSORRT_DIR})\n  set(TENSORRT_ROOT $ENV{TENSORRT_DIR})\nelse()\n  set(TENSORRT_ROOT \"/opt/TensorRT-8.6.1.6\")\nendif()\n\nmessage(STATUS \"Using CUDA from: ${CUDA_TOOLKIT_ROOT_DIR}\")\nmessage(STATUS \"Using TensorRT from: ${TENSORRT_ROOT}\")\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nset(CMAKE_CUDA_COMPILER ${CUDA_TOOLKIT_ROOT_DIR}/bin/nvcc)\nenable_language(CUDA)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\ninclude_directories(${PROJECT_SOURCE_DIR}/plugin)\n\n# CUDA and TensorRT configuration\nif(CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n  message(\"embed_platform on\")\n  include_directories(${CUDA_TOOLKIT_ROOT_DIR}/targets/aarch64-linux/include)\n  link_directories(${CUDA_TOOLKIT_ROOT_DIR}/targets/aarch64-linux/lib)\n  include_directories(${TENSORRT_ROOT}/include)\n  link_directories(${TENSORRT_ROOT}/lib)\nelse()\n  message(\"embed_platform off\")\n  include_directories(${CUDA_TOOLKIT_ROOT_DIR}/include)\n  link_directories(${CUDA_TOOLKIT_ROOT_DIR}/lib64)\n  include_directories(${TENSORRT_ROOT}/include)\n  link_directories(${TENSORRT_ROOT}/lib)\nendif()\n\nadd_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/plugin/yololayer.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV REQUIRED)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nfile(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/src/*.cpp ${PROJECT_SOURCE_DIR}/src/*.cu)\n\nadd_executable(yolov13-det ${PROJECT_SOURCE_DIR}/yolov13_det.cpp ${SRCS})\ntarget_link_libraries(yolov13-det nvinfer)\ntarget_link_libraries(yolov13-det cudart)\ntarget_link_libraries(yolov13-det 
myplugins)\ntarget_link_libraries(yolov13-det ${OpenCV_LIBS})\n"
  },
  {
    "path": "yolov13/gen_wts.py",
    "content": "import sys  # noqa: F401\nimport argparse\nimport os\nimport struct\nimport torch\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\n    parser.add_argument('-w', '--weights', required=True,\n                        help='Input weights (.pt) file path (required)')\n    parser.add_argument(\n        '-o', '--output', help='Output (.wts) file path (optional)')\n\n    args = parser.parse_args()\n    if not os.path.isfile(args.weights):\n        raise SystemExit('Invalid input file')\n    if not args.output:\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\n    elif os.path.isdir(args.output):\n        args.output = os.path.join(\n            args.output,\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\n    return args.weights, args.output\n\n\npt_file, wts_file = parse_args()\n\nprint('Generating .wts for detection model')\n\n# Load model\nprint(f'Loading {pt_file}')\n\n# Initialize\ndevice = 'cpu'\n\n# Load model\nmodel = torch.load(pt_file, map_location=device, weights_only=False)['model'].float()  # load to FP32\n\n# Anchor handling for detection model\nanchor_grid = model.model[-1].anchors * model.model[-1].stride[..., None, None]\ndelattr(model.model[-1], 'anchors')\n\nmodel.to(device).eval()\n\nwith open(wts_file, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n\n# python3 gen_wts.py -w your_model.pt -o output_name.wts\n"
  },
  {
    "path": "yolov13/include/block.h",
    "content": "#pragma once\n\n#include <map>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n\nusing namespace std;\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file);\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps);\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, std::vector<int> k, int s, std::string lname, int p = 0, int g = 1,\n                                        int d = 1);\n\nnvinfer1::ILayer* Conv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                       nvinfer1::ITensor& input, int c_out, std::string lname, int k = 1, int s = 1, int padding = 0,\n                       int g = 1, bool act = true);\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname);\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> dets, const int* px_arry,\n                                       int px_arry_num);\n\nnvinfer1::IElementWiseLayer* C3k(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c2,\n                                 std::string lname, int n = 1, bool shortcut = true, int g = 1, float e = 0.5,\n                                 int k = 
3);\n\nnvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c2,\n                                  int n, std::string lname, bool c3k = false, float e = 0.5, int g = 1,\n                                  bool shortcut = true);\n\nnvinfer1::ILayer* AAttn(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int dim, int num_heads, std::string lname, int area = 1);\n\nnvinfer1::ILayer* DWConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname);\n\nnvinfer1::IElementWiseLayer* ABlock(nvinfer1::INetworkDefinition* network,\n                                    std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int dim, int num_heads, std::string lname, float mlp_ratio = 1.2, int area = 1);\n\nnvinfer1::ILayer* A2C2f(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>,\n                        nvinfer1::ITensor& input, int c2, int n, std::string lname, bool a2 = true, int area = 1,\n                        bool residual = false, float mlp_ratio = 2.0, float e = 0.5, int g = 1, bool shortcut = true);\n\nnvinfer1::IElementWiseLayer* DSConv(nvinfer1::INetworkDefinition* network,\n                                    std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int c_in, int c_out, std::string lname, int k = 3, int s = 1, int p = 0, int d = 1,\n                                    bool bias = false);\n\nnvinfer1::ILayer* DSBottleneck(nvinfer1::INetworkDefinition* network,\n                               std::map<std::string, nvinfer1::Weights> weightMap, 
nvinfer1::ITensor& input, int c1,\n                               int c2, std::string lname, bool shortcut = true, float e = 0.5, int k1 = 3, int k2 = 5,\n                               int d2 = 1);\n\nnvinfer1::ILayer* DSC3k(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int c2, int n, std::string lname, bool shortcut = true, int g = 1,\n                        float e = 0.5, int k1 = 3, int k2 = 5, int d2 = 1);\n\nnvinfer1::ILayer* DSC3K2(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int c2, std::string lname, int n = 1, bool dsc3k = false,\n                         float e = 0.5, int g = 1, bool shortcut = true, int k1 = 3, int k2 = 7, int d2 = 1);\n\nnvinfer1::ILayer* FuseModule(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             std::vector<nvinfer1::ITensor*>& input, int c_in, bool channel_adjust, std::string lname);\n\n// nvinfer1::ILayer* FuseModule(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n//                              std::vector<nvinfer1::ITensor*>input, int c_in, bool channel_adjust, std::string lname);\n\nnvinfer1::ISoftMaxLayer* AdaHyperedgeGen(nvinfer1::INetworkDefinition* network,\n                                         std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                         int node_dim, int num_hyperedges, std::string lname, int num_heads = 4,\n                                         std::string context = \"both\");\n\nnvinfer1::IElementWiseLayer* GELU(nvinfer1::INetworkDefinition* network, nvinfer1::ITensor& input);\n\nnvinfer1::IElementWiseLayer* AdaHGConv(nvinfer1::INetworkDefinition* network,\n                                       std::map<std::string, 
nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                       int embed_dim, std::string lname, int num_hyperedges = 16, int num_heads = 4,\n                                       std::string context = \"both\");\n\nnvinfer1::IShuffleLayer* AdaHGComputation(nvinfer1::INetworkDefinition* network,\n                                          std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                          int embed_dim, std::string lname, int num_hyperedges = 16, int num_heads = 8,\n                                          std::string context = \"both\");\n\nnvinfer1::ILayer* C3AH(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                       nvinfer1::ITensor& input, int c2, std::string lname, float e = 1.0, int num_hyperedges = 8,\n                       std::string context = \"both\");\n\nnvinfer1::ILayer* HyperACE(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                           std::vector<nvinfer1::ITensor*> input, int c1, int c2, std::string lname, int n = 1,\n                           int num_hyperedges = 8, bool dsc3k = false, bool shortcut = false, float e1 = 0.5,\n                           float e2 = 1, std::string context = \"both\", bool channel_adjust = true);\n\nnvinfer1::IElementWiseLayer* FullPad_Tunnel(nvinfer1::INetworkDefinition* network,\n                                            std::map<std::string, nvinfer1::Weights> weightMap,\n                                            std::vector<nvinfer1::ITensor*> input, std::string lname);\n\nnvinfer1::ILayer* DownsampleConv(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                 int in_channels, std::string lname, bool channel_adjust = true);\n\nvoid cout_dim(nvinfer1::ITensor& 
input);\n"
  },
  {
    "path": "yolov13/include/calibrator.h",
    "content": "#ifndef ENTROPY_CALIBRATOR_H\n#define ENTROPY_CALIBRATOR_H\n\n#include <NvInfer.h>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2 {\n   public:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name,\n                           const char* input_blob_name, bool read_cache = true);\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\n   private:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\n#endif  // ENTROPY_CALIBRATOR_H\n"
  },
  {
    "path": "yolov13/include/config.h",
    "content": "#define USE_FP16\n// #define USE_FP32\n// #define USE_INT8\n\nconst static char* kInputTensorName = \"images\";\nconst static char* kOutputTensorName = \"output\";\nconst static int kNumClass = 80;\nconst static int kBatchSize = 1;\nconst static int kGpuId = 0;\nconst static int kInputH = 640;\nconst static int kInputW = 640;\nconst static float kNmsThresh = 0.45f;\nconst static float kConfThresh = 0.5f;\nconst static int kMaxInputImageSize = 3000 * 3000;\nconst static int kMaxNumOutputBbox = 1000;\n//Quantization input image folder path\nconst static char* kInputQuantizationFolder = \"./tensorrtx-int8calib-data/coco_calib\";\n"
  },
  {
    "path": "yolov13/include/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n#include <cassert>\n#include <iostream>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n"
  },
  {
    "path": "yolov13/include/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n 
   virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov13/include/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include \"NvInfer.h\"\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "yolov13/include/model.h",
    "content": "#pragma once\n\n#include <assert.h>\n#include <string>\n#include \"NvInfer.h\"\n\nnvinfer1::IHostMemory* buildEngineYolov13Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             int& max_channels, std::string& type);\n\nnvinfer1::IHostMemory* buildEngineYolov13Det_debug(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                                   nvinfer1::DataType dt, const std::string& wts_path, float& gd,\n                                                   float& gw, int& max_channels, std::string& type);\n"
  },
  {
    "path": "yolov13/include/postprocess.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\n// Preprocessing functions\ncv::Rect get_rect(cv::Mat& img, float bbox[4]);\n\n// Processing functions\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch);\n\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count);\n\n// NMS functions\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh = 0.5);\n\nvoid batch_nms(std::vector<std::vector<Detection>>& batch_res, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh = 0.5);\n\n// CUDA-related functions\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 cudaStream_t stream);\n\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream);\n\n// Drawing functions\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n"
  },
  {
    "path": "yolov13/include/preprocess.h",
    "content": "#pragma once\n\n#include <map>\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\nvoid cuda_preprocess_init(int max_image_size);\n\nvoid cuda_preprocess_destroy();\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream);\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream);\n"
  },
  {
    "path": "yolov13/include/types.h",
    "content": "#pragma once\n#include \"config.h\"\n\nstruct alignas(float) Detection {\n    //center_x center_y w h\n    float bbox[4];\n    float conf;  // bbox_conf * cls_conf\n    float class_id;\n};\n\nstruct AffineMatrix {\n    float value[6];\n};\n\nconst int bbox_element =\n        sizeof(AffineMatrix) / sizeof(float) + 1;  // left, top, right, bottom, confidence, class, keepflag\n"
  },
  {
    "path": "yolov13/include/utils.h",
    "content": "#pragma once\n#include <dirent.h>\n#include <cstring>\n#include <fstream>\n#include <opencv2/opencv.hpp>\n#include <sstream>\n#include <string>\n#include <unordered_map>\n#include <vector>\n\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols * 1.0);\n    float r_h = input_h / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR* p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            //            std::cout << \"Found file: \" << cur_file_name << std::endl;\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n// Trims leading and trailing spaces (' ' only) from a string, despite the function's name\nstatic inline std::string trim_leading_whitespace(const std::string& str) {\n    size_t first = str.find_first_not_of(' ');\n    if (std::string::npos == first) {\n        return str;\n    }\n    size_t last = str.find_last_not_of(' ');\n    return str.substr(first, (last - first + 1));\n}\n\n// Src: https://stackoverflow.com/questions/16605967\nstatic inline std::string 
to_string_with_precision(const float a_value, const int n = 2) {\n    std::ostringstream out;\n    out.precision(n);\n    out << std::fixed << a_value;\n    return out.str();\n}\n\nstatic inline int read_labels(const std::string& labels_filename, std::unordered_map<int, std::string>& labels_map) {\n    std::ifstream file(labels_filename);\n    // Report failure if the labels file cannot be opened\n    if (!file.is_open()) {\n        return -1;\n    }\n    // Read each line of the file\n    std::string line;\n    int index = 0;\n    while (std::getline(file, line)) {\n        // Strip the line of any leading or trailing spaces\n        line = trim_leading_whitespace(line);\n\n        // Add the stripped line to the labels_map, using the loop index as the key\n        labels_map[index] = line;\n        index++;\n    }\n    // Close the file\n    file.close();\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov13/plugin/geluKernel.cu",
    "content": "/*\n * SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n * SPDX-License-Identifier: Apache-2.0\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#include <cuda.h>\n#if CUDA_VERSION >= 10010\n\n#include <cstring>\n#include <vector>\n\n#include \"NvInfer.h\"\n#include \"common/bertCommon.h\"\n#include \"common/common.cuh\"\n#include \"common/serialize.hpp\"\n#include \"geluPlugin.h\"\n\nusing namespace nvinfer1;\n\nnamespace nvinfer1 {\nnamespace plugin {\nnamespace bert {\n\n// constants for approximating the normal cdf\nconstexpr float A = 0.5f;\nconstexpr float B = 0.7978845608028654f;    // sqrt(2.0/M_PI)\nconstexpr float C = 0.035677408136300125f;  // 0.044715 * sqrt(2.0/M_PI)\n\ntemplate <typename T, unsigned TPB>\n__global__ void geluKernel(const T a, const T b, const T c, int n, const T* input, T* output) {\n    const int idx = blockIdx.x * TPB + threadIdx.x;\n\n    if (idx < n) {\n        const T in = input[idx];\n        const T cdf = a + a * tanh(in * (c * in * in + b));\n        output[idx] = in * cdf;\n    }\n}\n\nint computeGelu(cudaStream_t stream, int n, const float* input, float* output) {\n    constexpr int blockSize = 256;\n    const int gridSize = (n + blockSize - 1) / blockSize;\n    geluKernel<float, blockSize><<<gridSize, blockSize, 0, stream>>>(A, B, C, n, input, output);\n\n    PLUGIN_CHECK(cudaPeekAtLastError());\n    return 0;\n}\n\nint 
computeGelu(cudaStream_t stream, int n, const half* input, half* output) {\n    constexpr int blockSize = 256;\n\n    if (0 == (n & 1)) {\n        const int n2 = n / 2;\n\n        const int gridSize = (n2 + blockSize - 1) / blockSize;\n        const half2 A2 = __floats2half2_rn(A, A);\n        const half2 B2 = __floats2half2_rn(B, B);\n        const half2 C2 = __floats2half2_rn(C, C);\n        const half2* input2 = reinterpret_cast<const half2*>(input);\n        half2* output2 = reinterpret_cast<half2*>(output);\n        geluKernel<half2, blockSize><<<gridSize, blockSize, 0, stream>>>(A2, B2, C2, n2, input2, output2);\n    } else {\n        const int gridSize = (n + blockSize - 1) / blockSize;\n        geluKernel<half, blockSize><<<gridSize, blockSize, 0, stream>>>(A, B, C, n, input, output);\n    }\n\n    PLUGIN_CHECK(cudaPeekAtLastError());\n    return 0;\n}\n\ntemplate <typename T, int TPB>\n__global__ void geluBiasKernel(const T a, const T b, const T c, T* output, const T* input, const T* bias,\n                               const int ld) {\n\n    const int offset = blockIdx.x * ld;\n\n    for (int it = threadIdx.x; it < ld; it += TPB) {\n        const int idx = it + offset;\n        const T in = input[idx] + bias[it];\n        const T cdf = a + a * tanh(in * (c * in * in + b));\n        output[idx] = in * cdf;\n    }\n}\n\nint computeGeluBias(float* output, const float* input, const float* bias, const int ld, const int cols,\n                    cudaStream_t stream) {\n    geluBiasKernel<float, 256><<<cols, 256, 0, stream>>>(A, B, C, output, input, bias, ld);\n    return cudaPeekAtLastError();\n}\n\nint computeGeluBias(half* output, const half* input, const half* bias, const int ld, const int cols,\n                    cudaStream_t stream) {\n    if (ld & 1) {\n        geluBiasKernel<half, 256><<<cols, 256, 0, stream>>>(A, B, C, output, input, bias, ld);\n    } else {\n\n        const half2 A2 = __floats2half2_rn(A, A);\n        const half2 B2 = 
__floats2half2_rn(B, B);\n        const half2 C2 = __floats2half2_rn(C, C);\n        const int ld2 = ld / 2;\n        const half2* input2 = reinterpret_cast<const half2*>(input);\n        const half2* bias2 = reinterpret_cast<const half2*>(bias);\n        half2* output2 = reinterpret_cast<half2*>(output);\n        geluBiasKernel<half2, 256><<<cols, 256, 0, stream>>>(A2, B2, C2, output2, input2, bias2, ld2);\n    }\n\n    return cudaPeekAtLastError();\n}\n\n}  // namespace bert\n}  // namespace plugin\n}  // namespace nvinfer1\n#endif  // CUDA_VERSION >= 10010\n"
  },
  {
    "path": "yolov13/plugin/yololayer.cu",
    "content": "#include <assert.h>\n#include <math.h>\n#include <iostream>\n#include <vector>\n#include \"cuda_utils.h\"\n#include \"types.h\"\n#include \"yololayer.h\"\n\nnamespace Tn {\ntemplate <typename T>\nvoid write(char*& buffer, const T& val) {\n    *reinterpret_cast<T*>(buffer) = val;\n    buffer += sizeof(T);\n}\n\ntemplate <typename T>\nvoid read(const char*& buffer, T& val) {\n    val = *reinterpret_cast<const T*>(buffer);\n    buffer += sizeof(T);\n}\n}  // namespace Tn\n\n__device__ float sigmoid(float x) {\n    return 1.0f / (1.0f + exp(-x));\n}\n\nnamespace nvinfer1 {\nYoloLayerPlugin::YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const int* strides,\n                                 int stridesLength) {\n    mClassCount = classCount;\n    mYoloV13NetWidth = netWidth;\n    mYoloV13netHeight = netHeight;\n    mMaxOutObject = maxOut;\n    mStridesLength = stridesLength;\n    mStrides = new int[stridesLength];\n    memcpy(mStrides, strides, stridesLength * sizeof(int));\n}\n\nYoloLayerPlugin::~YoloLayerPlugin() {\n    if (mStrides != nullptr) {\n        delete[] mStrides;\n        mStrides = nullptr;\n    }\n}\n\nYoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length) {\n    using namespace Tn;\n    const char *d = reinterpret_cast<const char*>(data), *a = d;\n    read(d, mClassCount);\n    read(d, mThreadCount);\n    read(d, mYoloV13NetWidth);\n    read(d, mYoloV13netHeight);\n    read(d, mMaxOutObject);\n    read(d, mStridesLength);\n    mStrides = new int[mStridesLength];\n    for (int i = 0; i < mStridesLength; ++i) {\n        read(d, mStrides[i]);\n    }\n\n    assert(d == a + length);\n}\n\nvoid YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT {\n    using namespace Tn;\n    char *d = static_cast<char*>(buffer), *a = d;\n    write(d, mClassCount);\n    write(d, mThreadCount);\n    write(d, mYoloV13NetWidth);\n    write(d, mYoloV13netHeight);\n    write(d, mMaxOutObject);\n    write(d, 
mStridesLength);\n    for (int i = 0; i < mStridesLength; ++i) {\n        write(d, mStrides[i]);\n    }\n\n    assert(d == a + getSerializationSize());\n}\n\nsize_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT {\n    return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mYoloV13netHeight) + sizeof(mYoloV13NetWidth) +\n           sizeof(mMaxOutObject) + sizeof(mStridesLength) + sizeof(int) * mStridesLength;\n}\n\nint YoloLayerPlugin::initialize() TRT_NOEXCEPT {\n    return 0;\n}\n\nnvinfer1::Dims YoloLayerPlugin::getOutputDimensions(int index, const nvinfer1::Dims* inputs,\n                                                    int nbInputDims) TRT_NOEXCEPT {\n    int total_size = mMaxOutObject * sizeof(Detection) / sizeof(float);\n    return nvinfer1::Dims3(total_size + 1, 1, 1);\n}\n\nvoid YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT {\n    mPluginNamespace = pluginNamespace;\n}\n\nconst char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT {\n    return mPluginNamespace;\n}\n\nnvinfer1::DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes,\n                                                      int nbInputs) const TRT_NOEXCEPT {\n    return nvinfer1::DataType::kFLOAT;\n}\n\nbool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                                   int nbInputs) const TRT_NOEXCEPT {\n    return false;\n}\n\nbool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT {\n    return false;\n}\n\nvoid YoloLayerPlugin::configurePlugin(nvinfer1::PluginTensorDesc const* in, int nbInput,\n                                      nvinfer1::PluginTensorDesc const* out, int nbOutput) TRT_NOEXCEPT{};\n\nvoid YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                                      IGpuAllocator* gpuAllocator) 
TRT_NOEXCEPT{};\n\nvoid YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\nconst char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT {\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nvoid YoloLayerPlugin::destroy() TRT_NOEXCEPT {\n    delete this;\n}\n\nnvinfer1::IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT {\n    YoloLayerPlugin* p = new YoloLayerPlugin(mClassCount, mYoloV13NetWidth, mYoloV13netHeight, mMaxOutObject, mStrides,\n                                             mStridesLength);\n    p->setPluginNamespace(mPluginNamespace);\n    return p;\n}\n\nint YoloLayerPlugin::enqueue(int batchSize, const void* TRT_CONST_ENQUEUE* inputs, void* const* outputs,\n                             void* workspace, cudaStream_t stream) TRT_NOEXCEPT {\n    forwardGpu((const float* const*)inputs, (float*)outputs[0], stream, mYoloV13netHeight, mYoloV13NetWidth, batchSize);\n    return 0;\n}\n\n__device__ float Logist(float data) {\n    return 1.0f / (1.0f + expf(-data));\n};\n\n__global__ void CalDetection(const float* input, float* output, int numElements, int maxoutobject, const int grid_h,\n                             int grid_w, const int stride, int classes, int outputElem) {\n    int idx = threadIdx.x + blockDim.x * blockIdx.x;\n    if (idx >= numElements)\n        return;\n\n    int total_grid = grid_h * grid_w;\n    int info_len = 4 + classes;\n    int batchIdx = idx / total_grid;\n    int elemIdx = idx % total_grid;\n    const float* curInput = input + batchIdx * total_grid * info_len;\n    int outputIdx = batchIdx * outputElem;\n\n    int class_id = 0;\n    float max_cls_prob = 0.0;\n    for (int i = 4; i < 4 + classes; i++) {\n        float p = Logist(curInput[elemIdx + i * total_grid]);\n        if (p > max_cls_prob) {\n            max_cls_prob = p;\n            class_id = i - 4;\n        }\n    }\n\n    if (max_cls_prob < 0.1)\n        return;\n\n    int 
count = (int)atomicAdd(output + outputIdx, 1);\n    // Drop detections beyond the per-image capacity before touching the output buffer\n    if (count >= maxoutobject)\n        return;\n\n    char* data = (char*)(output + outputIdx) + sizeof(float) + count * sizeof(Detection);\n    Detection* det = (Detection*)(data);\n\n    int row = elemIdx / grid_w;\n    int col = elemIdx % grid_w;\n\n    det->conf = max_cls_prob;\n    det->class_id = class_id;\n    det->bbox[0] = (col + 0.5f - curInput[elemIdx + 0 * total_grid]) * stride;\n    det->bbox[1] = (row + 0.5f - curInput[elemIdx + 1 * total_grid]) * stride;\n    det->bbox[2] = (col + 0.5f + curInput[elemIdx + 2 * total_grid]) * stride;\n    det->bbox[3] = (row + 0.5f + curInput[elemIdx + 3 * total_grid]) * stride;\n}\n\nvoid YoloLayerPlugin::forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV13netHeight,\n                                 int mYoloV13NetWidth, int batchSize) {\n    int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n    // Reset the detection counter (first float) of every image in the batch\n    for (int idx = 0; idx < batchSize; ++idx) {\n        CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n    }\n    int numElem = 0;\n\n    int maxGrids = mStridesLength;\n    int flatGridsLen = 2 * maxGrids;\n    int* flatGrids = new int[flatGridsLen];\n\n    for (int i = 0; i < maxGrids; ++i) {\n        flatGrids[2 * i] = mYoloV13netHeight / mStrides[i];\n        flatGrids[2 * i + 1] = mYoloV13NetWidth / mStrides[i];\n    }\n\n    for (int i = 0; i < maxGrids; i++) {\n        // Access the elements of the original 2D array from the flattened 1D array\n        int grid_h = flatGrids[2 * i];      // Corresponds to the access of grids[i][0]\n        int grid_w = flatGrids[2 * i + 1];  // Corresponds to the access of grids[i][1]\n        int stride = mStrides[i];\n        numElem = grid_h * grid_w * batchSize;  // Calculate the total number of elements\n        if (numElem < mThreadCount)             // Adjust the thread count 
if needed\n            mThreadCount = numElem;\n\n        CalDetection<<<(numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>(\n                inputs[i], output, numElem, mMaxOutObject, grid_h, grid_w, stride, mClassCount, outputElem);\n    }\n\n    delete[] flatGrids;\n}\n\nPluginFieldCollection YoloPluginCreator::mFC{};\nstd::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\nYoloPluginCreator::YoloPluginCreator() {\n    mPluginAttributes.clear();\n    mFC.nbFields = mPluginAttributes.size();\n    mFC.fields = mPluginAttributes.data();\n}\n\nconst char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT {\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nconst PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT {\n    return &mFC;\n}\n\nIPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT {\n    assert(fc->nbFields == 1);\n    assert(strcmp(fc->fields[0].name, \"combinedInfo\") == 0);\n    const int* combinedInfo = static_cast<const int*>(fc->fields[0].data);\n\n    // Clean packed layout: class_num, input_w, input_h, max_out\n    int class_count = combinedInfo[0];\n    int input_w = combinedInfo[1];\n    int input_h = combinedInfo[2];\n    int max_output_object_count = combinedInfo[3];\n    int stride_offset = 4;\n\n    const int* px_arry = combinedInfo + stride_offset;\n    int px_arry_length = fc->fields[0].length - stride_offset;\n\n    YoloLayerPlugin* obj =\n            new YoloLayerPlugin(class_count, input_w, input_h, max_output_object_count, px_arry, px_arry_length);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\nIPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData,\n                                                     size_t serialLength) TRT_NOEXCEPT {\n    YoloLayerPlugin* obj = new 
YoloLayerPlugin(serialData, serialLength);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolov13/plugin/yololayer.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\nnamespace nvinfer1 {\nclass API YoloLayerPlugin : public IPluginV2IOExt {\n   public:\n    YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const int* strides, int stridesLength);\n\n    YoloLayerPlugin(const void* data, size_t length);\n\n    ~YoloLayerPlugin();\n\n    int getNbOutputs() const TRT_NOEXCEPT override { return 1; }\n\n    nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n    int initialize() TRT_NOEXCEPT override;\n\n    virtual void terminate() TRT_NOEXCEPT override {}\n\n    virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n    virtual int enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace,\n                        cudaStream_t stream) TRT_NOEXCEPT override;\n\n    virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n    virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n    bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs,\n                                   int nbOutputs) const TRT_NOEXCEPT override {\n        return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n    }\n\n    const char* getPluginType() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    void destroy() TRT_NOEXCEPT override;\n\n    IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n    nvinfer1::DataType getOutputDataType(int32_t index, nvinfer1::DataType const* inputTypes,\n                                         int32_t nbInputs) const 
TRT_NOEXCEPT;\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                      int nbInputs) const TRT_NOEXCEPT override;\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n    void attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                         IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n    void configurePlugin(PluginTensorDesc const* in, int32_t nbInput, PluginTensorDesc const* out,\n                         int32_t nbOutput) TRT_NOEXCEPT override;\n\n    void detachFromContext() TRT_NOEXCEPT override;\n\n   private:\n    void forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV13netHeight,\n                    int mYoloV13NetWidth, int batchSize);\n\n    int mThreadCount = 256;\n    const char* mPluginNamespace;\n    int mClassCount;\n    // Removed non-detection members\n    int mYoloV13netHeight;\n    int mYoloV13NetWidth;\n    int mMaxOutObject;\n    int* mStrides;\n    int mStridesLength;\n};\n\nclass API YoloPluginCreator : public IPluginCreator {\n   public:\n    YoloPluginCreator();\n\n    ~YoloPluginCreator() override = default;\n\n    const char* getPluginName() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    const nvinfer1::PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* createPlugin(const char* name,\n                                           const nvinfer1::PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData,\n                                                size_t serialLength) TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override { mNamespace = libNamespace; }\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override { 
return mNamespace.c_str(); }\n\n   private:\n    std::string mNamespace;\n    static PluginFieldCollection mFC;\n    static std::vector<PluginField> mPluginAttributes;\n};\n\nREGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolov13/readme.md",
    "content": "## Introduction\n\nYolov13 model supports TensorRT-8.\n\nDetection training code [link](https://github.com/iMoonLab/yolov13/releases/tag/yolov13)\n\n\n## Environment\n\n* cuda 11.6\n* cudnn 8.9.1.23\n* tensorrt 8.6.1.6\n* opencv 4.8.0\n* ultralytics 8.3.63\n\n## Support\n\n* [x] YOLOV13-det support FP32/FP16/INT8 and C++ API\n\n\n## Config\n\n* Choose the YOLOV13 sub-model n/s/l/x from command line arguments.\n* Other configs please check [include/config.h](include/config.h)\n\n## Build and Run (Detection)\n\n1. generate .wts from pytorch with .pt, or download .wts from model zoo\n\n```shell\n# Download ultralytics\nwget https://github.com/iMoonLab/yolov13/releases/tag/yolov13 -O ultralytics-8.3.63.zip\n# Unzip ultralytics\nunzip ultralytics-8.3.63.zip\ncd ultralytics-8.3.63\n# Training your ownself models\nto download other models, replace 'yolov13n.pt' with 'yolov13s.pt', 'yolov13l.pt', or 'yolov13x.pt'\n# Generate .wts\ncp [PATH-TO-TENSORRTX]/yolov13/gen_wts.py .\npython3 gen_wts.py -w yolov13n.pt -o yolov13n.wts\n# A file 'yolov13n.wts' will be generated.\n```\n\n2. build tensorrtx/yolov13 and run\n```shell\ncd [PATH-TO-TENSORRTX]/yolov13\nmkdir build\ncd build\ncmake ..\nmake\n```\n\n\n\n### Detection\n```shell\ncp [PATH-TO-ultralytics]/yolov13n.wts .\n# Build and serialize TensorRT engine\n./yolov13-det -s yolov13n.wts yolov13n-det.engine [n/s/l/x]\n# Run inference\n./yolov13-det -d yolov13n-det.engine ../images [c/g]\n# results saved in build directory\n```\n\n## INT8 Quantization\n1. Prepare calibration images, you can randomly select 1000s images from your train set.\n     For coco, you can also download the calibration images `coco_calib` from\n     [GoogleDrive](https://drive.google.com/drive/folders/1s7jE9DtOngZMzJC1uL307J2MiaGwdRSI?usp=sharing)\n     or [BaiduPan](https://pan.baidu.com/s/1GOm_-JobpyLMAqZWCDUhKg) pwd: a9wh\n2. unzip it in [PATH-TO-TENSORRTX]/yolov13/build\n3. 
set the macro `USE_INT8` in include/config.h and run `make` again\n4. serialize the model and test\n\nBuilt successfully on a 4060 GPU.\n\n## More Information\nSee the readme on the [home page](https://github.com/wang-xinyu/tensorrtx).\n"
  },
  {
    "path": "yolov13/src/block.cpp",
    "content": "#include \"block.h\"\n#include <assert.h>\n#include <math.h>\n#include <fstream>\n#include <iostream>\n#include \"config.h\"\n#include \"model.h\"\n#include \"yololayer.h\"\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, nvinfer1::Weights> WeightMap;\n\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = nvinfer1::DataType::kFLOAT;\n\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; x++) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        wt.count = size;\n        WeightMap[name] = wt;\n        // std::cout << \"===========name:              \" << name << std::endl;\n    }\n    return WeightMap;\n}\n\nnvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                      std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                      std::string lname, float eps) {\n    cout << \"BatchNorm's name :             \" << lname << endl;\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float* scval = 
reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    // Fold batch norm into a per-channel scale: gamma / sqrt(var + eps)\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights scale{nvinfer1::DataType::kFLOAT, scval, len};\n\n    // Per-channel shift: beta - mean * gamma / sqrt(var + eps)\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, shval, len};\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, pval, len};\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    nvinfer1::IScaleLayer* output = network->addScale(input, nvinfer1::ScaleMode::kCHANNEL, shift, scale, power);\n    assert(output);\n    return output;\n}\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, std::vector<int> k, int s, std::string lname, int p, int g, int d) {\n\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n\n    // Check the layer before dereferencing it\n    assert(conv);\n    conv->setNbGroups(g);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    // auto pad (k / 2); the p argument is not used\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = 
network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nstatic nvinfer1::ILayer* bottleneck(nvinfer1::INetworkDefinition* network,\n                                    std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int c1, int c2, bool shortcut, std::vector<int> k1, std::vector<int> k2, float e,\n                                    int g, std::string lname) {\n    int c_ = (int)((float)c2 * e);\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c_, k1, 1, lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *conv1->getOutput(0), c2, k2, 1, lname + \".cv2\", 0, g);\n\n    if (shortcut && c1 == c2) {\n        nvinfer1::IElementWiseLayer* ew =\n                network->addElementWise(input, *conv2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        return ew;\n    }\n    return conv2;\n}\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname) {\n\n    nvinfer1::IShuffleLayer* shuffle1 = network->addShuffle(input);\n    shuffle1->setReshapeDimensions(nvinfer1::Dims4{kBatchSize, 4, 16, grid});\n    shuffle1->setSecondTranspose(nvinfer1::Permutation{0, 2, 1, 3});\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*shuffle1->getOutput(0));\n    softmax->setAxes(1 << 1);\n\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(*softmax->getOutput(0), 1, nvinfer1::DimsHW{1, 1}, weightMap[lname], bias_empty);\n    
conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n\n    nvinfer1::IShuffleLayer* shuffle2 = network->addShuffle(*conv->getOutput(0));\n    shuffle2->setReshapeDimensions(nvinfer1::Dims3{kBatchSize, 4, grid});\n\n    return shuffle2;\n}\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> dets, const int* px_arry,\n                                       int px_arry_num) {\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    // Packing: class_num, input_w, input_h, max_out\n    const int netinfo_count = 4;\n    const int total_count = netinfo_count + px_arry_num;\n\n    std::vector<int> combinedInfo(total_count);\n\n    // Fill in the first 4 elements\n    combinedInfo[0] = kNumClass;\n    combinedInfo[1] = kInputW;\n    combinedInfo[2] = kInputH;\n    combinedInfo[3] = kMaxNumOutputBbox;\n\n    // Copy the contents of px_arry into the combinedInfo vector\n    std::copy(px_arry, px_arry + px_arry_num, combinedInfo.begin() + netinfo_count);\n\n    // Now let's create the PluginField object to hold this combined information.\n    nvinfer1::PluginField pluginField;\n    pluginField.name = \"combinedInfo\";\n    pluginField.data = combinedInfo.data();\n    pluginField.type = nvinfer1::PluginFieldType::kINT32;\n    pluginField.length = combinedInfo.size();\n\n    // Create the PluginFieldCollection\n    nvinfer1::PluginFieldCollection pluginFieldCollection;\n    pluginFieldCollection.nbFields = 1;\n    pluginFieldCollection.fields = &pluginField;\n\n    // Create the plugin object\n    nvinfer1::IPluginV2* pluginObject = creator->createPlugin(\"yololayer\", &pluginFieldCollection);\n\n    // Prepare input tensors for the YOLO Layer.\n    std::vector<nvinfer1::ITensor*> inputTensors;\n    for (auto det : dets) {\n        inputTensors.push_back(det->getOutput(0));\n    }\n\n    // Add 
the plugin to the network\n    nvinfer1::IPluginV2Layer* yoloLayer = network->addPluginV2(inputTensors.data(), inputTensors.size(), *pluginObject);\n\n    return yoloLayer;\n}\n\nnvinfer1::ILayer* Conv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                       nvinfer1::ITensor& input, int c_out, std::string lname, int k, int s, int padding, int g,\n                       bool act) {\n    nvinfer1::Weights emptywts{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    cout << \"Conv name: \" << lname << endl;\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, c_out, nvinfer1::DimsHW{k, k},\n                                                                  weightMap[lname + \".conv.weight\"], emptywts);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    // auto pad\n    int p0 = k / 2;\n    int p1 = k / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n    conv->setNbGroups(g);\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n    if (act) {\n        nvinfer1::IActivationLayer* sigmoid =\n                network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n        nvinfer1::IElementWiseLayer* ew = network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0),\n                                                                  nvinfer1::ElementWiseOperation::kPROD);\n        assert(ew);\n        return ew;\n    } else\n        return bn;\n}\n\nnvinfer1::ILayer* DWConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int ch, std::vector<int> k, int s, std::string lname) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k[0], k[1]},\n                                   
                               weightMap[lname + \".conv.weight\"], bias_empty);\n\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setNbGroups(ch);\n    // auto pad\n    int p0 = k[0] / 2;\n    int p1 = k[1] / 2;\n    conv->setPaddingNd(nvinfer1::DimsHW{p0, p1});\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nnvinfer1::IElementWiseLayer* C3k(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c2,\n                                 std::string lname, int n, bool shortcut, int g, float e, int k) {\n    int c_ = c2 * float(e);\n\n    nvinfer1::IElementWiseLayer* cv1 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* cv2 = convBnSiLU(network, weightMap, input, c_, {1, 1}, 1, lname + \".cv2\");\n    nvinfer1::ITensor* y = cv1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        nvinfer1::ILayer* b = bottleneck(network, weightMap, *y, c_, c_, shortcut, {k, k}, {k, k}, 1.0, g,\n                                         lname + \".m.\" + std::to_string(i));\n        y = b->getOutput(0);\n    }\n    nvinfer1::ITensor* inputTensor[] = {y, cv2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor, 2);\n    nvinfer1::IElementWiseLayer* cv3 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv3\");\n\n    return cv3;\n}\n\nnvinfer1::IElementWiseLayer* C3K2(nvinfer1::INetworkDefinition* network,\n                             
     std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c2,\n                                  int n, std::string lname, bool c3k, float e, int g, bool shortcut) {\n    int c = int(c2 * float(e));\n    nvinfer1::ILayer* cv1 = Conv(network, weightMap, input, 2 * c, lname + \".cv1\", 1, 1);\n    nvinfer1::ISliceLayer* sl0 = network->addSlice(\n            *cv1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n            nvinfer1::Dims4{cv1->getOutput(0)->getDimensions().d[0], cv1->getOutput(0)->getDimensions().d[1] / 2,\n                            cv1->getOutput(0)->getDimensions().d[2], cv1->getOutput(0)->getDimensions().d[3]},\n            nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* sl1 = network->addSlice(\n            *cv1->getOutput(0), nvinfer1::Dims4{0, cv1->getOutput(0)->getDimensions().d[1] / 2, 0, 0},\n            nvinfer1::Dims4{cv1->getOutput(0)->getDimensions().d[0], cv1->getOutput(0)->getDimensions().d[1] / 2,\n                            cv1->getOutput(0)->getDimensions().d[2], cv1->getOutput(0)->getDimensions().d[3]},\n            nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ITensor* inputTensor0[] = {sl0->getOutput(0), sl1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor0, 2);\n    nvinfer1::ITensor* current = sl1->getOutput(0);\n\n    for (int i = 0; i < n; i++) {\n        nvinfer1::ILayer* b;\n        if (c3k) {\n            b = C3k(network, weightMap, *current, c, lname + \".m.\" + std::to_string(i), 2, shortcut, g);\n        } else {\n            b = bottleneck(network, weightMap, *current, c, c, shortcut, {3, 3}, {3, 3}, 0.5, g,\n                           lname + \".m.\" + std::to_string(i));\n        }\n        current = b->getOutput(0);\n        nvinfer1::ITensor* inputTensors[] = {cat->getOutput(0), b->getOutput(0)};\n        cat = network->addConcatenation(inputTensors, 2);\n    }\n    nvinfer1::IElementWiseLayer* cv2 =\n            convBnSiLU(network, 
weightMap, *cat->getOutput(0), c2, {1, 1}, 1, lname + \".cv2\");\n    return cv2;\n}\n\nvoid cout_dim(nvinfer1::ITensor& input) {\n\n    nvinfer1::Dims d = input.getDimensions();\n\n    std::cout << \"======================= Dimensions =================================\" << std::endl;\n    std::cout << \"          \" << d.d[0] << std::endl;\n    std::cout << \"          \" << d.d[1] << std::endl;\n    std::cout << \"          \" << d.d[2] << std::endl;\n    std::cout << \"          \" << d.d[3] << std::endl;\n    std::cout << \"======================================================================\" << std::endl;\n}\n\nnvinfer1::ILayer* AAttn(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int dim, int num_heads, std::string lname, int area) {\n\n    nvinfer1::Dims d_input = input.getDimensions();\n    int B = d_input.d[0];\n    int C = d_input.d[1];\n    int H = d_input.d[2];\n    int W = d_input.d[3];\n    int N = W * H;\n    int head_dim = dim / num_heads;\n    int all_head_dim = head_dim * num_heads;\n\n    nvinfer1::ILayer* qk = Conv(network, weightMap, input, all_head_dim * 2, lname + \".qk\", 1, 1, 0, 1, false);\n    nvinfer1::IShuffleLayer* qk_flatten_t = network->addShuffle(*qk->getOutput(0));\n    qk_flatten_t->setReshapeDimensions(nvinfer1::Dims3{B, -1, N});\n    qk_flatten_t->setSecondTranspose(nvinfer1::Permutation{0, 2, 1});\n\n    nvinfer1::ILayer* v = Conv(network, weightMap, input, all_head_dim, lname + \".v\", 1, 1, 0, 1, false);\n    nvinfer1::IShuffleLayer* v_flatten_t = network->addShuffle(*v->getOutput(0));\n    v_flatten_t->setReshapeDimensions(nvinfer1::Dims3{B, -1, N});\n    v_flatten_t->setSecondTranspose(nvinfer1::Permutation{0, 2, 1});  // (1, 6400, 64)\n\n    nvinfer1::ILayer* pe = Conv(network, weightMap, *v->getOutput(0), dim, lname + \".pe\", 5, 1, 2, dim, false);\n\n    nvinfer1::ITensor* q_k = qk_flatten_t->getOutput(0);\n    
nvinfer1::ITensor* v_ = v_flatten_t->getOutput(0);\n    if (area > 1) {\n        B = B * area;\n        N = N / area;\n\n        nvinfer1::IShuffleLayer* qk_reshape = network->addShuffle(*qk_flatten_t->getOutput(0));\n        qk_reshape->setReshapeDimensions(nvinfer1::Dims3{B, N, C * 2});\n        nvinfer1::IShuffleLayer* v_reshape = network->addShuffle(*v_flatten_t->getOutput(0));\n        v_reshape->setReshapeDimensions(nvinfer1::Dims3{B, N, C});\n\n        q_k = qk_reshape->getOutput(0);\n        v_ = v_reshape->getOutput(0);\n    }\n    nvinfer1::Dims q_k_dim = q_k->getDimensions();\n    nvinfer1::ISliceLayer* q =\n            network->addSlice(*q_k, nvinfer1::Dims3{0, 0, 0},\n                              nvinfer1::Dims3{q_k_dim.d[0], q_k_dim.d[1], q_k_dim.d[2] / 2}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* k =\n            network->addSlice(*q_k, nvinfer1::Dims3{0, 0, q_k_dim.d[2] / 2},\n                              nvinfer1::Dims3{q_k_dim.d[0], q_k_dim.d[1], q_k_dim.d[2] / 2}, nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* q_reshape = network->addShuffle(*q->getOutput(0));\n    q_reshape->setReshapeDimensions(nvinfer1::Dims4{B, N, num_heads, head_dim});\n    nvinfer1::IShuffleLayer* k_reshape = network->addShuffle(*k->getOutput(0));\n    k_reshape->setReshapeDimensions(nvinfer1::Dims4{B, N, num_heads, head_dim});\n    nvinfer1::IShuffleLayer* v_reshape = network->addShuffle(*v_);\n    v_reshape->setReshapeDimensions(nvinfer1::Dims4{B, N, num_heads, head_dim});\n\n    // (B, N, num_head, head_dim)--->(B, num_head, head_dim, N)\n    nvinfer1::IShuffleLayer* q_t_view = network->addShuffle(*q_reshape->getOutput(0));\n    q_t_view->setFirstTranspose(nvinfer1::Permutation{0, 2, 3, 1});\n\n    nvinfer1::IShuffleLayer* k_t_view = network->addShuffle(*k_reshape->getOutput(0));\n    k_t_view->setFirstTranspose(nvinfer1::Permutation{0, 2, 3, 1});\n    nvinfer1::IShuffleLayer* v_t_view = network->addShuffle(*v_reshape->getOutput(0));\n    
v_t_view->setFirstTranspose(nvinfer1::Permutation{0, 2, 3, 1});\n\n    nvinfer1::IShuffleLayer* q_T = network->addShuffle(*q_t_view->getOutput(0));\n    q_T->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});  // (B, num_head, N, head_dim)\n    nvinfer1::IMatrixMultiplyLayer* q_mul_k =\n            network->addMatrixMultiply(*q_T->getOutput(0), nvinfer1::MatrixOperation::kNONE, *k_t_view->getOutput(0),\n                                       nvinfer1::MatrixOperation::kNONE);\n\n    // Scale the attention logits by 1 / sqrt(head_dim)\n    float scale = 1.0 / sqrt(head_dim);\n    float* scale_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    scale_val[0] = scale;\n    nvinfer1::Weights s_w{nvinfer1::DataType::kFLOAT, scale_val, 1};  // scale\n    float* shift_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    shift_val[0] = 0;\n    nvinfer1::Weights sh_w{nvinfer1::DataType::kFLOAT, shift_val, 1};  // shift\n    float* power_val = reinterpret_cast<float*>(malloc(sizeof(float) * 1));\n    power_val[0] = 1;\n    nvinfer1::Weights p_w{nvinfer1::DataType::kFLOAT, power_val, 1};  // power\n    nvinfer1::IScaleLayer* q_mul_k_scale =\n            network->addScale(*q_mul_k->getOutput(0), nvinfer1::ScaleMode::kUNIFORM, sh_w, s_w, p_w);\n\n    // Numerically stable softmax over the last axis: exp(x - max) / sum(exp(x - max))\n    nvinfer1::IReduceLayer* attn_max =\n            network->addReduce(*q_mul_k_scale->getOutput(0), nvinfer1::ReduceOperation::kMAX, 1 << 3, true);\n\n    nvinfer1::IElementWiseLayer* attn_sub = network->addElementWise(\n            *q_mul_k_scale->getOutput(0), *attn_max->getOutput(0), nvinfer1::ElementWiseOperation::kSUB);\n    nvinfer1::IUnaryLayer* attn_exp = network->addUnary(*attn_sub->getOutput(0), nvinfer1::UnaryOperation::kEXP);\n    nvinfer1::IReduceLayer* attn_sum =\n            network->addReduce(*attn_exp->getOutput(0), nvinfer1::ReduceOperation::kSUM, 1 << 3, true);\n\n    nvinfer1::IElementWiseLayer* attn_div = network->addElementWise(*attn_exp->getOutput(0), *attn_sum->getOutput(0),\n                                                                 
   nvinfer1::ElementWiseOperation::kDIV);\n    cout_dim(*attn_div->getOutput(0));\n\n    nvinfer1::IShuffleLayer* attn_t = network->addShuffle(*attn_div->getOutput(0));\n    attn_t->setFirstTranspose(nvinfer1::Permutation{0, 1, 3, 2});\n\n    nvinfer1::IMatrixMultiplyLayer* attn_v =\n            network->addMatrixMultiply(*v_t_view->getOutput(0), nvinfer1::MatrixOperation::kNONE, *attn_t->getOutput(0),\n                                       nvinfer1::MatrixOperation::kNONE);\n\n    nvinfer1::IShuffleLayer* attn_v_t = network->addShuffle(*attn_v->getOutput(0));\n    attn_v_t->setFirstTranspose(nvinfer1::Permutation{0, 3, 1, 2});\n    nvinfer1::ITensor* attn_temp = attn_v_t->getOutput(0);\n    if (area > 1) {\n        B = B / area;\n        N = N * area;\n\n        nvinfer1::IShuffleLayer* attn_v_t_r = network->addShuffle(*attn_v_t->getOutput(0));\n        attn_v_t_r->setReshapeDimensions(nvinfer1::Dims3{B, N, C});\n        attn_temp = attn_v_t_r->getOutput(0);\n    }\n    nvinfer1::IShuffleLayer* attn_x = network->addShuffle(*attn_temp);\n    attn_x->setReshapeDimensions(nvinfer1::Dims4{B, H, W, C});\n    attn_x->setSecondTranspose(nvinfer1::Permutation{0, 3, 1, 2});\n    nvinfer1::IElementWiseLayer* x_add_pp =\n            network->addElementWise(*attn_x->getOutput(0), *pe->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    nvinfer1::ILayer* proj = Conv(network, weightMap, *x_add_pp->getOutput(0), dim, lname + \".proj\", 1, 1, 0, 1, false);\n\n    return proj;\n}\n\nnvinfer1::IElementWiseLayer* ABlock(nvinfer1::INetworkDefinition* network,\n                                    std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int dim, int num_heads, std::string lname, float mlp_ratio, int area) {\n\n    nvinfer1::ILayer* attn = AAttn(network, weightMap, input, dim, num_heads, lname + \".attn\", area);\n    nvinfer1::IElementWiseLayer* add1 =  // x = x + self.attn(x)\n            
network->addElementWise(input, *attn->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    int mlp_hidden_dim = int(dim * mlp_ratio);\n\n    nvinfer1::ILayer* mlp_0 =\n            Conv(network, weightMap, *add1->getOutput(0), mlp_hidden_dim, lname + \".mlp.0\", 1, 1, 0, 1, true);\n    nvinfer1::ILayer* mlp_1 = Conv(network, weightMap, *mlp_0->getOutput(0), dim, lname + \".mlp.1\", 1, 1, 0, 1, false);\n\n    nvinfer1::IElementWiseLayer* result =\n            network->addElementWise(*add1->getOutput(0), *mlp_1->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    return result;\n}\n\nnvinfer1::ILayer* A2C2f(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int c2, int n, std::string lname, bool a2, int area, bool residual,\n                        float mlp_ratio, float e, int g, bool shortcut) {\n\n    int c_ = static_cast<int>(c2 * e);\n    assert(c_ % 32 == 0 && \"Dimension of ABlock must be a multiple of 32\");\n    int num_heads = c_ / 32;\n\n    nvinfer1::ILayer* cv1 = Conv(network, weightMap, input, c_, lname + \".cv1\", 1, 1);\n    std::vector<nvinfer1::ITensor*> y{cv1->getOutput(0)};\n    nvinfer1::ITensor* current = cv1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        if (a2) {\n            nvinfer1::ILayer* m_0 = ABlock(network, weightMap, *current, c_, num_heads,\n                                           lname + \".m.\" + std::to_string(i) + \".0\", mlp_ratio, area);\n            nvinfer1::ILayer* m_1 = ABlock(network, weightMap, *m_0->getOutput(0), c_, num_heads,\n                                           lname + \".m.\" + std::to_string(i) + \".1\", mlp_ratio, area);\n            current = m_1->getOutput(0);\n        } else {\n            // C3k\n            nvinfer1::ILayer* m =\n                    C3k(network, weightMap, *current, c_, lname + \".m.\" + std::to_string(i), 2, shortcut, g);\n            current = m->getOutput(0);\n        
}\n        y.push_back(current);\n    }\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(y.data(), static_cast<int>(y.size()));\n    cat->setAxis(1);\n    nvinfer1::ILayer* cv2 = Conv(network, weightMap, *cat->getOutput(0), c2, lname + \".cv2\", 1, 1);\n\n    if (a2 && residual) {\n        std::cout << lname << \" applying residual connection with gamma\" << std::endl;\n\n        nvinfer1::Weights gamma = weightMap[lname + \".gamma\"];\n\n        nvinfer1::IConstantLayer* gamma_layer = network->addConstant(nvinfer1::Dims4{1, c2, 1, 1}, gamma);\n        nvinfer1::IElementWiseLayer* scaled_output = network->addElementWise(\n                *gamma_layer->getOutput(0), *cv2->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n        nvinfer1::IElementWiseLayer* result =\n                network->addElementWise(input, *scaled_output->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n\n        return result;\n    } else {\n\n        return cv2;\n    }\n}\n\nnvinfer1::IElementWiseLayer* DSConv(nvinfer1::INetworkDefinition* network,\n                                    std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                    int c_in, int c_out, std::string lname, int k, int s, int p, int d, bool bias) {\n    if (p == 0) {\n        p = (d * (k - 1)) / 2;\n    }\n    nvinfer1::Weights emptywts{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* dw =\n            network->addConvolutionNd(input, c_in, nvinfer1::DimsHW{k, k}, weightMap[lname + \".dw.weight\"], emptywts);\n    dw->setStrideNd(nvinfer1::DimsHW{s, s});\n    dw->setPaddingNd(nvinfer1::DimsHW{p, p});\n    dw->setNbGroups(c_in);\n    dw->setDilationNd(nvinfer1::DimsHW{d, d});\n\n    nvinfer1::IConvolutionLayer* pw = network->addConvolutionNd(*dw->getOutput(0), c_out, nvinfer1::DimsHW{1, 1},\n                                                                weightMap[lname + \".pw.weight\"], emptywts);\n    
pw->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pw->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    pw->setNbGroups(1);\n    pw->setDilationNd(nvinfer1::DimsHW{1, 1});\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *pw->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nnvinfer1::ILayer* DSBottleneck(nvinfer1::INetworkDefinition* network,\n                               std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                               int c2, std::string lname, bool shortcut, float e, int k1, int k2, int d2) {\n    int c_ = float(e) * c2;\n    nvinfer1::IElementWiseLayer* cv1 = DSConv(network, weightMap, input, c1, c_, lname + \".cv1\", k1, 1, 0, 1, false);\n    nvinfer1::IElementWiseLayer* y =\n            DSConv(network, weightMap, *cv1->getOutput(0), c_, c2, lname + \".cv2\", k2, 1, 0, d2, false);\n    if (c1 == c2 && shortcut) {\n        nvinfer1::IElementWiseLayer* add =\n                network->addElementWise(input, *y->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        return add;\n    } else\n        return y;\n}\n\nnvinfer1::ILayer* DSC3k(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                        nvinfer1::ITensor& input, int c2, int n, std::string lname, bool shortcut, int g, float e,\n                        int k1, int k2, int d2) {\n    int c_ = float(e) * c2;\n    nvinfer1::ILayer* cv1 = Conv(network, weightMap, input, c_, lname + \".cv1\", 1, 1);\n    nvinfer1::ILayer* cv2 = Conv(network, weightMap, input, c_, lname + \".cv2\", 1, 1);\n    nvinfer1::ITensor* current = cv1->getOutput(0);\n    for (int i = 0; i < 
n; i++) {\n        nvinfer1::ILayer* m_ = DSBottleneck(network, weightMap, *current, c_, c_, lname + \".m.\" + std::to_string(i),\n                                            shortcut, 1.0, k1, k2, d2);\n        current = m_->getOutput(0);\n    }\n    nvinfer1::ITensor* inputTensors[] = {current, cv2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensors, 2);\n    nvinfer1::ILayer* cv3 = Conv(network, weightMap, *cat->getOutput(0), c2, lname + \".cv3\", 1, 1);\n\n    return cv3;\n}\n\nnvinfer1::ILayer* DSC3K2(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                         nvinfer1::ITensor& input, int c2, std::string lname, int n, bool dsc3k, float e, int g,\n                         bool shortcut, int k1, int k2, int d2) {\n    int c = float(e) * c2;\n    nvinfer1::ILayer* cv1 = Conv(network, weightMap, input, 2 * c, lname + \".cv1\");\n    nvinfer1::Dims dim_cv1 = cv1->getOutput(0)->getDimensions();\n    nvinfer1::ISliceLayer* sl0 = network->addSlice(\n            *cv1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n            nvinfer1::Dims4{dim_cv1.d[0], dim_cv1.d[1] / 2, dim_cv1.d[2], dim_cv1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* sl1 = network->addSlice(\n            *cv1->getOutput(0), nvinfer1::Dims4{0, dim_cv1.d[1] / 2, 0, 0},\n            nvinfer1::Dims4{dim_cv1.d[0], dim_cv1.d[1] / 2, dim_cv1.d[2], dim_cv1.d[3]}, nvinfer1::Dims4{1, 1, 1, 1});\n    std::vector<nvinfer1::ITensor*> y = {sl0->getOutput(0), sl1->getOutput(0)};\n    nvinfer1::ITensor* current = sl1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        if (dsc3k) {\n            nvinfer1::ILayer* m_ = DSC3k(network, weightMap, *current, c, 2, lname + \".m.\" + std::to_string(i),\n                                         shortcut, g, 1.0, k1, k2, d2);\n            current = m_->getOutput(0);\n            y.push_back(current);\n        } else {\n            nvinfer1::ILayer* 
m_ = DSBottleneck(network, weightMap, *current, c, c, lname + \".m.\" + std::to_string(i),\n                                                shortcut, 1.0, k1, k2, d2);\n            current = m_->getOutput(0);\n            y.push_back(current);\n        }\n    }\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(y.data(), y.size());\n    nvinfer1::ILayer* cv2 = Conv(network, weightMap, *cat->getOutput(0), c2, lname + \".cv2\");\n\n    return cv2;\n}\n\nnvinfer1::ILayer* FuseModule(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             std::vector<nvinfer1::ITensor*>& input, int c_in, bool channel_adjust, std::string lname) {\n    nvinfer1::IPoolingLayer* x1_ds =\n            network->addPoolingNd(*input[0], nvinfer1::PoolingType::kAVERAGE, nvinfer1::DimsHW{2, 2});\n    x1_ds->setStrideNd(nvinfer1::DimsHW{2, 2});\n    x1_ds->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    nvinfer1::IResizeLayer* x3_up = network->addResize(*input[2]);\n    float scale[] = {1, 1, 2, 2};\n    x3_up->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    x3_up->setScales(scale, 4);\n\n    nvinfer1::ITensor* inputTensor[] = {x1_ds->getOutput(0), input[1], x3_up->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor, 3);\n    cat->setAxis(1);\n    nvinfer1::ILayer* conv_out = Conv(network, weightMap, *cat->getOutput(0), c_in, lname + \".conv_out\");\n    return conv_out;\n}\n\nnvinfer1::ISoftMaxLayer* AdaHyperedgeGen(nvinfer1::INetworkDefinition* network,\n                                         std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                         int node_dim, int num_hyperedges, std::string lname, int num_heads,\n                                         std::string context) {\n\n    nvinfer1::Dims dim_input = input.getDimensions();\n    int B = dim_input.d[0];\n    int N = dim_input.d[1];\n    int D = 
dim_input.d[2];\n    int head_dim = node_dim / num_heads;\n    nvinfer1::ITensor* context_cat = nullptr;\n    if (context == \"mean\") {\n        nvinfer1::IReduceLayer* context_mean =\n                network->addReduce(input, nvinfer1::ReduceOperation::kAVG, 1 << 1, false);\n        context_cat = context_mean->getOutput(0);\n    } else if (context == \"max\") {\n        nvinfer1::IReduceLayer* context_max = network->addReduce(input, nvinfer1::ReduceOperation::kMAX, 1 << 1, false);\n        context_cat = context_max->getOutput(0);\n    } else {\n        nvinfer1::IReduceLayer* context_mean =\n                network->addReduce(input, nvinfer1::ReduceOperation::kAVG, 1 << 1, false);\n        nvinfer1::IReduceLayer* context_max = network->addReduce(input, nvinfer1::ReduceOperation::kMAX, 1 << 1, false);\n        nvinfer1::ITensor* inputTensor[] = {context_mean->getOutput(0), context_max->getOutput(0)};\n        nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor, 2);\n        cat->setAxis(1 << 0);\n        context_cat = cat->getOutput(0);\n    }\n\n    nvinfer1::IShuffleLayer* context_cat_dim4 = network->addShuffle(*context_cat);\n    context_cat_dim4->setReshapeDimensions(\n            nvinfer1::Dims4{context_cat->getDimensions().d[0], context_cat->getDimensions().d[1], 1, 1});\n    nvinfer1::IFullyConnectedLayer* prototypes_offsets_ = network->addFullyConnected(\n            *context_cat_dim4->getOutput(0), num_hyperedges * node_dim, weightMap[lname + \".context_net.weight\"],\n            weightMap[lname + \".context_net.bias\"]);\n    nvinfer1::IShuffleLayer* prototypes_offsets = network->addShuffle(*prototypes_offsets_->getOutput(0));\n    prototypes_offsets->setReshapeDimensions(nvinfer1::Dims3{B, num_hyperedges, D});\n    // prototype_offsets = self.context_net(context_cat).view(B, self.num_hyperedges, D)\n\n    nvinfer1::Weights prototype_base_wts = weightMap[lname + \".prototype_base\"];\n    nvinfer1::IConstantLayer* prototype_base 
=\n            network->addConstant(nvinfer1::Dims3{1, num_hyperedges, node_dim}, prototype_base_wts);\n    nvinfer1::IElementWiseLayer* prototypes = network->addElementWise(\n            *prototype_base->getOutput(0), *prototypes_offsets->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    // prototypes = self.prototype_base.unsqueeze(0) + prototype_offsets\n\n    nvinfer1::IShuffleLayer* input_dim4 = network->addShuffle(input);\n    input_dim4->setReshapeDimensions(nvinfer1::Dims4{B * N, D, 1, 1});\n    nvinfer1::IFullyConnectedLayer* X_proj =\n            network->addFullyConnected(*input_dim4->getOutput(0), node_dim, weightMap[lname + \".pre_head_proj.weight\"],\n                                       weightMap[lname + \".pre_head_proj.bias\"]);\n    // X_proj = self.pre_head_proj(X)\n\n    nvinfer1::IShuffleLayer* X_heads = network->addShuffle(*X_proj->getOutput(0));\n    X_heads->setReshapeDimensions(nvinfer1::Dims4{B, N, num_heads, head_dim});\n    X_heads->setSecondTranspose(nvinfer1::Permutation{0, 2, 1, 3});\n    // X_heads = X_proj.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)\n\n    nvinfer1::IShuffleLayer* proto_heads = network->addShuffle(*prototypes->getOutput(0));\n    proto_heads->setReshapeDimensions(nvinfer1::Dims4{B, num_hyperedges, num_heads, head_dim});\n    proto_heads->setSecondTranspose(nvinfer1::Permutation{0, 2, 1, 3});\n    // proto_heads = prototypes.view(B, self.num_hyperedges, self.num_heads, self.head_dim).permute(0, 2, 1, 3)\n\n    nvinfer1::IShuffleLayer* X_heads_flat = network->addShuffle(*X_heads->getOutput(0));\n    X_heads_flat->setReshapeDimensions(nvinfer1::Dims3{B * num_heads, N, head_dim});\n    // X_heads_flat = X_heads.reshape(B * self.num_heads, N, self.head_dim)\n\n    nvinfer1::IShuffleLayer* proto_heads_flat = network->addShuffle(*proto_heads->getOutput(0));\n    proto_heads_flat->setReshapeDimensions(nvinfer1::Dims3{B * num_heads, num_hyperedges, head_dim});\n    
proto_heads_flat->setSecondTranspose(nvinfer1::Permutation{0, 2, 1});\n    //proto_heads_flat = proto_heads.reshape(B * self.num_heads, self.num_hyperedges, self.head_dim).transpose(1, 2)\n\n    nvinfer1::IMatrixMultiplyLayer* logits =\n            network->addMatrixMultiply(*X_heads_flat->getOutput(0), nvinfer1::MatrixOperation::kNONE,\n                                       *proto_heads_flat->getOutput(0), nvinfer1::MatrixOperation::kNONE);\n    float* scales_ptr = reinterpret_cast<float*>(malloc(sizeof(float)));\n    *scales_ptr = sqrt(static_cast<float>(head_dim));\n    nvinfer1::Weights scale_wts{nvinfer1::DataType::kFLOAT, scales_ptr, 1};\n    nvinfer1::IConstantLayer* scale_layer = network->addConstant(nvinfer1::Dims3{1, 1, 1}, scale_wts);\n    // keep weight alive during build\n    weightMap[lname + \".scaling\"] = scale_wts;\n    nvinfer1::IElementWiseLayer* logits_scale = network->addElementWise(\n            *logits->getOutput(0), *scale_layer->getOutput(0), nvinfer1::ElementWiseOperation::kDIV);\n    // logits = torch.bmm(X_heads_flat, proto_heads_flat) / self.scaling\n\n    nvinfer1::IShuffleLayer* logits_scale_view = network->addShuffle(*logits_scale->getOutput(0));\n    logits_scale_view->setReshapeDimensions(nvinfer1::Dims4{B, num_heads, N, num_hyperedges});\n    nvinfer1::IReduceLayer* logits_scale_view_mean =\n            network->addReduce(*logits_scale_view->getOutput(0), nvinfer1::ReduceOperation::kAVG, 1 << 1, false);\n\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*logits_scale_view_mean->getOutput(0));\n    softmax->setAxes(1 << 1);\n\n    return softmax;\n}\n\nnvinfer1::IElementWiseLayer* GELU(nvinfer1::INetworkDefinition* network, nvinfer1::ITensor& input) {\n    static float sqrt_2_over_pi = 0.797885f;  // 0.7978845608\n    static float kappa = 0.044715f;\n    static float one = 1.0f;\n    static float half = 0.5f;\n\n    nvinfer1::IElementWiseLayer* x3_layer =\n            network->addElementWise(input, input, 
nvinfer1::ElementWiseOperation::kPROD);\n    nvinfer1::ITensor* x2 = x3_layer->getOutput(0);\n    x3_layer = network->addElementWise(*x2, input, nvinfer1::ElementWiseOperation::kPROD);\n    nvinfer1::ITensor* x3 = x3_layer->getOutput(0);\n\n    nvinfer1::Weights kappa_weight{nvinfer1::DataType::kFLOAT, &kappa, 1};\n    nvinfer1::IConstantLayer* kappa_const = network->addConstant(nvinfer1::Dims4{1, 1, 1, 1}, kappa_weight);\n    nvinfer1::IElementWiseLayer* scaled_x3 =\n            network->addElementWise(*x3, *kappa_const->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n\n    nvinfer1::IElementWiseLayer* inner_sum =\n            network->addElementWise(input, *scaled_x3->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n    nvinfer1::ITensor* inner = inner_sum->getOutput(0);\n\n    nvinfer1::Weights sqrt_weight{nvinfer1::DataType::kFLOAT, &sqrt_2_over_pi, 1};\n    nvinfer1::IConstantLayer* sqrt_const = network->addConstant(nvinfer1::Dims4{1, 1, 1, 1}, sqrt_weight);\n    nvinfer1::IElementWiseLayer* scaled_inner =\n            network->addElementWise(*inner, *sqrt_const->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n\n    nvinfer1::IActivationLayer* tanh_layer =\n            network->addActivation(*scaled_inner->getOutput(0), nvinfer1::ActivationType::kTANH);\n\n    nvinfer1::Weights one_weight{nvinfer1::DataType::kFLOAT, &one, 1};\n    nvinfer1::IConstantLayer* one_const = network->addConstant(nvinfer1::Dims4{1, 1, 1, 1}, one_weight);\n    nvinfer1::IElementWiseLayer* add_one = network->addElementWise(*tanh_layer->getOutput(0), *one_const->getOutput(0),\n                                                                   nvinfer1::ElementWiseOperation::kSUM);\n\n    nvinfer1::IElementWiseLayer* half_x =\n            network->addElementWise(input, *add_one->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n\n    nvinfer1::Weights half_weight{nvinfer1::DataType::kFLOAT, &half, 1};\n    nvinfer1::IConstantLayer* half_const = 
network->addConstant(nvinfer1::Dims4{1, 1, 1, 1}, half_weight);\n    nvinfer1::IElementWiseLayer* gelu = network->addElementWise(*half_x->getOutput(0), *half_const->getOutput(0),\n                                                                nvinfer1::ElementWiseOperation::kPROD);\n    return gelu;\n}\n\nnvinfer1::IElementWiseLayer* AdaHGConv(nvinfer1::INetworkDefinition* network,\n                                       std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                       int embed_dim, std::string lname, int num_hyperedges, int num_heads,\n                                       std::string context) {\n\n    // {B, N, num_hyperedges}\n    nvinfer1::ISoftMaxLayer* A = AdaHyperedgeGen(network, weightMap, input, embed_dim, num_hyperedges,\n                                                 lname + \".edge_generator\", num_heads, context);\n    nvinfer1::IMatrixMultiplyLayer* He = network->addMatrixMultiply(  // 486 layer\n            *A->getOutput(0), nvinfer1::MatrixOperation::kTRANSPOSE, input, nvinfer1::MatrixOperation::kNONE);\n    nvinfer1::IShuffleLayer* He_dim4 = network->addShuffle(*He->getOutput(0));\n    He_dim4->setReshapeDimensions(nvinfer1::Dims4{He->getOutput(0)->getDimensions().d[1],\n                                                  He->getOutput(0)->getDimensions().d[0],\n                                                  He->getOutput(0)->getDimensions().d[2], 1});\n\n    nvinfer1::IFullyConnectedLayer* He_edge_proj_ =\n            network->addFullyConnected(*He_dim4->getOutput(0), embed_dim, weightMap[lname + \".edge_proj.0.weight\"],\n                                       weightMap[lname + \".edge_proj.0.bias\"]);\n    nvinfer1::IElementWiseLayer* He_edge_proj = GELU(network, *He_edge_proj_->getOutput(0));\n    nvinfer1::IShuffleLayer* He_edge_proj_dim2 = network->addShuffle(*He_edge_proj->getOutput(0));\n    
He_edge_proj_dim2->setReshapeDimensions(nvinfer1::Dims2{He_edge_proj->getOutput(0)->getDimensions().d[0],\n                                                            He_edge_proj->getOutput(0)->getDimensions().d[1]});\n    nvinfer1::IShuffleLayer* A_dim2 = network->addShuffle(*A->getOutput(0));\n    A_dim2->setReshapeDimensions(\n            nvinfer1::Dims2{A->getOutput(0)->getDimensions().d[1] *\n                                    A->getOutput(0)->getDimensions().d[0],  // keep the batch information\n                            A->getOutput(0)->getDimensions().d[2]});\n    nvinfer1::IMatrixMultiplyLayer* x_new_ =\n            network->addMatrixMultiply(*A_dim2->getOutput(0), nvinfer1::MatrixOperation::kNONE,\n                                       *He_edge_proj_dim2->getOutput(0), nvinfer1::MatrixOperation::kNONE);\n    nvinfer1::IShuffleLayer* x_new_dim4 = network->addShuffle(*x_new_->getOutput(0));\n    x_new_dim4->setReshapeDimensions(nvinfer1::Dims4{x_new_->getOutput(0)->getDimensions().d[0],\n                                                     x_new_->getOutput(0)->getDimensions().d[1], 1, 1});\n    nvinfer1::IFullyConnectedLayer* x_new_node_proj_ =\n            network->addFullyConnected(*x_new_dim4->getOutput(0), embed_dim, weightMap[lname + \".node_proj.0.weight\"],\n                                       weightMap[lname + \".node_proj.0.bias\"]);\n    nvinfer1::IElementWiseLayer* x_new_node_proj = GELU(network, *x_new_node_proj_->getOutput(0));\n    nvinfer1::IShuffleLayer* x_new_finall = network->addShuffle(*x_new_node_proj->getOutput(0));\n    x_new_finall->setReshapeDimensions(nvinfer1::Dims3{1, x_new_node_proj->getOutput(0)->getDimensions().d[0],\n                                                       x_new_node_proj->getOutput(0)->getDimensions().d[1]});\n    nvinfer1::IElementWiseLayer* add =\n            network->addElementWise(*x_new_finall->getOutput(0), input, nvinfer1::ElementWiseOperation::kSUM);\n\n    return 
add;\n}\n\nnvinfer1::IShuffleLayer* AdaHGComputation(nvinfer1::INetworkDefinition* network,\n                                          std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                          int embed_dim, std::string lname, int num_hyperedges, int num_heads,\n                                          std::string context) {\n    nvinfer1::Dims dim = input.getDimensions();\n    int B = dim.d[0];\n    int C = dim.d[1];\n    int H = dim.d[2];\n    int W = dim.d[3];\n    nvinfer1::IShuffleLayer* tokens = network->addShuffle(input);\n    tokens->setReshapeDimensions(nvinfer1::Dims3{B, C, H * W});\n    tokens->setSecondTranspose(nvinfer1::Permutation{0, 2, 1});\n    nvinfer1::IElementWiseLayer* hgnn = AdaHGConv(network, weightMap, *tokens->getOutput(0), embed_dim, lname + \".hgnn\",\n                                                  num_hyperedges, num_heads, context);\n\n    nvinfer1::IShuffleLayer* x_out = network->addShuffle(*hgnn->getOutput(0));\n    x_out->setFirstTranspose(nvinfer1::Permutation{0, 2, 1});\n    x_out->setReshapeDimensions(nvinfer1::Dims4{B, C, H, W});\n\n    return x_out;\n}\n\nnvinfer1::ILayer* C3AH(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                       nvinfer1::ITensor& input, int c2, std::string lname, float e, int num_hyperedges,\n                       std::string context) {\n    int c_ = float(e) * c2;\n    assert(c_ % 16 == 0 && \"Dimension of AdaHGComputation should be a multiple of 16\");\n    int num_heads = c_ / 16;\n    nvinfer1::ILayer* cv1 = Conv(network, weightMap, input, c_, lname + \".cv1\");\n    nvinfer1::ILayer* cv2 = Conv(network, weightMap, input, c_, lname + \".cv2\");\n\n    nvinfer1::IShuffleLayer* m = AdaHGComputation(network, weightMap, *cv1->getOutput(0), c_, lname + \".m\",\n                                                  num_hyperedges, num_heads, context);\n    nvinfer1::ITensor* inputTensor[] 
= {m->getOutput(0), cv2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor, 2);\n    nvinfer1::ILayer* cv3 = Conv(network, weightMap, *cat->getOutput(0), c2, lname + \".cv3\");\n    return cv3;\n}\n\nnvinfer1::ILayer* HyperACE(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                           std::vector<nvinfer1::ITensor*> input, int c1, int c2, std::string lname, int n,\n                           int num_hyperedges, bool dsc3k, bool shortcut, float e1, float e2, std::string context,\n                           bool channel_adjust) {\n    int c = int(c2 * e1);\n    nvinfer1::ILayer* fuse = FuseModule(network, weightMap, input, c1, channel_adjust, lname + \".fuse\");\n    nvinfer1::ILayer* cv1 = Conv(network, weightMap, *fuse->getOutput(0), 3 * c, lname + \".cv1\");\n    nvinfer1::Dims d_cv1 = cv1->getOutput(0)->getDimensions();\n    nvinfer1::ISliceLayer* sl0 = network->addSlice(*cv1->getOutput(0), nvinfer1::Dims4{0, 0, 0, 0},\n                                                   nvinfer1::Dims4{d_cv1.d[0], d_cv1.d[1] / 3, d_cv1.d[2], d_cv1.d[3]},\n                                                   nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* sl1 = network->addSlice(*cv1->getOutput(0), nvinfer1::Dims4{0, d_cv1.d[1] / 3, 0, 0},\n                                                   nvinfer1::Dims4{d_cv1.d[0], d_cv1.d[1] / 3, d_cv1.d[2], d_cv1.d[3]},\n                                                   nvinfer1::Dims4{1, 1, 1, 1});\n    nvinfer1::ISliceLayer* sl2 = network->addSlice(*cv1->getOutput(0), nvinfer1::Dims4{0, d_cv1.d[1] / 3 * 2, 0, 0},\n                                                   nvinfer1::Dims4{d_cv1.d[0], d_cv1.d[1] / 3, d_cv1.d[2], d_cv1.d[3]},\n                                                   nvinfer1::Dims4{1, 1, 1, 1});\n    std::vector<nvinfer1::ITensor*> y = {sl0->getOutput(0), sl1->getOutput(0), sl2->getOutput(0)};\n    
nvinfer1::ILayer* out1 = C3AH(network, weightMap, *y[1], c, lname + \".branch1\", e2, num_hyperedges, context);\n    nvinfer1::ILayer* out2 = C3AH(network, weightMap, *y[1], c, lname + \".branch2\", e2, num_hyperedges, context);\n    nvinfer1::ITensor* current = y[2];\n    for (int i = 0; i < n; i++) {\n        if (dsc3k) {\n            nvinfer1::ILayer* m_ = DSC3k(network, weightMap, *current, c, 2, lname + \".m.\" + std::to_string(i),\n                                         shortcut, 1, 0.5, 3, 7, 1);\n            current = m_->getOutput(0);\n        } else {\n            nvinfer1::ILayer* m_ =\n                    DSBottleneck(network, weightMap, *current, c, c, lname + \".m.\" + std::to_string(i), shortcut);\n            current = m_->getOutput(0);\n        }\n        y.push_back(current);\n    }\n\n    y[1] = out1->getOutput(0);\n    y.push_back(out2->getOutput(0));\n\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(y.data(), y.size());\n    nvinfer1::ILayer* cv2 = Conv(network, weightMap, *cat->getOutput(0), c2, lname + \".cv2\");\n\n    return cv2;\n}\n\nnvinfer1::ILayer* DownsampleConv(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                 int in_channels, std::string lname, bool channel_adjust) {\n    nvinfer1::IPoolingLayer* downsample =\n            network->addPoolingNd(input, nvinfer1::PoolingType::kAVERAGE, nvinfer1::DimsHW{2, 2});\n    downsample->setStrideNd(nvinfer1::DimsHW{2, 2});\n    downsample->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    if (channel_adjust) {\n        nvinfer1::ILayer* channel_adjust_ =\n                Conv(network, weightMap, *downsample->getOutput(0), in_channels * 2, lname + \".channel_adjust\");\n        return channel_adjust_;\n    } else\n        return downsample;\n}\n\nnvinfer1::IElementWiseLayer* FullPad_Tunnel(nvinfer1::INetworkDefinition* network,\n                     
                       std::map<std::string, nvinfer1::Weights> weightMap,\n                                            std::vector<nvinfer1::ITensor*> input, std::string lname) {\n    nvinfer1::Weights gate = weightMap[lname + \".gate\"];\n    nvinfer1::IConstantLayer* gate_constant = network->addConstant(nvinfer1::Dims4{1, 1, 1, 1}, gate);\n    nvinfer1::IElementWiseLayer* scaled_input_1 =\n            network->addElementWise(*input[1], *gate_constant->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    nvinfer1::IElementWiseLayer* add =\n            network->addElementWise(*input[0], *scaled_input_1->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n\n    return add;\n}\n"
  },
  {
    "path": "yolov13/src/calibrator.cpp",
    "content": "#include \"calibrator.h\"\n#include <fstream>\n#include <iostream>\n#include <iterator>\n#include <opencv2/dnn/dnn.hpp>\n#include \"cuda_utils.h\"\n#include \"utils.h\"\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir,\n                                               const char* calib_table_name, const char* input_blob_name,\n                                               bool read_cache)\n    : batchsize_(batchsize),\n      input_w_(input_w),\n      input_h_(input_h),\n      img_idx_(0),\n      img_dir_(img_dir),\n      calib_table_name_(calib_table_name),\n      input_blob_name_(input_blob_name),\n      read_cache_(read_cache) {\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2() {\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT {\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT {\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + \"/\" + img_files_[i]);\n        if (temp.empty()) {\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(pr_img);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0),\n                                           true, false);\n    
CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT {\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good()) {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT {\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n"
  },
  {
    "path": "yolov13/src/model.cpp",
    "content": "#include <math.h>\n#include <iostream>\n\n#include \"block.h\"\n#include \"calibrator.h\"\n#include \"config.h\"\n#include \"model.h\"\n\nstatic int get_width(int x, float gw, int max_channels, int divisor = 8) {\n    auto channel = std::min(x, max_channels);\n    channel = int(ceil((channel * gw) / divisor)) * divisor;\n    return channel;\n}\n\nstatic int get_depth(int x, float gd) {\n    if (x == 1)\n        return 1;\n    int r = round(x * gd);\n    if (x * gd - int(x * gd) == 0.5 && (int(x * gd) % 2) == 0)\n        --r;\n    return std::max<int>(r, 1);\n}\n// Unused functions removed: convBnSiLUProto, Proto, cv4_conv_combined\n\nvoid calculateStrides(nvinfer1::IElementWiseLayer* conv_layers[], int size, int reference_size, int strides[]) {\n    for (int i = 0; i < size; ++i) {\n        nvinfer1::ILayer* layer = conv_layers[i];\n        nvinfer1::Dims dims = layer->getOutput(0)->getDimensions();\n        int feature_map_size = dims.d[2];\n        strides[i] = reference_size / feature_map_size;\n    }\n}\n\nvoid calculateStrides(nvinfer1::ILayer* conv_layers[], int size, int reference_size, int strides[]) {\n    for (int i = 0; i < size; ++i) {\n        nvinfer1::ILayer* layer = conv_layers[i];\n        nvinfer1::Dims dims = layer->getOutput(0)->getDimensions();\n        int feature_map_size = dims.d[2];\n        strides[i] = reference_size / feature_map_size;\n    }\n}\n\nnvinfer1::IHostMemory* buildEngineYolov13Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             int& max_channels, std::string& type) {\n\n    std::cout << \"The number of the KNumClass is \" << kNumClass << std::endl;\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(\n            1U << 
static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));\n\n    // =====================   input   ===================================================\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims4{kBatchSize, 3, kInputH, kInputW});\n    assert(data);\n\n    // =====================   backbone   ===================================================\n    nvinfer1::ILayer* conv0 = Conv(network, weightMap, *data, get_width(64, gw, max_channels), \"model.0\", 3, 2);\n    nvinfer1::ILayer* conv1 =\n            Conv(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), \"model.1\", 3, 2, 1, 2);\n\n    bool dsc3k = false;\n    float mlp_ratio = 2.0;\n    bool residual = false;\n    bool channel_adjust = true;\n    if (type == \"l\" || type == \"x\") {\n        mlp_ratio = 1.5;\n        residual = true;\n        dsc3k = true;\n        channel_adjust = false;\n    }\n    nvinfer1::ILayer* conv2 = DSC3K2(network, weightMap, *conv1->getOutput(0), get_width(256, gw, max_channels),\n                                     \"model.2\", get_depth(2, gd), dsc3k, 0.25);\n    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0),\n                                                    get_width(256, gw, max_channels), {3, 3}, 2, \"model.3\", 1, 4);\n    nvinfer1::ILayer* conv4 = DSC3K2(network, weightMap, *conv3->getOutput(0), get_width(512, gw, max_channels),\n                                     \"model.4\", get_depth(2, gd), dsc3k, 0.25);\n    nvinfer1::IElementWiseLayer* conv5 =\n            DSConv(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels),\n                   get_width(512, gw, max_channels), \"model.5\", 3, 2);\n    nvinfer1::ILayer* conv6 = A2C2f(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                    get_depth(4, gd), \"model.6\", true, 4, residual, mlp_ratio);\n    
nvinfer1::IElementWiseLayer* conv7 =\n            DSConv(network, weightMap, *conv6->getOutput(0), get_width(512, gw, max_channels),\n                   get_width(1024, gw, max_channels), \"model.7\", 3, 2);\n\n    nvinfer1::ILayer* conv8 = A2C2f(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                                    get_depth(4, gd), \"model.8\", true, 1, residual, mlp_ratio);\n\n    //=========================  neck ====================================================================\n    float scale[] = {1.0, 1.0, 2.0, 2.0};\n    int num_hyperedges = 8;\n    if (type == \"n\") {\n        num_hyperedges *= 0.5;\n    } else if (type == \"x\") {\n        num_hyperedges *= 1.5;\n    }\n\n    nvinfer1::ILayer* conv9 =\n            HyperACE(network, weightMap, {conv4->getOutput(0), conv6->getOutput(0), conv8->getOutput(0)},\n                     get_width(512, gw, max_channels), get_width(512, gw, max_channels), \"model.9\", get_depth(2, gd),\n                     num_hyperedges, true, true, 0.5, 1, \"both\", channel_adjust);\n\n    auto input_dims = conv9->getOutput(0)->getDimensions();\n    nvinfer1::IResizeLayer* upsample10 = network->addResize(*conv9->getOutput(0));\n    assert(upsample10);\n    upsample10->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample10->setOutputDimensions(\n            nvinfer1::Dims4{input_dims.d[0], input_dims.d[1], input_dims.d[2] * 2, input_dims.d[3] * 2});\n\n    nvinfer1::ILayer* downsample11 = DownsampleConv(network, weightMap, *conv9->getOutput(0),\n                                                    get_width(512, gw, max_channels), \"model.11\", channel_adjust);\n\n    nvinfer1::IElementWiseLayer* conv12 =  // conv6:(1, 128, 40, 40) conv9: (1, 128, 40, 40)\n            FullPad_Tunnel(network, weightMap, {conv6->getOutput(0), conv9->getOutput(0)}, \"model.12\");\n    nvinfer1::IElementWiseLayer* conv13 =\n            FullPad_Tunnel(network, weightMap, {conv4->getOutput(0), 
upsample10->getOutput(0)}, \"model.13\");\n\n    nvinfer1::IElementWiseLayer* conv14 =\n            FullPad_Tunnel(network, weightMap, {conv8->getOutput(0), downsample11->getOutput(0)}, \"model.14\");\n\n    nvinfer1::IResizeLayer* upsample15 = network->addResize(*conv14->getOutput(0));\n    assert(upsample15);\n    upsample15->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample15->setScales(scale, 4);\n    nvinfer1::ITensor* inputTensors16[] = {upsample15->getOutput(0), conv12->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat16 = network->addConcatenation(inputTensors16, 2);\n    nvinfer1::ILayer* conv17 = DSC3K2(network, weightMap, *cat16->getOutput(0), get_width(512, gw, max_channels),\n                                      \"model.17\", get_depth(2, gd), true);\n\n    nvinfer1::IElementWiseLayer* conv18 =\n            FullPad_Tunnel(network, weightMap, {conv17->getOutput(0), conv9->getOutput(0)}, \"model.18\");\n\n    nvinfer1::IResizeLayer* upsample19 = network->addResize(*conv17->getOutput(0));\n    assert(upsample19);\n    upsample19->setScales(scale, 4);\n    upsample19->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    nvinfer1::ITensor* inputTensors20[] = {upsample19->getOutput(0), conv13->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat20 = network->addConcatenation(inputTensors20, 2);\n    nvinfer1::ILayer* conv21 = DSC3K2(network, weightMap, *cat20->getOutput(0), get_width(256, gw, max_channels),\n                                      \"model.21\", get_depth(2, gd), true);\n\n    nvinfer1::ILayer* conv22 =\n            Conv(network, weightMap, *upsample10->getOutput(0), get_width(256, gw, max_channels), \"model.22\");\n    nvinfer1::IElementWiseLayer* conv23 =\n            FullPad_Tunnel(network, weightMap, {conv21->getOutput(0), conv22->getOutput(0)}, \"model.23\");\n\n    nvinfer1::ILayer* conv24 =\n            Conv(network, weightMap, *conv23->getOutput(0), get_width(256, gw, max_channels), \"model.24\", 3, 2);\n    
nvinfer1::ITensor* inputTensors25[] = {conv24->getOutput(0), conv18->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat25 = network->addConcatenation(inputTensors25, 2);\n    nvinfer1::ILayer* conv26 = DSC3K2(network, weightMap, *cat25->getOutput(0), get_width(512, gw, max_channels),\n                                      \"model.26\", get_depth(2, gd), true);\n    nvinfer1::IElementWiseLayer* conv27 =\n            FullPad_Tunnel(network, weightMap, {conv26->getOutput(0), conv9->getOutput(0)}, \"model.27\");\n\n    nvinfer1::ILayer* conv28 =\n            Conv(network, weightMap, *conv26->getOutput(0), get_width(512, gw, max_channels), \"model.28\", 3, 2);\n    nvinfer1::ITensor* inputTensors29[] = {conv28->getOutput(0), conv14->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat29 = network->addConcatenation(inputTensors29, 2);\n    nvinfer1::ILayer* conv30 = DSC3K2(network, weightMap, *cat29->getOutput(0), get_width(1024, gw, max_channels),\n                                      \"model.30\", get_depth(2, gd), true);\n    nvinfer1::IElementWiseLayer* conv31 =\n            FullPad_Tunnel(network, weightMap, {conv30->getOutput(0), downsample11->getOutput(0)}, \"model.31\");\n\n    // =============================== output ===================================================================\n    int c2 = std::max(std::max(16, get_width(256, gw, max_channels) / 4), 16 * 4);\n    int c3 = std::max(get_width(256, gw, max_channels), std::min(kNumClass, 100));\n\n    // output0   location\n    nvinfer1::IElementWiseLayer* conv32_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv23->getOutput(0), c2, {3, 3}, 1, \"model.32.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv32_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv32_cv2_0_0->getOutput(0), c2, {3, 3}, 1, \"model.32.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv32_cv2_0_2 =\n            network->addConvolutionNd(*conv32_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                 
                     weightMap[\"model.32.cv2.0.2.weight\"], weightMap[\"model.32.cv2.0.2.bias\"]);\n    conv32_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv32_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    // output0 classes\n    auto* conv32_cv3_0_0_0 = DWConv(network, weightMap, *conv23->getOutput(0), get_width(256, gw, max_channels), {3, 3},\n                                    1, \"model.32.cv3.0.0.0\");\n    nvinfer1::IElementWiseLayer* conv32_cv3_0_0_1 =\n            convBnSiLU(network, weightMap, *conv32_cv3_0_0_0->getOutput(0), c3, {1, 1}, 1, \"model.32.cv3.0.0.1\");\n\n    auto* conv32_cv3_0_1_0 =\n            DWConv(network, weightMap, *conv32_cv3_0_0_1->getOutput(0), c3, {3, 3}, 1, \"model.32.cv3.0.1.0\");\n    nvinfer1::IElementWiseLayer* conv32_cv3_0_1_1 =\n            convBnSiLU(network, weightMap, *conv32_cv3_0_1_0->getOutput(0), c3, {1, 1}, 1, \"model.32.cv3.0.1.1\");\n    nvinfer1::IConvolutionLayer* conv32_cv3_0_1_2 =\n            network->addConvolutionNd(*conv32_cv3_0_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.32.cv3.0.2.weight\"], weightMap[\"model.32.cv3.0.2.bias\"]);\n    conv32_cv3_0_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv32_cv3_0_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n\n    nvinfer1::ITensor* inputTensors32_0[] = {conv32_cv2_0_2->getOutput(0), conv32_cv3_0_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat32_0 = network->addConcatenation(inputTensors32_0, 2);\n\n    // out1 location\n    nvinfer1::IElementWiseLayer* conv32_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv27->getOutput(0), c2, {3, 3}, 1, \"model.32.cv2.1.0\");\n    nvinfer1::IElementWiseLayer* conv32_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv32_cv2_1_0->getOutput(0), c2, {3, 3}, 1, \"model.32.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv32_cv2_1_2 =\n            network->addConvolutionNd(*conv32_cv2_1_1->getOutput(0), 64, 
nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.32.cv2.1.2.weight\"], weightMap[\"model.32.cv2.1.2.bias\"]);\n    conv32_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv32_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    // out1 classes\n    auto* conv32_cv3_1_0_0 = DWConv(network, weightMap, *conv27->getOutput(0), get_width(512, gw, max_channels), {3, 3},\n                                    1, \"model.32.cv3.1.0.0\");\n    nvinfer1::IElementWiseLayer* conv32_cv3_1_0_1 =\n            convBnSiLU(network, weightMap, *conv32_cv3_1_0_0->getOutput(0), c3, {1, 1}, 1, \"model.32.cv3.1.0.1\");\n    auto* conv32_cv3_1_1_0 =\n            DWConv(network, weightMap, *conv32_cv3_1_0_1->getOutput(0), c3, {3, 3}, 1, \"model.32.cv3.1.1.0\");\n    nvinfer1::IElementWiseLayer* conv32_cv3_1_1_1 =\n            convBnSiLU(network, weightMap, *conv32_cv3_1_1_0->getOutput(0), c3, {1, 1}, 1, \"model.32.cv3.1.1.1\");\n    nvinfer1::IConvolutionLayer* conv32_cv3_1_1_2 =\n            network->addConvolutionNd(*conv32_cv3_1_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.32.cv3.1.2.weight\"], weightMap[\"model.32.cv3.1.2.bias\"]);\n    conv32_cv3_1_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    conv32_cv3_1_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n\n    nvinfer1::ITensor* inputTensors32_1[] = {conv32_cv2_1_2->getOutput(0), conv32_cv3_1_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat32_1 = network->addConcatenation(inputTensors32_1, 2);\n\n    // out2 location\n    nvinfer1::IElementWiseLayer* conv32_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv31->getOutput(0), c2, {3, 3}, 1, \"model.32.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv32_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv32_cv2_2_0->getOutput(0), c2, {3, 3}, 1, \"model.32.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv32_cv2_2_2 =\n            
network->addConvolutionNd(*conv32_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.32.cv2.2.2.weight\"], weightMap[\"model.32.cv2.2.2.bias\"]);\n    conv32_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv32_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    // out2 classes\n    auto* conv32_cv3_2_0_0 = DWConv(network, weightMap, *conv31->getOutput(0), get_width(1024, gw, max_channels),\n                                    {3, 3}, 1, \"model.32.cv3.2.0.0\");\n    nvinfer1::IElementWiseLayer* conv32_cv3_2_0_1 =\n            convBnSiLU(network, weightMap, *conv32_cv3_2_0_0->getOutput(0), c3, {1, 1}, 1, \"model.32.cv3.2.0.1\");\n    auto* conv32_cv3_2_1_0 =\n            DWConv(network, weightMap, *conv32_cv3_2_0_1->getOutput(0), c3, {3, 3}, 1, \"model.32.cv3.2.1.0\");\n    nvinfer1::IElementWiseLayer* conv32_cv3_2_1_1 =\n            convBnSiLU(network, weightMap, *conv32_cv3_2_1_0->getOutput(0), c3, {1, 1}, 1, \"model.32.cv3.2.1.1\");\n    nvinfer1::IConvolutionLayer* conv32_cv3_2_1_2 =\n            network->addConvolutionNd(*conv32_cv3_2_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.32.cv3.2.2.weight\"], weightMap[\"model.32.cv3.2.2.bias\"]);\n    conv32_cv3_2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv32_cv3_2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    nvinfer1::ITensor* inputTensor32_2[] = {conv32_cv2_2_2->getOutput(0), conv32_cv3_2_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat32_2 = network->addConcatenation(inputTensor32_2, 2);\n\n    // ============================================ yolov13  detect =========================================\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = 
sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle32_0 = network->addShuffle(*cat32_0->getOutput(0));\n    shuffle32_0->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split32_0_0 = network->addSlice(\n            *shuffle32_0->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split32_0_1 =\n            network->addSlice(*shuffle32_0->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])},\n                              nvinfer1::Dims3{1, 1, 1});\n\n    nvinfer1::IShuffleLayer* dfl32_0 =\n            DFL(network, weightMap, *split32_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.32.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor32_dfl_0[] = {dfl32_0->getOutput(0), split32_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat32_dfl_0 = network->addConcatenation(inputTensor32_dfl_0, 2);\n    cat32_dfl_0->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle32_1 = network->addShuffle(*cat32_1->getOutput(0));\n    shuffle32_1->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split32_1_0 = network->addSlice(\n            *shuffle32_1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split32_1_1 =\n            network->addSlice(*shuffle32_1->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[1]) * (kInputW / 
strides[1])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl32_1 =\n            DFL(network, weightMap, *split32_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.32.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor32_dfl_1[] = {dfl32_1->getOutput(0), split32_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat32_dfl_1 = network->addConcatenation(inputTensor32_dfl_1, 2);\n    cat32_dfl_1->setAxis(1);\n\n    nvinfer1::IShuffleLayer* shuffle32_2 = network->addShuffle(*cat32_2->getOutput(0));\n    shuffle32_2->setReshapeDimensions(\n            nvinfer1::Dims3{kBatchSize, 64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split32_2_0 = network->addSlice(\n            *shuffle32_2->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n            nvinfer1::Dims3{kBatchSize, 64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split32_2_1 =\n            network->addSlice(*shuffle32_2->getOutput(0), nvinfer1::Dims3{0, 64, 0},\n                              nvinfer1::Dims3{kBatchSize, kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::IShuffleLayer* dfl32_2 =\n            DFL(network, weightMap, *split32_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.32.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor32_dfl_2[] = {dfl32_2->getOutput(0), split32_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat32_dfl_2 = network->addConcatenation(inputTensor32_dfl_2, 2);\n    cat32_dfl_2->setAxis(1);\n    std::cout << \"There are \" << weightMap.size() << \" layer parameters in the network.\" << std::endl;\n\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat32_dfl_0, 
cat32_dfl_1, cat32_dfl_2},\n                         strides, stridesLength);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports INT8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(kBatchSize, kInputW, kInputH, kInputQuantizationFolder,\n                                                  \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    // buildSerializedNetwork returns nullptr on failure, so only report success when it is non-null.\n    if (serialized_model == nullptr) {\n        std::cerr << \"Failed to build engine!\" << std::endl;\n    } else {\n        std::cout << \"Engine built successfully!\" << std::endl;\n    }\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n"
  },
  {
    "path": "yolov13/src/postprocess.cpp",
    "content": "#include \"postprocess.h\"\n#include \"utils.h\"\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\nstatic float iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n            (std::max)(lbox[0], rbox[0]),\n            (std::min)(lbox[2], rbox[2]),\n            (std::max)(lbox[1], rbox[1]),\n            (std::min)(lbox[3], rbox[3]),\n    };\n\n    if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS = (interBox[1] - interBox[0]) * (interBox[3] - interBox[2]);\n    float unionBoxS = (lbox[2] - lbox[0]) * (lbox[3] - lbox[1]) + (rbox[2] - rbox[0]) * (rbox[3] - rbox[1]) - interBoxS;\n    return interBoxS / unionBoxS;\n}\n\nstatic bool cmp(const Detection& a, const Detection& b) {\n    if (a.conf == b.conf) {\n        return a.bbox[0] < b.bbox[0];\n    }\n    return a.conf > b.conf;\n}\n\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, 
std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n        if (output[1 + det_size * i + 4] <= conf_thresh || isnan(output[1 + det_size * i + 4]))\n            continue;\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        if (m.count(det.class_id) == 0)\n            m.emplace(det.class_id, std::vector<Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        // Loop index is named j so it does not shadow the per-class map m.\n        for (size_t j = 0; j < dets.size(); ++j) {\n            auto& item = dets[j];\n            res.push_back(item);\n            for (size_t n = j + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\nvoid batch_nms(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        nms(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n    }\n}\n\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count) {\n    Detection det;\n    for (int i = 0; i < count; i++) {\n        int basic_pos = 1 + i * bbox_element;\n        int keep_flag = decode_ptr_host[basic_pos + 6];\n        if (keep_flag == 1) {\n            det.bbox[0] = decode_ptr_host[basic_pos + 0];\n            det.bbox[1] = decode_ptr_host[basic_pos + 1];\n            det.bbox[2] = decode_ptr_host[basic_pos + 2];\n            det.bbox[3] = decode_ptr_host[basic_pos + 3];\n            det.conf = decode_ptr_host[basic_pos + 4];\n            det.class_id = 
decode_ptr_host[basic_pos + 5];\n            res.push_back(det);\n        }\n    }\n}\n\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch) {\n    res_batch.resize(batch_size);\n    int count = static_cast<int>(*decode_ptr_host);\n    count = std::min(count, kMaxNumOutputBbox);\n    for (int i = 0; i < batch_size; i++) {\n        auto& img = const_cast<cv::Mat&>(img_batch[i]);\n        process_decode_ptr_host(res_batch[i], &decode_ptr_host[i * count], bbox_element, img, count);\n    }\n}\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect(img, res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n    }\n}\n"
  },
  {
    "path": "yolov13/src/postprocess.cu",
    "content": "//\n// Created by lindsay on 23-7-17.\n//\n#include \"postprocess.h\"\n#include \"types.h\"\n\nstatic __global__ void decode_kernel(float* predict, int num_bboxes, float confidence_threshold, float* parray,\n                                     int max_objects) {\n    float count = predict[0];\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    if (position >= count)\n        return;\n\n    float* pitem = predict + 1 + position * (sizeof(Detection) / sizeof(float));\n    int index = atomicAdd(parray, 1);\n    char* pout_item_char = (char*)parray + sizeof(float) + index * bbox_element * sizeof(float);\n    float* pout_item = (float*)pout_item_char;\n    // Wait, let's look at how parray is used.\n    // In original code:\n    // float* pout_item = parray + 1 + index * bbox_element;\n    // But parray[0] is count. So parray + 1 is start of data.\n    // Ensure this matches usage in nms_kernel.\n\n    if (index >= max_objects)\n        return;\n\n    float confidence = pitem[4];\n    if (confidence < confidence_threshold)\n        return;\n\n    float left = pitem[0];\n    float top = pitem[1];\n    float right = pitem[2];\n    float bottom = pitem[3];\n    float label = pitem[5];\n\n    // Re-verify pointer arithmetic.\n    // parray is float*. 
1 is float size.\n    // index * bbox_element is float offset.\n    float* out_ptr = parray + 1 + index * bbox_element;\n\n    *out_ptr++ = left;\n    *out_ptr++ = top;\n    *out_ptr++ = right;\n    *out_ptr++ = bottom;\n    *out_ptr++ = confidence;\n    *out_ptr++ = label;\n    *out_ptr++ = 1;  // 1 = keep, 0 = ignore\n}\n\nstatic __device__ float box_iou(float aleft, float atop, float aright, float abottom, float bleft, float btop,\n                                float bright, float bbottom) {\n    float cleft = max(aleft, bleft);\n    float ctop = max(atop, btop);\n    float cright = min(aright, bright);\n    float cbottom = min(abottom, bbottom);\n    float c_area = max(cright - cleft, 0.0f) * max(cbottom - ctop, 0.0f);\n    if (c_area == 0.0f)\n        return 0.0f;\n\n    float a_area = max(0.0f, aright - aleft) * max(0.0f, abottom - atop);\n    float b_area = max(0.0f, bright - bleft) * max(0.0f, bbottom - btop);\n    return c_area / (a_area + b_area - c_area);\n}\n\nstatic __global__ void nms_kernel(float* bboxes, int max_objects, float threshold) {\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    int count = min((int)*bboxes, max_objects);\n    if (position >= count)\n        return;\n\n    float* pcurrent = bboxes + 1 + position * bbox_element;\n    for (int i = 0; i < count; ++i) {\n        float* pitem = bboxes + 1 + i * bbox_element;\n        if (i == position || pcurrent[5] != pitem[5])\n            continue;\n        if (pitem[4] >= pcurrent[4]) {\n            if (pitem[4] == pcurrent[4] && i < position)\n                continue;\n            float iou =\n                    box_iou(pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3], pitem[0], pitem[1], pitem[2], pitem[3]);\n            if (iou > threshold) {\n                pcurrent[6] = 0;\n                return;\n            }\n        }\n    }\n}\n\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 
cudaStream_t stream) {\n    int block = 256;\n    int grid = ceil(num_bboxes / (float)block);\n    decode_kernel<<<grid, block, 0, stream>>>((float*)predict, num_bboxes, confidence_threshold, parray, max_objects);\n}\n\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream) {\n    int block = max_objects < 256 ? max_objects : 256;\n    int grid = ceil(max_objects / (float)block);\n    nms_kernel<<<grid, block, 0, stream>>>(parray, max_objects, nms_threshold);\n}\n"
  },
  {
    "path": "yolov13/src/preprocess.cu",
    "content": "#include \"cuda_utils.h\"\n#include \"preprocess.h\"\n\nstatic uint8_t* img_buffer_host = nullptr;\nstatic uint8_t* img_buffer_device = nullptr;\n\n__global__ void warpaffine_kernel(uint8_t* src, int src_line_size, int src_width, int src_height, float* dst,\n                                  int dst_width, int dst_height, uint8_t const_value_st, AffineMatrix d2s, int edge) {\n    int position = blockDim.x * blockIdx.x + threadIdx.x;\n    if (position >= edge)\n        return;\n\n    float m_x1 = d2s.value[0];\n    float m_y1 = d2s.value[1];\n    float m_z1 = d2s.value[2];\n    float m_x2 = d2s.value[3];\n    float m_y2 = d2s.value[4];\n    float m_z2 = d2s.value[5];\n\n    int dx = position % dst_width;\n    int dy = position / dst_width;\n    float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;\n    float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;\n    float c0, c1, c2;\n\n    if (src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height) {\n        // out of range\n        c0 = const_value_st;\n        c1 = const_value_st;\n        c2 = const_value_st;\n    } else {\n        int y_low = floorf(src_y);\n        int x_low = floorf(src_x);\n        int y_high = y_low + 1;\n        int x_high = x_low + 1;\n\n        uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};\n        float ly = src_y - y_low;\n        float lx = src_x - x_low;\n        float hy = 1 - ly;\n        float hx = 1 - lx;\n        float w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n        uint8_t* v1 = const_value;\n        uint8_t* v2 = const_value;\n        uint8_t* v3 = const_value;\n        uint8_t* v4 = const_value;\n\n        if (y_low >= 0) {\n            if (x_low >= 0)\n                v1 = src + y_low * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v2 = src + y_low * src_line_size + x_high * 3;\n        }\n\n        if (y_high < src_height) {\n            if (x_low >= 0)\n                
v3 = src + y_high * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v4 = src + y_high * src_line_size + x_high * 3;\n        }\n\n        c0 = w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0];\n        c1 = w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1];\n        c2 = w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2];\n    }\n\n    // bgr to rgb\n    float t = c2;\n    c2 = c0;\n    c0 = t;\n\n    // normalization\n    c0 = c0 / 255.0f;\n    c1 = c1 / 255.0f;\n    c2 = c2 / 255.0f;\n\n    // rgbrgbrgb to rrrgggbbb\n    int area = dst_width * dst_height;\n    float* pdst_c0 = dst + dy * dst_width + dx;\n    float* pdst_c1 = pdst_c0 + area;\n    float* pdst_c2 = pdst_c1 + area;\n    *pdst_c0 = c0;\n    *pdst_c1 = c1;\n    *pdst_c2 = c2;\n}\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream) {\n    int img_size = src_width * src_height * 3;\n    // copy data to pinned memory\n    memcpy(img_buffer_host, src, img_size);\n    // copy data to device memory\n    CUDA_CHECK(cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size, cudaMemcpyHostToDevice, stream));\n\n    AffineMatrix s2d, d2s;\n    float scale = std::min(dst_height / (float)src_height, dst_width / (float)src_width);\n\n    s2d.value[0] = scale;\n    s2d.value[1] = 0;\n    s2d.value[2] = -scale * src_width * 0.5 + dst_width * 0.5;\n    s2d.value[3] = 0;\n    s2d.value[4] = scale;\n    s2d.value[5] = -scale * src_height * 0.5 + dst_height * 0.5;\n    cv::Mat m2x3_s2d(2, 3, CV_32F, s2d.value);\n    cv::Mat m2x3_d2s(2, 3, CV_32F, d2s.value);\n    cv::invertAffineTransform(m2x3_s2d, m2x3_d2s);\n\n    memcpy(d2s.value, m2x3_d2s.ptr<float>(0), sizeof(d2s.value));\n\n    int jobs = dst_height * dst_width;\n    int threads = 256;\n    int blocks = ceil(jobs / (float)threads);\n    warpaffine_kernel<<<blocks, threads, 0, stream>>>(img_buffer_device, src_width * 3, 
src_width, src_height, dst,\n                                                      dst_width, dst_height, 128, d2s, jobs);\n}\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream) {\n    int dst_size = dst_width * dst_height * 3;\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        cuda_preprocess(img_batch[i].ptr(), img_batch[i].cols, img_batch[i].rows, &dst[dst_size * i], dst_width,\n                        dst_height, stream);\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n    }\n}\n\nvoid cuda_preprocess_init(int max_image_size) {\n    // prepare input data in pinned memory\n    CUDA_CHECK(cudaMallocHost((void**)&img_buffer_host, max_image_size * 3));\n    // prepare input data in device memory\n    CUDA_CHECK(cudaMalloc((void**)&img_buffer_device, max_image_size * 3));\n}\n\nvoid cuda_preprocess_destroy() {\n    CUDA_CHECK(cudaFree(img_buffer_device));\n    CUDA_CHECK(cudaFreeHost(img_buffer_host));\n}\n"
  },
  {
    "path": "yolov13/yolov13_det.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#if defined(_WIN32)\n#include <direct.h>\n#include <io.h>\n#include <windows.h>\n#else\n#include <sys/stat.h>\n#include <unistd.h>\n#endif\n#include <climits>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nstatic std::string get_executable_dir() {\n#if defined(_WIN32)\n    char buf[MAX_PATH];\n    DWORD len = GetModuleFileNameA(NULL, buf, MAX_PATH);\n    if (len == 0 || len == MAX_PATH)\n        return std::string(\".\");\n    std::string path(buf, buf + len);\n    size_t pos = path.find_last_of(\"\\\\/\");\n    if (pos != std::string::npos)\n        return path.substr(0, pos);\n    return std::string(\".\");\n#else\n    char buf[PATH_MAX];\n    ssize_t len = readlink(\"/proc/self/exe\", buf, sizeof(buf) - 1);\n    if (len == -1)\n        return std::string(\".\");\n    buf[len] = '\\0';\n    std::string path(buf);\n    size_t pos = path.find_last_of('/');\n    if (pos != std::string::npos)\n        return path.substr(0, pos);\n    return std::string(\".\");\n#endif\n}\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, float& gd, float& gw, int& max_channels,\n                      std::string& type) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    serialized_engine = buildEngineYolov13Det(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels, type);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const 
char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host, float** decode_ptr_host, float** decode_ptr_device,\n                    std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    if (cuda_post_process == \"c\") {\n        
*output_buffer_host = new float[kBatchSize * kOutputSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"Do not yet support GPU post processing for multiple batches\" << std::endl;\n            exit(0);\n        }\n        // Allocate memory for decode_ptr_host and copy to device\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           float* decode_ptr_host, float* decode_ptr_device, int model_bboxes, std::string cuda_post_process) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueueV2(buffers, stream, nullptr);\n    if (cuda_post_process == \"c\") {\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        
auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir, std::string& type,\n                std::string& cuda_post_process, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        auto sub_type = std::string(argv[4]);\n\n        if (sub_type[0] == 'n') {\n            gd = 0.50;\n            gw = 0.25;\n            max_channels = 1024;\n            type = \"n\";\n        } else if (sub_type[0] == 's') {\n            gd = 0.50;\n            gw = 0.50;\n            max_channels = 1024;\n            type = \"s\";\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n            type = \"l\";\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.50;\n            max_channels = 512;\n            type = \"x\";\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n    std::string wts_name;\n    std::string engine_name;\n    std::string img_dir;\n    std::string cuda_post_process;\n    std::string type;\n    int model_bboxes;\n    float gd = 0, gw = 0;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, type, 
cuda_post_process, gd, gw, max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolov13-det -s [.wts] [.engine] [n/s/l/x]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr << \"./yolov13-det -d [.engine] ../images  [c/g]// deserialize plan file and run inference\"\n                  << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, gd, gw, max_channels, type);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host, &decode_ptr_host,\n                   &decode_ptr_device, cuda_post_process);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = 
cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize, decode_ptr_host,\n              decode_ptr_device, model_bboxes, cuda_post_process);\n        // Save the first 100 values of output_buffer_host, one per line\n        //        std::ofstream out(\"../models/output.txt\");\n        //        for (int j = 0; j < 100; j++) {\n        //            out << output_buffer_host[j] << std::endl;\n        //        }\n        //        out.close();\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n        } else if (cuda_post_process == \"g\") {\n            //Process gpu decode and nms results\n            batch_process(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\n        }\n        // Draw bounding boxes\n        draw_bbox(img_batch, res_batch);\n#if 0\n        // legacy: save under a \"build\" subfolder of the working directory\n        const std::string out_dir = \"build\";\n#else\n        // Save results to the directory where the executable resides\n        const std::string exe_dir = get_executable_dir();\n        const std::string out_dir = exe_dir;\n#endif\n#if defined(_WIN32)\n        if (_access(out_dir.c_str(), 0) != 0) {\n            if (_mkdir(out_dir.c_str()) != 0) {\n                std::cerr << \"Warning: create directory failed: \" << out_dir << std::endl;\n            }\n        }\n#else\n        if (access(out_dir.c_str(), F_OK) != 0) {\n            if (mkdir(out_dir.c_str(), 0755) != 0) {\n                std::cerr << \"Warning: 
create directory failed: \" << out_dir << std::endl;\n            }\n        }\n#endif\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            std::string out_path = out_dir + \"/_\" + img_name_batch[j];\n            if (cv::imwrite(out_path, img_batch[j])) {\n                std::cout << \"Saved: \" << out_path << std::endl;\n            } else {\n                std::cerr << \"Failed to save: \" << out_path << std::endl;\n            }\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov13/yolov13_det_trt.py",
    "content": "\"\"\"\r\nAn example that uses TensorRT's Python api to make inferences.\r\n\"\"\"\r\nimport ctypes\r\nimport os\r\nimport shutil\r\nimport random\r\nimport sys\r\nimport threading\r\nimport time\r\nimport cv2\r\nimport numpy as np\r\nimport pycuda.autoinit  # noqa: F401\r\nimport pycuda.driver as cuda\r\nimport tensorrt as trt\r\n\r\nCONF_THRESH = 0.5\r\nIOU_THRESHOLD = 0.4\r\nDET_NUM = 6\r\n\r\n\r\ndef get_img_path_batches(batch_size, img_dir):\r\n    ret = []\r\n    batch = []\r\n    for root, dirs, files in os.walk(img_dir):\r\n        for name in files:\r\n            if len(batch) == batch_size:\r\n                ret.append(batch)\r\n                batch = []\r\n            batch.append(os.path.join(root, name))\r\n    if len(batch) > 0:\r\n        ret.append(batch)\r\n    return ret\r\n\r\n\r\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\r\n    \"\"\"\r\n    description: Plots one bounding box on image img,\r\n                 this function comes from YoLov13 project.\r\n    param:\r\n        x:      a box likes [x1,y1,x2,y2]\r\n        img:    a opencv image object\r\n        color:  color to draw rectangle, such as (0,255,0)\r\n        label:  str\r\n        line_thickness: int\r\n    return:\r\n        no return\r\n\r\n    \"\"\"\r\n    tl = (\r\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\r\n    )  # line/font thickness\r\n    color = color or [random.randint(0, 255) for _ in range(3)]\r\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\r\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\r\n    if label:\r\n        tf = max(tl - 1, 1)  # font thickness\r\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\r\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\r\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\r\n        cv2.putText(\r\n            img,\r\n            label,\r\n            
(c1[0], c1[1] - 2),\r\n            0,\r\n            tl / 3,\r\n            [225, 255, 255],\r\n            thickness=tf,\r\n            lineType=cv2.LINE_AA,\r\n        )\r\n\r\n\r\nclass YoLov13TRT(object):\r\n    \"\"\"\r\n    description: A YOLOv13 class that wraps TensorRT inference, preprocessing and postprocessing.\r\n    \"\"\"\r\n\r\n    def __init__(self, engine_file_path):\r\n        # Create a CUDA context on device 0\r\n        self.ctx = cuda.Device(0).make_context()\r\n        stream = cuda.Stream()\r\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\r\n        runtime = trt.Runtime(TRT_LOGGER)\r\n\r\n        # Deserialize the engine from file\r\n        with open(engine_file_path, \"rb\") as f:\r\n            engine = runtime.deserialize_cuda_engine(f.read())\r\n        context = engine.create_execution_context()\r\n\r\n        host_inputs = []\r\n        cuda_inputs = []\r\n        host_outputs = []\r\n        cuda_outputs = []\r\n        bindings = []\r\n\r\n        for binding in engine:\r\n            print('binding:', binding, engine.get_binding_shape(binding))\r\n            self.batch_size = engine.get_binding_shape(binding)[0]\r\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\r\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\r\n            # Allocate host and device buffers\r\n            host_mem = cuda.pagelocked_empty(size, dtype)\r\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\r\n            # Append the device buffer to device bindings.\r\n            bindings.append(int(cuda_mem))\r\n            # Append to the appropriate list.\r\n            if engine.binding_is_input(binding):\r\n                self.input_w = engine.get_binding_shape(binding)[-1]\r\n                self.input_h = engine.get_binding_shape(binding)[-2]\r\n                host_inputs.append(host_mem)\r\n                cuda_inputs.append(cuda_mem)\r\n            else:\r\n                host_outputs.append(host_mem)\r\n                cuda_outputs.append(cuda_mem)\r\n\r\n        # Store state for use in infer()\r\n        self.stream = stream\r\n        self.context = context\r\n        self.engine = engine\r\n        self.host_inputs = host_inputs\r\n        self.cuda_inputs = cuda_inputs\r\n        self.host_outputs = host_outputs\r\n        self.cuda_outputs = cuda_outputs\r\n        self.bindings = bindings\r\n        self.det_output_length = host_outputs[0].shape[0]\r\n\r\n    def infer(self, raw_image_generator):\r\n        # Make self.ctx the active context, pushing it on top of the context stack.\r\n        self.ctx.push()\r\n        # Restore the state saved in __init__\r\n        stream = self.stream\r\n        context = self.context\r\n        host_inputs = self.host_inputs\r\n        cuda_inputs = self.cuda_inputs\r\n        host_outputs = self.host_outputs\r\n        cuda_outputs = self.cuda_outputs\r\n        bindings = self.bindings\r\n        # Do image preprocess\r\n        batch_image_raw = []\r\n        batch_origin_h = []\r\n        batch_origin_w = []\r\n        # Zero-filled so that unused slots of a short final batch hold defined values\r\n        batch_input_image = np.zeros(shape=[self.batch_size, 3, self.input_h, self.input_w])\r\n        for i, image_raw in enumerate(raw_image_generator):\r\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\r\n            batch_image_raw.append(image_raw)\r\n            batch_origin_h.append(origin_h)\r\n            batch_origin_w.append(origin_w)\r\n            np.copyto(batch_input_image[i], input_image)\r\n        batch_input_image = np.ascontiguousarray(batch_input_image)\r\n\r\n        # Copy input image to host buffer\r\n        np.copyto(host_inputs[0], batch_input_image.ravel())\r\n        start = time.time()\r\n        # Transfer input data to the GPU.\r\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\r\n        # Run inference.\r\n        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\r\n  
      # Transfer predictions back from the GPU.\r\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\r\n        # Synchronize the stream\r\n        stream.synchronize()\r\n        end = time.time()\r\n        # Remove any context from the top of the context stack, deactivating it.\r\n        self.ctx.pop()\r\n        # Flat host output buffer; it is sliced per image below\r\n        output = host_outputs[0]\r\n        # print(\"output: \", output[400:500])\r\n        # Do postprocess on the images actually supplied; the last batch may be shorter than batch_size\r\n        for i in range(len(batch_image_raw)):\r\n            result_boxes, result_scores, result_classid = self.post_process(\r\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\r\n                batch_origin_w[i]\r\n            )\r\n            # Draw rectangles and labels on the original image\r\n            for j in range(len(result_boxes)):\r\n                box = result_boxes[j]\r\n                plot_one_box(\r\n                    box,\r\n                    batch_image_raw[i],\r\n                    label=\"{}:{:.2f}\".format(\r\n                        categories[int(result_classid[j])], result_scores[j]\r\n                    ),\r\n                )\r\n        return batch_image_raw, end - start\r\n\r\n    def destroy(self):\r\n        # Remove any context from the top of the context stack, deactivating it.\r\n        self.ctx.pop()\r\n\r\n    def get_raw_image(self, image_path_batch):\r\n        \"\"\"\r\n        description: Lazily read each image of the batch from its path\r\n        \"\"\"\r\n        for img_path in image_path_batch:\r\n            yield cv2.imread(img_path)\r\n\r\n    def get_raw_image_zeros(self, image_path_batch=None):\r\n        \"\"\"\r\n        description: Yield all-zero dummy images for warm-up\r\n        \"\"\"\r\n        for _ in range(self.batch_size):\r\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\r\n\r\n    def preprocess_image(self, raw_bgr_image):\r\n        \"\"\"\r\n        description: Convert BGR image to RGB,\r\n                     resize and pad it to target size, normalize to [0,1],\r\n                     transform to NCHW format.\r\n        param:\r\n            raw_bgr_image: BGR image as an HxWx3 array, e.g. the result of cv2.imread\r\n        return:\r\n            image:  the processed image, NCHW float32 in [0,1]\r\n            image_raw: the original image\r\n            h: original height\r\n            w: original width\r\n        \"\"\"\r\n        image_raw = raw_bgr_image\r\n        h, w, c = image_raw.shape\r\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\r\n        # Compute the resized width and height and the padding on each side\r\n        r_w = self.input_w / w\r\n        r_h = self.input_h / h\r\n        if r_h > r_w:\r\n            tw = self.input_w\r\n            th = int(r_w * h)\r\n            tx1 = tx2 = 0\r\n            ty1 = int((self.input_h - th) / 2)\r\n            ty2 = self.input_h - th - ty1\r\n        else:\r\n            tw = int(r_h * w)\r\n            th = self.input_h\r\n            tx1 = int((self.input_w - tw) / 2)\r\n            tx2 = self.input_w - tw - tx1\r\n            ty1 = ty2 = 0\r\n        # Resize so the longer side fits the input while keeping the aspect ratio\r\n        image = cv2.resize(image, (tw, th))\r\n        # Pad the short side with (128,128,128)\r\n        image = cv2.copyMakeBorder(\r\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\r\n        )\r\n        image = image.astype(np.float32)\r\n        # Normalize to [0,1]\r\n        image /= 255.0\r\n        # HWC to CHW format:\r\n        image = np.transpose(image, [2, 0, 1])\r\n        # CHW to NCHW format\r\n        image = np.expand_dims(image, axis=0)\r\n        # Convert the image to row-major order, also known as \"C order\":\r\n        image = np.ascontiguousarray(image)\r\n        return image, image_raw, h, w\r\n\r\n    def xywh2xyxy(self, origin_h, origin_w, x):\r\n        \"\"\"\r\n        description:    Map nx4 boxes from letterboxed network-input coordinates back to the original image.\r\n                        Despite the name, the arithmetic below treats each row as corners\r\n                        [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\r\n        param:\r\n            origin_h:   height of original image\r\n            origin_w:   width of original image\r\n            x:          an nx4 numpy array of boxes in network-input coordinates\r\n        return:\r\n            y:          an nx4 numpy array, each row is a box [x1, y1, x2, y2] in original-image coordinates\r\n        \"\"\"\r\n        y = np.zeros_like(x)\r\n        r_w = self.input_w / origin_w\r\n        r_h = self.input_h / origin_h\r\n        if r_h > r_w:\r\n            y[:, 0] = x[:, 0]\r\n            y[:, 2] = x[:, 2]\r\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\r\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\r\n            y /= r_w\r\n        else:\r\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\r\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\r\n            y[:, 1] = x[:, 1]\r\n            y[:, 3] = x[:, 3]\r\n            y /= r_h\r\n\r\n        return y\r\n\r\n    def post_process(self, output, origin_h, origin_w):\r\n        \"\"\"\r\n        description: postprocess the prediction\r\n        param:\r\n            output:     a flat numpy array: num_boxes first, then DET_NUM values per detection\r\n                        (four box coordinates, conf, cls_id)\r\n            origin_h:   height of original image\r\n            origin_w:   width of original image\r\n        return:\r\n            result_boxes: final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\r\n            result_scores: final scores, a numpy array, each element is the score corresponding to a box\r\n            result_classid: final class ids, a numpy array, each element is the class id corresponding to a box\r\n        \"\"\"\r\n        num_values_per_detection = DET_NUM\r\n        # Get the number of boxes detected\r\n        num = int(output[0])\r\n        print(\"There are {} detections in the picture!!!\".format(num))\r\n        # Reshape to a two-dimensional ndarray\r\n        # pred = 
np.reshape(output[1:], (-1, 38))[:num, :]\r\n        pred = np.reshape(output[1:], (-1, num_values_per_detection))[:num, :]\r\n        # Do nms\r\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\r\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\r\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\r\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\r\n        return result_boxes, result_scores, result_classid\r\n\r\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\r\n        \"\"\"\r\n        description: compute the IoU of two bounding boxes\r\n        param:\r\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\r\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\r\n            x1y1x2y2: select the coordinate format\r\n        return:\r\n            iou: computed iou\r\n        \"\"\"\r\n        if not x1y1x2y2:\r\n            # Transform from center and width to exact coordinates\r\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\r\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\r\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\r\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\r\n        else:\r\n            # Get the coordinates of bounding boxes\r\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\r\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\r\n\r\n        # Get the coordinates of the intersection rectangle\r\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\r\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\r\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\r\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\r\n        # Intersection area\r\n        
inter_area = (np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None)\r\n                      * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None))\r\n        # Areas of both boxes, used for the union\r\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\r\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\r\n\r\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\r\n\r\n        return iou\r\n\r\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\r\n        \"\"\"\r\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\r\n        Non-Maximum Suppression to further filter detections.\r\n        param:\r\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\r\n            origin_h: original image height\r\n            origin_w: original image width\r\n            conf_thres: a confidence threshold to filter detections\r\n            nms_thres: an IoU threshold to filter detections\r\n        return:\r\n            boxes: detections kept after NMS, each row is (x1, y1, x2, y2, conf, cls_id)\r\n        \"\"\"\r\n        # Keep the boxes whose score >= conf_thres\r\n        boxes = prediction[prediction[:, 4] >= conf_thres]\r\n        # Rescale boxes from network-input coordinates to original-image coordinates\r\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\r\n        # Clip the coordinates to the image bounds\r\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\r\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\r\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\r\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\r\n        # Object confidence\r\n        confs = boxes[:, 4]\r\n        # Sort by confidence, highest first\r\n        boxes = boxes[np.argsort(-confs)]\r\n        # Perform non-maximum suppression\r\n        keep_boxes = []\r\n        while boxes.shape[0]:\r\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, 
:4], 0), boxes[:, :4]) > nms_thres\r\n            label_match = boxes[0, -1] == boxes[:, -1]\r\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\r\n            invalid = large_overlap & label_match\r\n            keep_boxes += [boxes[0]]\r\n            boxes = boxes[~invalid]\r\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\r\n        return boxes\r\n\r\n\r\nclass inferThread(threading.Thread):\r\n    def __init__(self, yolov13_wrapper, image_path_batch):\r\n        threading.Thread.__init__(self)\r\n        self.yolov13_wrapper = yolov13_wrapper\r\n        self.image_path_batch = image_path_batch\r\n\r\n    def run(self):\r\n        batch_image_raw, use_time = self.yolov13_wrapper.infer(\r\n            self.yolov13_wrapper.get_raw_image(self.image_path_batch))\r\n        for i, img_path in enumerate(self.image_path_batch):\r\n            parent, filename = os.path.split(img_path)\r\n            save_name = os.path.join('output', filename)\r\n            # Save image\r\n            cv2.imwrite(save_name, batch_image_raw[i])\r\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\r\n\r\n\r\nclass warmUpThread(threading.Thread):\r\n    def __init__(self, yolov13_wrapper):\r\n        threading.Thread.__init__(self)\r\n        self.yolov13_wrapper = yolov13_wrapper\r\n\r\n    def run(self):\r\n        batch_image_raw, use_time = self.yolov13_wrapper.infer(self.yolov13_wrapper.get_raw_image_zeros())\r\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    # load custom plugin and engine\r\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\r\n    engine_file_path = \"build/yolov13n-det.engine\"\r\n    # engine_file_path = \"build/yolov13n-det-int8.engine\"\r\n\r\n    if len(sys.argv) > 1:\r\n        engine_file_path = sys.argv[1]\r\n    if len(sys.argv) > 
2:\r\n        PLUGIN_LIBRARY = sys.argv[2]\r\n\r\n    ctypes.CDLL(PLUGIN_LIBRARY)\r\n\r\n    # load coco labels\r\n    # categories = [\"object\"]\r\n\r\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\r\n                  \"traffic light\",\r\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\r\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\r\n                  \"frisbee\",\r\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\r\n                  \"surfboard\",\r\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\r\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\r\n                  \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\r\n                  \"cell phone\",\r\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\r\n                  \"teddy bear\",\r\n                  \"hair drier\", \"toothbrush\"]\r\n\r\n    if os.path.exists('output/'):\r\n        shutil.rmtree('output/')\r\n    os.makedirs('output/')\r\n    # a YoLov13TRT instance\r\n    yolov13_wrapper = YoLov13TRT(engine_file_path)\r\n    try:\r\n        print('batch size is', yolov13_wrapper.batch_size)\r\n\r\n        image_dir = \"images\"\r\n        image_path_batches = get_img_path_batches(yolov13_wrapper.batch_size, image_dir)\r\n\r\n        for i in range(10):\r\n            # create a new thread to do warm_up\r\n            thread1 = 
warmUpThread(yolov13_wrapper)\r\n            thread1.start()\r\n            thread1.join()\r\n        for batch in image_path_batches:\r\n            # create a new thread to do inference\r\n            thread1 = inferThread(yolov13_wrapper, batch)\r\n            thread1.start()\r\n            thread1.join()\r\n    finally:\r\n        # destroy the instance\r\n        yolov13_wrapper.destroy()\r\n"
  },
  {
    "path": "yolov3/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(yolov3)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\n#cuda_add_library(leaky ${PROJECT_SOURCE_DIR}/leaky.cu)\ncuda_add_library(yololayer SHARED ${PROJECT_SOURCE_DIR}/yololayer.cu)\ntarget_link_libraries(yololayer nvinfer cudart ${OpenCV_LIBS})\n\nadd_executable(yolov3 ${PROJECT_SOURCE_DIR}/calibrator.cpp ${PROJECT_SOURCE_DIR}/yolov3.cpp)\ntarget_link_libraries(yolov3 nvinfer)\ntarget_link_libraries(yolov3 cudart)\ntarget_link_libraries(yolov3 yololayer)\ntarget_link_libraries(yolov3 ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "yolov3/README.md",
    "content": "# yolov3\n\nThe Pytorch implementation is [ultralytics/yolov3 archive branch](https://github.com/ultralytics/yolov3/tree/archive). It provides two trained weights of yolov3, `yolov3.weights` and `yolov3.pt`\n\nThis branch is using tensorrt7 API, there is also a yolov3 implementation using tensorrt4 API, go to [branch trt4/yolov3](https://github.com/wang-xinyu/tensorrtx/tree/trt4/yolov3), which is using [ayooshkathuria/pytorch-yolo-v3](https://github.com/ayooshkathuria/pytorch-yolo-v3).\n\n## Config\n\n- Input shape defined in yololayer.h\n- Number of classes defined in yololayer.h\n- INT8/FP16/FP32 can be selected by the macro in yolov3.cpp\n- GPU id can be selected by the macro in yolov3.cpp\n- NMS thresh in yolov3.cpp\n- BBox confidence thresh in yolov3.cpp\n\n## How to run\n\n1. generate yolov3.wts from pytorch implementation with yolov3.cfg and yolov3.weights, or download .wts from model zoo\n\n```\ngit clone https://github.com/wang-xinyu/tensorrtx.git\ngit clone -b archive https://github.com/ultralytics/yolov3.git\n// download its weights 'yolov3.pt' or 'yolov3.weights'\ncp {tensorrtx}/yolov3/gen_wts.py {ultralytics/yolov3/}\ncd {ultralytics/yolov3/}\npython gen_wts.py yolov3.weights\n// a file 'yolov3.wts' will be generated.\n// the master branch of yolov3 should work, if not, you can checkout cf7a4d31d37788023a9186a1a143a2dab0275ead\n```\n\n2. put yolov3.wts into tensorrtx/yolov3, build and run\n\n```\nmv yolov3.wts {tensorrtx}/yolov3/\ncd {tensorrtx}/yolov3\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./yolov3 -s                          // serialize model to plan file i.e. 'yolov3.engine'\nsudo ./yolov3 -d ../../yolov3-spp/samples // deserialize plan file and run inference, the images in samples will be processed.\n```\n\n3. check the images generated, as follows. _zidane.jpg and _bus.jpg\n\n# INT8 Quantization\n\n1. Prepare calibration images, you can randomly select 1000s images from your train set. 
For COCO, you can also download my calibration images `coco_calib` from [GoogleDrive](https://drive.google.com/drive/folders/1s7jE9DtOngZMzJC1uL307J2MiaGwdRSI?usp=sharing) or [BaiduPan](https://pan.baidu.com/s/1GOm_-JobpyLMAqZWCDUhKg) (pwd: a9wh).\n\n2. unzip it in yolov3/build\n\n3. set the macro `USE_INT8` in yolov3.cpp and run make again\n\n4. serialize the model and test\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/78247927-4d9fac00-751e-11ea-8b1b-704a0aeb3fcf.jpg\">\n</p>\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/78247970-60b27c00-751e-11ea-88df-41473fed4823.jpg\">\n</p>\n\n## More Information\n\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx).\n\n"
  },
  {
    "path": "yolov3/calibrator.cpp",
    "content": "#include <iostream>\n#include <iterator>\n#include <fstream>\n#include <opencv2/dnn/dnn.hpp>\n#include \"calibrator.h\"\n#include \"cuda_runtime_api.h\"\n#include \"utils.h\"\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache)\n    : batchsize_(batchsize)\n    , input_w_(input_w)\n    , input_h_(input_h)\n    , img_idx_(0)\n    , img_dir_(img_dir)\n    , calib_table_name_(calib_table_name)\n    , input_blob_name_(input_blob_name)\n    , read_cache_(read_cache)\n{\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2()\n{\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT\n{\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT\n{\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + img_files_[i]);\n        if (temp.empty()){\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(pr_img);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0), true, false);\n\n    CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], 
input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT\n{\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good())\n    {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT\n{\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n\n"
  },
  {
    "path": "yolov3/calibrator.h",
    "content": "#ifndef ENTROPY_CALIBRATOR_H\n#define ENTROPY_CALIBRATOR_H\n\n#include \"NvInfer.h\"\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2\n{\npublic:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache = true);\n\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n    \nprivate:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\n#endif // ENTROPY_CALIBRATOR_H\n"
  },
  {
    "path": "yolov3/gen_wts.py",
    "content": "import struct\nimport sys\nimport torch\nfrom models import *  # noqa: F403\nfrom utils.utils import *  # noqa: F403\n\nmodel = Darknet('cfg/yolov3.cfg', (608, 608))  # noqa: F405\nweights = sys.argv[1]\ndevice = torch_utils.select_device('0')  # noqa: F405\nif weights.endswith('.pt'):  # pytorch format\n    model.load_state_dict(torch.load(weights, map_location=device, weights_only=False)['model'])\nelse:  # darknet format\n    load_darknet_weights(model, weights)  # noqa: F405\nmodel = model.eval()\n\nwith open('yolov3.wts', 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolov3/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#else\n#define TRT_NOEXCEPT\n#endif\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on 
success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! 
\\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  
Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! 
- Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! 
\\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! 
\\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   
TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! \\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! \\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! 
\\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov3/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "yolov3/utils.h",
    "content": "#ifndef __TRT_UTILS_H_\n#define __TRT_UTILS_H_\n\n#include <iostream>\n#include <vector>\n#include <algorithm>\n#include <cudnn.h>\n#include <dirent.h>\n#include <opencv2/opencv.hpp>\n\n#ifndef CUDA_CHECK\n\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n\n#endif\n\nnamespace Tn\n{\n    template<typename T> \n    void write(char*& buffer, const T& val)\n    {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> \n    void read(const char*& buffer, T& val)\n    {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n}\n\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols*1.0);\n    float r_h = input_h / (img.rows*1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nstatic inline int 
read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n#endif\n"
  },
  {
    "path": "yolov3/yololayer.cu",
    "content": "#include \"yololayer.h\"\n#include \"utils.h\"\n#include <assert.h>\n\nusing namespace Yolo;\n\nnamespace nvinfer1\n{\n    YoloLayerPlugin::YoloLayerPlugin()\n    {\n        mClassCount = CLASS_NUM;\n        mYoloKernel.clear();\n        mYoloKernel.push_back(yolo1);\n        mYoloKernel.push_back(yolo2);\n        mYoloKernel.push_back(yolo3);\n\n        mKernelCount = mYoloKernel.size();\n    }\n    \n    YoloLayerPlugin::~YoloLayerPlugin()\n    {\n    }\n\n    // create the plugin at runtime from a byte stream\n    YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length)\n    {\n        using namespace Tn;\n        const char *d = reinterpret_cast<const char *>(data), *a = d;\n        read(d, mClassCount);\n        read(d, mThreadCount);\n        read(d, mKernelCount);\n        mYoloKernel.resize(mKernelCount);\n        auto kernelSize = mKernelCount*sizeof(YoloKernel);\n        memcpy(mYoloKernel.data(),d,kernelSize);\n        d += kernelSize;\n\n        assert(d == a + length);\n    }\n\n    void YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT\n    {\n        using namespace Tn;\n        char* d = static_cast<char*>(buffer), *a = d;\n        write(d, mClassCount);\n        write(d, mThreadCount);\n        write(d, mKernelCount);\n        auto kernelSize = mKernelCount*sizeof(YoloKernel);\n        memcpy(d,mYoloKernel.data(),kernelSize);\n        d += kernelSize;\n\n        assert(d == a + getSerializationSize());\n    }\n    \n    size_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT\n    {  \n        return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount)  + sizeof(Yolo::YoloKernel) * mYoloKernel.size();\n    }\n\n    int YoloLayerPlugin::initialize() TRT_NOEXCEPT\n    { \n        return 0;\n    }\n    \n    Dims YoloLayerPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT\n    {\n        //output the result to channel\n        int totalsize = 
MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / sizeof(float);\n\n        return Dims3(totalsize + 1, 1, 1);\n    }\n\n    // Set plugin namespace\n    void YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT\n    {\n        mPluginNamespace = pluginNamespace;\n    }\n\n    const char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT\n    {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    void YoloLayerPlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT\n    {\n    }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\n    const char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    void YoloLayerPlugin::destroy() TRT_NOEXCEPT\n    {\n        delete 
this;\n    }\n\n    // Clone the plugin\n    IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT\n    {\n        YoloLayerPlugin *p = new YoloLayerPlugin();\n        p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float Logist(float data){ return 1.0f / (1.0f + expf(-data)); };\n\n    __global__ void CalDetection(const float *input, float *output,int noElements, \n            int yoloWidth,int yoloHeight,const float anchors[CHECK_COUNT*2],int classes,int outputElem) {\n \n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= noElements) return;\n\n        int total_grid = yoloWidth * yoloHeight;\n        int bnIdx = idx / total_grid;\n        idx = idx - total_grid*bnIdx;\n        int info_len_i = 5 + classes;\n        const float* curInput = input + bnIdx * (info_len_i * total_grid * CHECK_COUNT);\n\n        for (int k = 0; k < 3; ++k) {\n            int class_id = 0;\n            float max_cls_prob = 0.0;\n            for (int i = 5; i < info_len_i; ++i) {\n                float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);\n                if (p > max_cls_prob) {\n                    max_cls_prob = p;\n                    class_id = i - 5;\n                }\n            }\n            float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);\n            if (max_cls_prob < IGNORE_THRESH || box_prob < IGNORE_THRESH) continue;\n\n            float *res_count = output + bnIdx*outputElem;\n            int count = (int)atomicAdd(res_count, 1);\n            if (count >= MAX_OUTPUT_BBOX_COUNT) return;\n            char* data = (char * )res_count + sizeof(float) + count*sizeof(Detection);\n            Detection* det =  (Detection*)(data);\n\n            int row = idx / yoloWidth;\n            int col = idx % yoloWidth;\n\n            //Location\n            det->bbox[0] = (col + Logist(curInput[idx + k * info_len_i * total_grid + 0 * 
total_grid])) * INPUT_W / yoloWidth;\n            det->bbox[1] = (row + Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid])) * INPUT_H / yoloHeight;\n            det->bbox[2] = expf(curInput[idx + k * info_len_i * total_grid + 2 * total_grid]) * anchors[2*k];\n            det->bbox[3] = expf(curInput[idx + k * info_len_i * total_grid + 3 * total_grid]) * anchors[2*k + 1];\n            det->det_confidence = box_prob;\n            det->class_id = class_id;\n            det->class_confidence = max_cls_prob;\n        }\n    }\n\n    void YoloLayerPlugin::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize) {\n        void* devAnchor;\n        size_t AnchorLen = sizeof(float)* CHECK_COUNT*2;\n        CUDA_CHECK(cudaMalloc(&devAnchor,AnchorLen));\n\n        int outputElem = 1 + MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / sizeof(float);\n\n        for(int idx = 0 ; idx < batchSize; ++idx) {\n            CUDA_CHECK(cudaMemset(output + idx*outputElem, 0, sizeof(float)));\n        }\n        int numElem = 0;\n        for (unsigned int i = 0;i< mYoloKernel.size();++i)\n        {\n            const auto& yolo = mYoloKernel[i];\n            numElem = yolo.width*yolo.height*batchSize;\n            if (numElem < mThreadCount)\n                mThreadCount = numElem;\n            CUDA_CHECK(cudaMemcpy(devAnchor, yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n            CalDetection<<< (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount>>>\n                (inputs[i],output, numElem, yolo.width, yolo.height, (float *)devAnchor, mClassCount ,outputElem);\n        }\n\n        CUDA_CHECK(cudaFree(devAnchor));\n    }\n\n\n    int YoloLayerPlugin::enqueue(int batchSize, const void*const * inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT\n    {\n        //assert(batchSize == 1);\n        //GPU\n        //CUDA_CHECK(cudaStreamSynchronize(stream));\n      
  forwardGpu((const float *const *)inputs, (float*)outputs[0], stream, batchSize);\n\n        return 0;\n    }\n\n    PluginFieldCollection YoloPluginCreator::mFC{};\n    std::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\n    YoloPluginCreator::YoloPluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT\n    {\n            return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT\n    {\n            return \"1\";\n    }\n\n    const PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT\n    {\n            return &mFC;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT\n    {\n        YoloLayerPlugin* obj = new YoloLayerPlugin();\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call YoloLayerPlugin::destroy()\n        YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n}\n"
  },
  {
    "path": "yolov3/yololayer.h",
    "content": "#ifndef _YOLO_LAYER_H\n#define _YOLO_LAYER_H\n\n#include <iostream>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\n\nnamespace Yolo\n{\n    static constexpr int CHECK_COUNT = 3;\n    static constexpr float IGNORE_THRESH = 0.1f;\n    static constexpr int MAX_OUTPUT_BBOX_COUNT = 1000;\n    static constexpr int CLASS_NUM = 80;\n    static constexpr int INPUT_H = 608;\n    static constexpr int INPUT_W = 608;\n\n    struct YoloKernel\n    {\n        int width;\n        int height;\n        float anchors[CHECK_COUNT*2];\n    };\n\n    static constexpr YoloKernel yolo1 = {\n        INPUT_W / 32,\n        INPUT_H / 32,\n        {116,90,  156,198,  373,326}\n    };\n    static constexpr YoloKernel yolo2 = {\n        INPUT_W / 16,\n        INPUT_H / 16,\n        {30,61,  62,45,  59,119}\n    };\n    static constexpr YoloKernel yolo3 = {\n        INPUT_W / 8,\n        INPUT_H / 8,\n        {10,13,  16,30,  33,23}\n    };\n\n    static constexpr int LOCATIONS = 4;\n    struct alignas(float) Detection{\n        //x y w h\n        float bbox[LOCATIONS];\n        float det_confidence;\n        float class_id;\n        float class_confidence;\n    };\n}\n\nnamespace nvinfer1\n{\n    class YoloLayerPlugin: public IPluginV2IOExt\n    {\n        public:\n            explicit YoloLayerPlugin();\n            YoloLayerPlugin(const void* data, size_t length);\n\n            ~YoloLayerPlugin();\n\n            int getNbOutputs() const TRT_NOEXCEPT override\n            {\n                return 1;\n            }\n\n            Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n            int initialize() TRT_NOEXCEPT override;\n\n            virtual void terminate() TRT_NOEXCEPT override {};\n\n            virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0;}\n\n            virtual int enqueue(int batchSize, const void*const * inputs, void*TRT_CONST_ENQUEUE* outputs, 
void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;\n\n            virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n            virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {\n                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n            }\n\n            const char* getPluginType() const TRT_NOEXCEPT override;\n\n            const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n            void destroy() TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n            void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n            const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override;\n\n            bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;\n\n            bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n            void attachToContext(\n                    cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n            void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT override;\n\n            void detachFromContext() TRT_NOEXCEPT override;\n\n        private:\n            void forwardGpu(const float *const * inputs,float * output, cudaStream_t stream,int batchSize = 1);\n            int mClassCount;\n            int mKernelCount;\n            std::vector<Yolo::YoloKernel> mYoloKernel;\n            int mThreadCount = 256;\n            const char* 
mPluginNamespace;\n    };\n\n    class YoloPluginCreator : public IPluginCreator\n    {\n        public:\n            YoloPluginCreator();\n\n            ~YoloPluginCreator() override = default;\n\n            const char* getPluginName() const TRT_NOEXCEPT override;\n\n            const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n            const PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT override;\n\n            void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override\n            {\n                mNamespace = libNamespace;\n            }\n\n            const char* getPluginNamespace() const TRT_NOEXCEPT override\n            {\n                return mNamespace.c_str();\n            }\n\n        private:\n            std::string mNamespace;\n            static PluginFieldCollection mFC;\n            static std::vector<PluginField> mPluginAttributes;\n    };\n    REGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n};\n\n#endif \n"
  },
  {
    "path": "yolov3/yolov3.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"utils.h\"\n#include \"logging.h\"\n#include \"yololayer.h\"\n#include \"calibrator.h\"\n\n#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32\n#define DEVICE 0  // GPU id\n#define NMS_THRESH 0.4\n#define BBOX_CONF_THRESH 0.5\n\nusing namespace nvinfer1;\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = Yolo::INPUT_H;\nstatic const int INPUT_W = Yolo::INPUT_W;\nstatic const int DETECTION_SIZE = sizeof(Yolo::Detection) / sizeof(float);\nstatic const int OUTPUT_SIZE = Yolo::MAX_OUTPUT_BBOX_COUNT * DETECTION_SIZE + 1;  // we assume the yololayer outputs no more than MAX_OUTPUT_BBOX_COUNT boxes that conf >= 0.1\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nstatic Logger gLogger;\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    int l, r, t, b;\n    float r_w = INPUT_W / (img.cols * 1.0);\n    float r_h = INPUT_H / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] - bbox[2]/2.f;\n        r = bbox[0] + bbox[2]/2.f;\n        t = bbox[1] - bbox[3]/2.f - (INPUT_H - r_w * img.rows) / 2;\n        b = bbox[1] + bbox[3]/2.f - (INPUT_H - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - bbox[2]/2.f - (INPUT_W - r_h * img.cols) / 2;\n        r = bbox[0] + bbox[2]/2.f - (INPUT_W - r_h * img.cols) / 2;\n        t = bbox[1] - bbox[3]/2.f;\n        b = bbox[1] + bbox[3]/2.f;\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    return cv::Rect(l, t, r-l, b-t);\n}\n\nfloat iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n        std::max(lbox[0] - lbox[2]/2.f , rbox[0] - rbox[2]/2.f), //left\n        std::min(lbox[0] + lbox[2]/2.f , 
rbox[0] + rbox[2]/2.f), //right\n        std::max(lbox[1] - lbox[3]/2.f , rbox[1] - rbox[3]/2.f), //top\n        std::min(lbox[1] + lbox[3]/2.f , rbox[1] + rbox[3]/2.f), //bottom\n    };\n\n    if(interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS =(interBox[1]-interBox[0])*(interBox[3]-interBox[2]);\n    return interBoxS/(lbox[2]*lbox[3] + rbox[2]*rbox[3] -interBoxS);\n}\n\nbool cmp(const Yolo::Detection& a, const Yolo::Detection& b) {\n    return a.det_confidence > b.det_confidence;\n}\n\nvoid nms(std::vector<Yolo::Detection>& res, float *output, float nms_thresh = NMS_THRESH) {\n    std::map<float, std::vector<Yolo::Detection>> m;\n    for (int i = 0; i < output[0] && i < 1000; i++) {\n        if (output[1 + 7 * i + 4] <= BBOX_CONF_THRESH) continue;\n        Yolo::Detection det;\n        memcpy(&det, &output[1 + 7 * i], 7 * sizeof(float));\n        if (m.count(det.class_id) == 0) m.emplace(det.class_id, std::vector<Yolo::Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        //std::cout << it->second[0].class_id << \" --- \" << std::endl;\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin()+n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    
assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob (each value is stored as a 32-bit hex word)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    
weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* convBnLeaky(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,  int outch, int ksize, int s, int p, int linx) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[\"module_list.\" + std::to_string(linx) + \".Conv2d.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"module_list.\" + std::to_string(linx) + \".BatchNorm2d\", 1e-5);\n\n    auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kLEAKY_RELU);\n    lr->setAlpha(0.1);\n\n    return lr;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../yolov3.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    // Yeah I am stupid, I just want to expand the complete arch of darknet..\n    auto lr0 = convBnLeaky(network, weightMap, *data, 32, 3, 1, 1, 0);\n    auto lr1 = convBnLeaky(network, weightMap, *lr0->getOutput(0), 64, 3, 2, 1, 1);\n    auto lr2 = convBnLeaky(network, weightMap, *lr1->getOutput(0), 32, 1, 1, 0, 2);\n    auto lr3 = convBnLeaky(network, weightMap, *lr2->getOutput(0), 64, 3, 1, 1, 3);\n    auto ew4 = 
network->addElementWise(*lr3->getOutput(0), *lr1->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr5 = convBnLeaky(network, weightMap, *ew4->getOutput(0), 128, 3, 2, 1, 5);\n    auto lr6 = convBnLeaky(network, weightMap, *lr5->getOutput(0), 64, 1, 1, 0, 6);\n    auto lr7 = convBnLeaky(network, weightMap, *lr6->getOutput(0), 128, 3, 1, 1, 7);\n    auto ew8 = network->addElementWise(*lr7->getOutput(0), *lr5->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr9 = convBnLeaky(network, weightMap, *ew8->getOutput(0), 64, 1, 1, 0, 9);\n    auto lr10 = convBnLeaky(network, weightMap, *lr9->getOutput(0), 128, 3, 1, 1, 10);\n    auto ew11 = network->addElementWise(*lr10->getOutput(0), *ew8->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr12 = convBnLeaky(network, weightMap, *ew11->getOutput(0), 256, 3, 2, 1, 12);\n    auto lr13 = convBnLeaky(network, weightMap, *lr12->getOutput(0), 128, 1, 1, 0, 13);\n    auto lr14 = convBnLeaky(network, weightMap, *lr13->getOutput(0), 256, 3, 1, 1, 14);\n    auto ew15 = network->addElementWise(*lr14->getOutput(0), *lr12->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr16 = convBnLeaky(network, weightMap, *ew15->getOutput(0), 128, 1, 1, 0, 16);\n    auto lr17 = convBnLeaky(network, weightMap, *lr16->getOutput(0), 256, 3, 1, 1, 17);\n    auto ew18 = network->addElementWise(*lr17->getOutput(0), *ew15->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr19 = convBnLeaky(network, weightMap, *ew18->getOutput(0), 128, 1, 1, 0, 19);\n    auto lr20 = convBnLeaky(network, weightMap, *lr19->getOutput(0), 256, 3, 1, 1, 20);\n    auto ew21 = network->addElementWise(*lr20->getOutput(0), *ew18->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr22 = convBnLeaky(network, weightMap, *ew21->getOutput(0), 128, 1, 1, 0, 22);\n    auto lr23 = convBnLeaky(network, weightMap, *lr22->getOutput(0), 256, 3, 1, 1, 23);\n    auto ew24 = network->addElementWise(*lr23->getOutput(0), *ew21->getOutput(0), ElementWiseOperation::kSUM);\n    
auto lr25 = convBnLeaky(network, weightMap, *ew24->getOutput(0), 128, 1, 1, 0, 25);\n    auto lr26 = convBnLeaky(network, weightMap, *lr25->getOutput(0), 256, 3, 1, 1, 26);\n    auto ew27 = network->addElementWise(*lr26->getOutput(0), *ew24->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr28 = convBnLeaky(network, weightMap, *ew27->getOutput(0), 128, 1, 1, 0, 28);\n    auto lr29 = convBnLeaky(network, weightMap, *lr28->getOutput(0), 256, 3, 1, 1, 29);\n    auto ew30 = network->addElementWise(*lr29->getOutput(0), *ew27->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr31 = convBnLeaky(network, weightMap, *ew30->getOutput(0), 128, 1, 1, 0, 31);\n    auto lr32 = convBnLeaky(network, weightMap, *lr31->getOutput(0), 256, 3, 1, 1, 32);\n    auto ew33 = network->addElementWise(*lr32->getOutput(0), *ew30->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr34 = convBnLeaky(network, weightMap, *ew33->getOutput(0), 128, 1, 1, 0, 34);\n    auto lr35 = convBnLeaky(network, weightMap, *lr34->getOutput(0), 256, 3, 1, 1, 35);\n    auto ew36 = network->addElementWise(*lr35->getOutput(0), *ew33->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr37 = convBnLeaky(network, weightMap, *ew36->getOutput(0), 512, 3, 2, 1, 37);\n    auto lr38 = convBnLeaky(network, weightMap, *lr37->getOutput(0), 256, 1, 1, 0, 38);\n    auto lr39 = convBnLeaky(network, weightMap, *lr38->getOutput(0), 512, 3, 1, 1, 39);\n    auto ew40 = network->addElementWise(*lr39->getOutput(0), *lr37->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr41 = convBnLeaky(network, weightMap, *ew40->getOutput(0), 256, 1, 1, 0, 41);\n    auto lr42 = convBnLeaky(network, weightMap, *lr41->getOutput(0), 512, 3, 1, 1, 42);\n    auto ew43 = network->addElementWise(*lr42->getOutput(0), *ew40->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr44 = convBnLeaky(network, weightMap, *ew43->getOutput(0), 256, 1, 1, 0, 44);\n    auto lr45 = convBnLeaky(network, weightMap, *lr44->getOutput(0), 512, 3, 1, 1, 
45);\n    auto ew46 = network->addElementWise(*lr45->getOutput(0), *ew43->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr47 = convBnLeaky(network, weightMap, *ew46->getOutput(0), 256, 1, 1, 0, 47);\n    auto lr48 = convBnLeaky(network, weightMap, *lr47->getOutput(0), 512, 3, 1, 1, 48);\n    auto ew49 = network->addElementWise(*lr48->getOutput(0), *ew46->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr50 = convBnLeaky(network, weightMap, *ew49->getOutput(0), 256, 1, 1, 0, 50);\n    auto lr51 = convBnLeaky(network, weightMap, *lr50->getOutput(0), 512, 3, 1, 1, 51);\n    auto ew52 = network->addElementWise(*lr51->getOutput(0), *ew49->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr53 = convBnLeaky(network, weightMap, *ew52->getOutput(0), 256, 1, 1, 0, 53);\n    auto lr54 = convBnLeaky(network, weightMap, *lr53->getOutput(0), 512, 3, 1, 1, 54);\n    auto ew55 = network->addElementWise(*lr54->getOutput(0), *ew52->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr56 = convBnLeaky(network, weightMap, *ew55->getOutput(0), 256, 1, 1, 0, 56);\n    auto lr57 = convBnLeaky(network, weightMap, *lr56->getOutput(0), 512, 3, 1, 1, 57);\n    auto ew58 = network->addElementWise(*lr57->getOutput(0), *ew55->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr59 = convBnLeaky(network, weightMap, *ew58->getOutput(0), 256, 1, 1, 0, 59);\n    auto lr60 = convBnLeaky(network, weightMap, *lr59->getOutput(0), 512, 3, 1, 1, 60);\n    auto ew61 = network->addElementWise(*lr60->getOutput(0), *ew58->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr62 = convBnLeaky(network, weightMap, *ew61->getOutput(0), 1024, 3, 2, 1, 62);\n    auto lr63 = convBnLeaky(network, weightMap, *lr62->getOutput(0), 512, 1, 1, 0, 63);\n    auto lr64 = convBnLeaky(network, weightMap, *lr63->getOutput(0), 1024, 3, 1, 1, 64);\n    auto ew65 = network->addElementWise(*lr64->getOutput(0), *lr62->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr66 = convBnLeaky(network, weightMap, 
*ew65->getOutput(0), 512, 1, 1, 0, 66);\n    auto lr67 = convBnLeaky(network, weightMap, *lr66->getOutput(0), 1024, 3, 1, 1, 67);\n    auto ew68 = network->addElementWise(*lr67->getOutput(0), *ew65->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr69 = convBnLeaky(network, weightMap, *ew68->getOutput(0), 512, 1, 1, 0, 69);\n    auto lr70 = convBnLeaky(network, weightMap, *lr69->getOutput(0), 1024, 3, 1, 1, 70);\n    auto ew71 = network->addElementWise(*lr70->getOutput(0), *ew68->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr72 = convBnLeaky(network, weightMap, *ew71->getOutput(0), 512, 1, 1, 0, 72);\n    auto lr73 = convBnLeaky(network, weightMap, *lr72->getOutput(0), 1024, 3, 1, 1, 73);\n    auto ew74 = network->addElementWise(*lr73->getOutput(0), *ew71->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr75 = convBnLeaky(network, weightMap, *ew74->getOutput(0), 512, 1, 1, 0, 75);\n    auto lr76 = convBnLeaky(network, weightMap, *lr75->getOutput(0), 1024, 3, 1, 1, 76);\n    auto lr77 = convBnLeaky(network, weightMap, *lr76->getOutput(0), 512, 1, 1, 0, 77);\n    auto lr78 = convBnLeaky(network, weightMap, *lr77->getOutput(0), 1024, 3, 1, 1, 78);\n    auto lr79 = convBnLeaky(network, weightMap, *lr78->getOutput(0), 512, 1, 1, 0, 79);\n    auto lr80 = convBnLeaky(network, weightMap, *lr79->getOutput(0), 1024, 3, 1, 1, 80);\n    IConvolutionLayer* conv81 = network->addConvolutionNd(*lr80->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.81.Conv2d.weight\"], weightMap[\"module_list.81.Conv2d.bias\"]);\n    assert(conv81);\n    // 82 is yolo\n    auto l83 = lr79;\n    auto lr84 = convBnLeaky(network, weightMap, *l83->getOutput(0), 256, 1, 1, 0, 84);\n\n    float *deval = reinterpret_cast<float*>(malloc(sizeof(float) * 256 * 2 * 2));\n    for (int i = 0; i < 256 * 2 * 2; i++) {\n        deval[i] = 1.0;\n    }\n    Weights deconvwts85{DataType::kFLOAT, deval, 256 * 2 * 2};\n    IDeconvolutionLayer* deconv85 = 
network->addDeconvolutionNd(*lr84->getOutput(0), 256, DimsHW{2, 2}, deconvwts85, emptywts);\n    assert(deconv85);\n    deconv85->setStrideNd(DimsHW{2, 2});\n    deconv85->setNbGroups(256);\n    weightMap[\"deconv85\"] = deconvwts85;\n\n    ITensor* inputTensors[] = {deconv85->getOutput(0), ew61->getOutput(0)};\n    auto cat86 = network->addConcatenation(inputTensors, 2);\n    auto lr87 = convBnLeaky(network, weightMap, *cat86->getOutput(0), 256, 1, 1, 0, 87);\n    auto lr88 = convBnLeaky(network, weightMap, *lr87->getOutput(0), 512, 3, 1, 1, 88);\n    auto lr89 = convBnLeaky(network, weightMap, *lr88->getOutput(0), 256, 1, 1, 0, 89);\n    auto lr90 = convBnLeaky(network, weightMap, *lr89->getOutput(0), 512, 3, 1, 1, 90);\n    auto lr91 = convBnLeaky(network, weightMap, *lr90->getOutput(0), 256, 1, 1, 0, 91);\n    auto lr92 = convBnLeaky(network, weightMap, *lr91->getOutput(0), 512, 3, 1, 1, 92);\n    IConvolutionLayer* conv93 = network->addConvolutionNd(*lr92->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.93.Conv2d.weight\"], weightMap[\"module_list.93.Conv2d.bias\"]);\n    assert(conv93);\n    // 94 is yolo\n    auto l95 = lr91;\n    auto lr96 = convBnLeaky(network, weightMap, *l95->getOutput(0), 128, 1, 1, 0, 96);\n    Weights deconvwts97{DataType::kFLOAT, deval, 128 * 2 * 2};\n    IDeconvolutionLayer* deconv97 = network->addDeconvolutionNd(*lr96->getOutput(0), 128, DimsHW{2, 2}, deconvwts97, emptywts);\n    assert(deconv97);\n    deconv97->setStrideNd(DimsHW{2, 2});\n    deconv97->setNbGroups(128);\n    ITensor* inputTensors1[] = {deconv97->getOutput(0), ew36->getOutput(0)};\n    auto cat98 = network->addConcatenation(inputTensors1, 2);\n    auto lr99 = convBnLeaky(network, weightMap, *cat98->getOutput(0), 128, 1, 1, 0, 99);\n    auto lr100 = convBnLeaky(network, weightMap, *lr99->getOutput(0), 256, 3, 1, 1, 100);\n    auto lr101 = convBnLeaky(network, weightMap, *lr100->getOutput(0), 128, 1, 1, 0, 101);\n    auto lr102 = 
convBnLeaky(network, weightMap, *lr101->getOutput(0), 256, 3, 1, 1, 102);\n    auto lr103 = convBnLeaky(network, weightMap, *lr102->getOutput(0), 128, 1, 1, 0, 103);\n    auto lr104 = convBnLeaky(network, weightMap, *lr103->getOutput(0), 256, 3, 1, 1, 104);\n    IConvolutionLayer* conv105 = network->addConvolutionNd(*lr104->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.105.Conv2d.weight\"], weightMap[\"module_list.105.Conv2d.bias\"]);\n    assert(conv105);\n\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    const PluginFieldCollection* pluginData = creator->getFieldNames();\n    IPluginV2 *pluginObj = creator->createPlugin(\"yololayer\", pluginData);\n    ITensor* inputTensors_yolo[] = {conv81->getOutput(0), conv93->getOutput(0), conv105->getOutput(0)};\n    auto yolo = network->addPluginV2(inputTensors_yolo, 3, *pluginObj);\n\n    yolo->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*yolo->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2 *calibrator = new Int8EntropyCalibrator2(1, INPUT_W, INPUT_H, \"./coco_calib/\", \"int8calib.table\", INPUT_BLOB_NAME);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = 
engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CUDA_CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CUDA_CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(buffers[inputIndex]));\n    CUDA_CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(\"yolov3.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    } else if (argc == 3 && std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"yolov3.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, 
file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./yolov3 -s  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./yolov3 -d ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(argv[2], file_names) < 0) {\n        std::cout << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // prepare input data ---------------------------\n    static float data[3 * INPUT_H * INPUT_W];\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n    //    data[i] = 1.0;\n    static float prob[OUTPUT_SIZE];\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    int fcount = 0;\n    for (auto f: file_names) {\n        fcount++;\n        std::cout << fcount << \"  \" << f << std::endl;\n        cv::Mat img = cv::imread(std::string(argv[2]) + \"/\" + f);\n        if (img.empty()) continue;\n        cv::Mat pr_img = preprocess_img(img, INPUT_W, INPUT_H);\n        for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n            data[i] = pr_img.at<cv::Vec3b>(i)[2] / 255.0;\n            data[i + INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[1] / 255.0;\n            data[i + 2 * INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[0] / 255.0;\n        }\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n        std::vector<Yolo::Detection> res;\n        nms(res, prob);\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect(img, res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n        cv::imwrite(\"_\" + f, img);\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n    return 0;\n}\n"
  },
  {
    "path": "yolov3/yolov3_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov5 project.\n    param: \n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n        line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n        
)\n\n\nclass YoLov3TRT(object):\n    \"\"\"\n    description: A YOLOv3 class that wraps TensorRT ops, preprocessing and postprocessing ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a context on this device.\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = 
cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n    def infer(self, raw_image_generator):\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        engine = self.engine\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Preprocess the images\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size 
= 1\n        output = host_outputs[0]\n        #print(output.shape)\n\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid = self.post_process(\n                output[i * 7001: (i + 1) * 7001], batch_origin_h[i], batch_origin_w[i]\n            )\n            \n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                \n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        \n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n        \n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            input_image_path: str, image path\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        
image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate width, height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image along its long side while maintaining the aspect ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128); pass the color as `value`,\n        # since the next positional argument after borderType is `dst`\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, value=(128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A numpy array of boxes, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A numpy array of boxes, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2\n            y[:, 1] = 
x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A flat numpy array with 7 values per box after the leading count,\n                        like [num_boxes, cx,cy,w,h,conf,cls_id,cls_conf, cx,cy,w,h,conf,cls_id,cls_conf, ...]\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a numpy array, each element is the score corresponding to a box\n            result_classid: final class ids, a numpy array, each element is the class id corresponding to a box\n        \"\"\"\n        # Get the num of boxes detected\n        num = int(output[0])\n        np.set_printoptions(suppress=True)\n        #print(\"num:\", num)\n        #np.set_printoptions(threshold=sys.maxsize)\n        #print(output[1:])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, 7))[:num, :]\n        if pred.shape[0] > 0:\n            #print(pred[0])\n            # Fold the last column into the confidence, then drop it\n            pred[:,4] *= pred[:,6]\n            pred = pred[:,:-1]\n            #print(pred[0])\n\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        return 
result_boxes, result_scores, result_classid\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))            \n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None) * \\\n                     np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with 
lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, each row is (center_x, center_y, w, h, conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms, each row is (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Get the boxes whose score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # clip the coordinates\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, -1] == boxes[:, -1]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov3_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov3_wrapper = yolov3_wrapper\n        
self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov3_wrapper.infer(self.yolov3_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov3_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov3_wrapper = yolov3_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov3_wrapper.infer(self.yolov3_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"build/libyololayer.so\"\n    engine_file_path = \"build/yolov3.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n            \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n            \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n            \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n            \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", 
\"apple\",\n            \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n            \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n            \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n            \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov5TRT instance\n    yolov3_wrapper = YoLov3TRT(engine_file_path)\n    try:\n        print('batch size is', yolov3_wrapper.batch_size)\n        \n        image_dir = \"samples/\"\n        image_path_batches = get_img_path_batches(yolov3_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov3_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov3_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov3_wrapper.destroy()\n"
  },
  {
    "path": "yolov3-spp/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(yolov3-spp)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\ncuda_add_library(yololayer SHARED ${PROJECT_SOURCE_DIR}/yololayer.cu)\ntarget_link_libraries(yololayer nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(yolov3-spp ${PROJECT_SOURCE_DIR}/yolov3-spp.cpp)\ntarget_link_libraries(yolov3-spp nvinfer)\ntarget_link_libraries(yolov3-spp cudart)\ntarget_link_libraries(yolov3-spp yololayer)\ntarget_link_libraries(yolov3-spp ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "yolov3-spp/README.md",
    "content": "# yolov3-spp\n\nCurrently this is supporting dynamic input shape, if you want to use non-dynamic version, please checkout commit [659fd2b](https://github.com/wang-xinyu/tensorrtx/commit/659fd2b23482197b19dccf746a5a3dbff1611381).\n\nThe Pytorch implementation is [ultralytics/yolov3 archive branch](https://github.com/ultralytics/yolov3/tree/archive). It provides two trained weights of yolov3-spp, `yolov3-spp.pt` and `yolov3-spp-ultralytics.pt`(originally named `ultralytics68.pt`).\n\n## Config\n\n- Number of classes defined in yololayer.h\n- FP16/FP32 can be selected by the macro in yolov3-spp.cpp\n- GPU id can be selected by the macro in yolov3-spp.cpp\n- NMS thresh in yolov3-spp.cpp\n- BBox confidence thresh in yolov3-spp.cpp\n- MIN and MAX input size defined in yolov3-spp.cpp\n- Optimization width and height for IOptimizationProfile defined in yolov3-spp.cpp\n\n## How to Run\n\n1. generate yolov3-spp_ultralytics68.wts from pytorch implementation with yolov3-spp.cfg and yolov3-spp-ultralytics.pt, or download .wts from model zoo\n\n```\ngit clone https://github.com/wang-xinyu/tensorrtx.git\ngit clone -b archive https://github.com/ultralytics/yolov3.git\n// download its weights 'yolov3-spp-ultralytics.pt'\n// copy gen_wts.py from tensorrtx/yolov3-spp/ to ultralytics/yolov3/\n// go to ultralytics/yolov3/\npython gen_wts.py yolov3-spp-ultralytics.pt\n// a file 'yolov3-spp_ultralytics68.wts' will be generated.\n// the master branch of yolov3 should work, if not, you can checkout 4ac60018f6e6c1e24b496485f126a660d9c793d8\n```\n\n2. build tensorrtx/yolov3-spp and run\n\n```\n// put yolov3-spp_ultralytics68.wts into tensorrtx/yolov3-spp/\n// go to tensorrtx/yolov3-spp/\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./yolov3-spp -s             // serialize model to plan file i.e. 'yolov3-spp.engine'\nsudo ./yolov3-spp -d  ../samples // deserialize plan file and run inference, the images in samples will be processed.\n```\n\n3. 
check the generated images, _zidane.jpg and _bus.jpg, which should look as follows:\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/78247927-4d9fac00-751e-11ea-8b1b-704a0aeb3fcf.jpg\">\n</p>\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/78247970-60b27c00-751e-11ea-88df-41473fed4823.jpg\">\n</p>\n\n## More Information\n\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx).\n\n"
  },
  {
    "path": "yolov3-spp/Utils.h",
    "content": "#ifndef __TRT_UTILS_H_\n#define __TRT_UTILS_H_\n\n#include <iostream>\n#include <vector>\n#include <algorithm>\n#include <cudnn.h>\n\n#ifndef CUDA_CHECK\n\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n\n#endif\n\nnamespace Tn\n{\n    class Profiler : public nvinfer1::IProfiler\n    {\n    public:\n        void printLayerTimes(int itrationsTimes)\n        {\n            float totalTime = 0;\n            for (size_t i = 0; i < mProfile.size(); i++)\n            {\n                printf(\"%-40.40s %4.3fms\\n\", mProfile[i].first.c_str(), mProfile[i].second / itrationsTimes);\n                totalTime += mProfile[i].second;\n            }\n            printf(\"Time over all layers: %4.3f\\n\", totalTime / itrationsTimes);\n        }\n    private:\n        typedef std::pair<std::string, float> Record;\n        std::vector<Record> mProfile;\n\n        virtual void reportLayerTime(const char* layerName, float ms)\n        {\n            auto record = std::find_if(mProfile.begin(), mProfile.end(), [&](const Record& r){ return r.first == layerName; });\n            if (record == mProfile.end())\n                mProfile.push_back(std::make_pair(layerName, ms));\n            else\n                record->second += ms;\n        }\n    };\n\n    //Logger for TensorRT info/warning/errors\n    class Logger : 
public nvinfer1::ILogger\n    {\n    public:\n\n        Logger(): Logger(Severity::kWARNING) {}\n\n        Logger(Severity severity): reportableSeverity(severity) {}\n\n        void log(Severity severity, const char* msg) override\n        {\n            // suppress messages with severity enum value greater than the reportable\n            if (severity > reportableSeverity) return;\n\n            switch (severity)\n            {\n                case Severity::kINTERNAL_ERROR: std::cerr << \"INTERNAL_ERROR: \"; break;\n                case Severity::kERROR: std::cerr << \"ERROR: \"; break;\n                case Severity::kWARNING: std::cerr << \"WARNING: \"; break;\n                case Severity::kINFO: std::cerr << \"INFO: \"; break;\n                default: std::cerr << \"UNKNOWN: \"; break;\n            }\n            std::cerr << msg << std::endl;\n        }\n\n        Severity reportableSeverity{Severity::kWARNING};\n    };\n\n    template<typename T> \n    void write(char*& buffer, const T& val)\n    {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> \n    void read(const char*& buffer, T& val)\n    {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n}\n\n#endif"
  },
  {
    "path": "yolov3-spp/gen_wts.py",
    "content": "import struct\nimport sys\nimport torch\nfrom models import *  # noqa: F403\nfrom utils.utils import *  # noqa: F403\n\nmodel = Darknet('cfg/yolov3-spp.cfg', (416, 416))  # noqa: F405\nweights = sys.argv[1]\ndev = '0'\ndevice = torch_utils.select_device(dev)  # noqa: F405\nmodel.load_state_dict(torch.load(weights, map_location=device, weights_only=False)['model'])\n\n\nwith open('yolov3-spp_ultralytics68.wts', 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolov3-spp/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov3-spp/yololayer.cu",
    "content": "#include \"yololayer.h\"\n\nusing namespace Yolo;\n\nnamespace nvinfer1\n{\n    YoloLayerPlugin::YoloLayerPlugin()\n    {\n        mClassCount = CLASS_NUM;\n        mYoloKernel.clear();\n        mYoloKernel.push_back(yolo1);\n        mYoloKernel.push_back(yolo2);\n        mYoloKernel.push_back(yolo3);\n        mKernelCount = mYoloKernel.size();\n\n        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n        size_t anchorLen = sizeof(float) * CHECK_COUNT * 2;\n        for (int i = 0; i < mKernelCount; i++)\n        {\n            CUDA_CHECK(cudaMalloc(&mAnchor[i], anchorLen));\n            const auto& yolo = mYoloKernel[i];\n            CUDA_CHECK(cudaMemcpy(mAnchor[i], yolo.anchors, anchorLen, cudaMemcpyHostToDevice));\n        }\n    }\n\n    YoloLayerPlugin::~YoloLayerPlugin()\n    {\n        for (int i = 0; i < mKernelCount; i++)\n        {\n            CUDA_CHECK(cudaFree(mAnchor[i]));\n        }\n        CUDA_CHECK(cudaFreeHost(mAnchor));\n    }\n\n    // create the plugin at runtime from a byte stream\n    YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length)\n    {\n        using namespace Tn;\n        const char *d = reinterpret_cast<const char *>(data), *a = d;\n        read(d, mClassCount);\n        read(d, mThreadCount);\n        read(d, mKernelCount);\n        mYoloKernel.resize(mKernelCount);\n        auto kernelSize = mKernelCount * sizeof(YoloKernel);\n        memcpy(mYoloKernel.data(), d, kernelSize);\n        d += kernelSize;\n        assert(d == a + length);\n\n        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n        size_t anchorLen = sizeof(float) * CHECK_COUNT * 2;\n        for (int i = 0; i < mKernelCount; i++)\n        {\n            CUDA_CHECK(cudaMalloc(&mAnchor[i], anchorLen));\n            const auto& yolo = mYoloKernel[i];\n            CUDA_CHECK(cudaMemcpy(mAnchor[i], yolo.anchors, anchorLen, cudaMemcpyHostToDevice));\n        }\n    }\n\n    void 
YoloLayerPlugin::serialize(void* buffer) const\n    {\n        using namespace Tn;\n        char* d = static_cast<char*>(buffer), *a = d;\n        write(d, mClassCount);\n        write(d, mThreadCount);\n        write(d, mKernelCount);\n        auto kernelSize = mKernelCount * sizeof(YoloKernel);\n        memcpy(d,mYoloKernel.data(), kernelSize);\n        d += kernelSize;\n\n        assert(d == a + getSerializationSize());\n    }\n\n    size_t YoloLayerPlugin::getSerializationSize() const\n    {\n        return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount)  + sizeof(Yolo::YoloKernel) * mYoloKernel.size();\n    }\n\n    int YoloLayerPlugin::initialize()\n    {\n        return 0;\n    }\n\n    DimsExprs YoloLayerPlugin::getOutputDimensions(int outputIndex, const DimsExprs* inputs, int nbInputs, IExprBuilder& exprBuilder)\n    {\n        //output the result to channel\n        int totalsize = MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / sizeof(float);\n        DimsExprs de;\n        de.nbDims = 2;\n        de.d[0] = exprBuilder.constant(inputs[0].d[0]->getConstantValue());  // batchsize\n        de.d[1] = exprBuilder.constant(totalsize + 1);  // outputsize\n        return de;\n    }\n\n    // Set plugin namespace\n    void YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace)\n    {\n        mPluginNamespace = pluginNamespace;\n    }\n\n    const char* YoloLayerPlugin::getPluginNamespace() const\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const\n    {\n        return DataType::kFLOAT;\n    }\n\n    void YoloLayerPlugin::configurePlugin(const DynamicPluginTensorDesc* in, int nbInputs, const DynamicPluginTensorDesc* out, int nbOutputs)\n    {\n    }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context 
resource.\n    void YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator)\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void YoloLayerPlugin::detachFromContext() {}\n\n    const char* YoloLayerPlugin::getPluginType() const\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloLayerPlugin::getPluginVersion() const\n    {\n        return \"1\";\n    }\n\n    void YoloLayerPlugin::destroy()\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    IPluginV2DynamicExt* YoloLayerPlugin::clone() const\n    {\n        YoloLayerPlugin *p = new YoloLayerPlugin();\n        p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float Logist(float data){ return 1.0f / (1.0f + expf(-data)); };\n\n    __global__ void CalDetection(const float *input, float *output, int noElements,\n            int yoloWidth, int yoloHeight, int yoloStride, const float anchors[CHECK_COUNT * 2], int classes, int outputElem) {\n\n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= noElements) return;\n\n        int total_grid = yoloWidth * yoloHeight;\n        int bnIdx = idx / total_grid;\n        idx = idx - total_grid*bnIdx;\n        int info_len_i = 5 + classes;\n        const float* curInput = input + bnIdx * (info_len_i * total_grid * CHECK_COUNT);\n\n        for (int k = 0; k < 3; ++k) {\n            int class_id = 0;\n            float max_cls_prob = 0.0;\n            for (int i = 5; i < info_len_i; ++i) {\n                float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);\n                if (p > max_cls_prob) {\n                    max_cls_prob = p;\n                    class_id = i - 5;\n                }\n            }\n            float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);\n            if (max_cls_prob < IGNORE_THRESH || box_prob < 
IGNORE_THRESH) continue;\n\n            float *res_count = output + bnIdx * outputElem;\n            int count = (int)atomicAdd(res_count, 1);\n            if (count >= MAX_OUTPUT_BBOX_COUNT) return;\n            char* data = (char*)res_count + sizeof(float) + count * sizeof(Detection);\n            Detection* det = (Detection*)(data);\n\n            int row = idx / yoloWidth;\n            int col = idx % yoloWidth;\n\n            //Location\n            det->bbox[0] = (col + Logist(curInput[idx + k * info_len_i * total_grid + 0 * total_grid])) * yoloStride;\n            det->bbox[1] = (row + Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid])) * yoloStride;\n            det->bbox[2] = expf(curInput[idx + k * info_len_i * total_grid + 2 * total_grid]) * anchors[2 * k];\n            det->bbox[3] = expf(curInput[idx + k * info_len_i * total_grid + 3 * total_grid]) * anchors[2 * k + 1];\n            det->det_confidence = box_prob;\n            det->class_id = class_id;\n            det->class_confidence = max_cls_prob;\n        }\n    }\n\n    void YoloLayerPlugin::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize) {\n        int outputElem = 1 + MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / sizeof(float);\n        // Reset each image's detection counter on the caller's stream so it is ordered with the kernels below\n        for(int idx = 0 ; idx < batchSize; ++idx) {\n            CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n        }\n        int numElem = 0;\n        for (size_t i = 0; i < mYoloKernel.size(); ++i) {\n            const auto& yolo = mYoloKernel[i];\n            numElem = yolo.width * yolo.height * batchSize;\n            // Launch on the stream passed to enqueue() rather than the default stream\n            CalDetection<<<(numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>\n                (inputs[i], output, numElem, yolo.width, yolo.height, yolo.stride, (float*)mAnchor[i], mClassCount, outputElem);\n        }\n    }\n\n    int YoloLayerPlugin::enqueue(const PluginTensorDesc* inputDesc, const PluginTensorDesc* outputDesc, const 
void* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream)\n    {\n        int batchSize = inputDesc[0].dims.d[0];\n        for (size_t i = 0; i < mYoloKernel.size(); ++i) {\n            mYoloKernel[i].width = inputDesc[i].dims.d[3];\n            mYoloKernel[i].height = inputDesc[i].dims.d[2];\n        }\n        forwardGpu((const float *const *)inputs, (float*)outputs[0], stream, batchSize);\n        return 0;\n    }\n\n    PluginFieldCollection YoloPluginCreator::mFC{};\n    std::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\n    YoloPluginCreator::YoloPluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* YoloPluginCreator::getPluginName() const\n    {\n            return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloPluginCreator::getPluginVersion() const\n    {\n            return \"1\";\n    }\n\n    const PluginFieldCollection* YoloPluginCreator::getFieldNames()\n    {\n            return &mFC;\n    }\n\n    IPluginV2DynamicExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc)\n    {\n        YoloLayerPlugin* obj = new YoloLayerPlugin();\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2DynamicExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength)\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call YoloLayerPlugin::destroy()\n        YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n}\n"
  },
  {
    "path": "yolov3-spp/yololayer.h",
    "content": "#ifndef _YOLO_LAYER_H\n#define _YOLO_LAYER_H\n\n#include <assert.h>\n#include <cmath>\n#include <string.h>\n#include <cublas_v2.h>\n#include \"NvInfer.h\"\n#include \"Utils.h\"\n#include <iostream>\n\nnamespace Yolo\n{\n    static constexpr int CHECK_COUNT = 3;\n    static constexpr float IGNORE_THRESH = 0.1f;\n    static constexpr int MAX_OUTPUT_BBOX_COUNT = 1000;\n    static constexpr int CLASS_NUM = 80;\n\n    struct YoloKernel\n    {\n        int width;\n        int height;\n        int stride;\n        float anchors[CHECK_COUNT*2];\n    };\n\n    static constexpr YoloKernel yolo1 = {\n        -1,  // dynamic width and height\n        -1,\n        32,\n        {116,90,  156,198,  373,326}\n    };\n    static constexpr YoloKernel yolo2 = {\n        -1,\n        -1,\n        16,\n        {30,61,  62,45,  59,119}\n    };\n    static constexpr YoloKernel yolo3 = {\n        -1,\n        -1,\n        8,\n        {10,13,  16,30,  33,23}\n    };\n\n    static constexpr int LOCATIONS = 4;\n    struct alignas(float) Detection{\n        //x y w h\n        float bbox[LOCATIONS];\n        float det_confidence;\n        float class_id;\n        float class_confidence;\n    };\n}\n\nnamespace nvinfer1\n{\n    class YoloLayerPlugin: public IPluginV2DynamicExt\n    {\n        public:\n            explicit YoloLayerPlugin();\n            YoloLayerPlugin(const void* data, size_t length);\n\n            ~YoloLayerPlugin();\n\n            int getNbOutputs() const override\n            {\n                return 1;\n            }\n\n            //virtual Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) final;\n            virtual DimsExprs getOutputDimensions(int outputIndex, const DimsExprs* inputs, int nbInputs, IExprBuilder& exprBuilder) override;\n\n            int initialize() override;\n\n            virtual void terminate() override {};\n\n            //virtual size_t getWorkspaceSize(int maxBatchSize) const override { return 0;}\n       
     size_t getWorkspaceSize(const PluginTensorDesc* inputs, int nbInputs, const PluginTensorDesc* outputs, int nbOutputs) const override { return 0; }\n\n            //virtual int enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream) override;\n            int enqueue(const PluginTensorDesc* inputDesc, const PluginTensorDesc* outputDesc, const void* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) override;\n\n            virtual size_t getSerializationSize() const override;\n\n            virtual void serialize(void* buffer) const override;\n\n            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) override {\n                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n            }\n\n            const char* getPluginType() const override;\n\n            const char* getPluginVersion() const override;\n\n            void destroy() override;\n\n            IPluginV2DynamicExt* clone() const override;\n\n            void setPluginNamespace(const char* pluginNamespace) override;\n\n            const char* getPluginNamespace() const override;\n\n            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const override;\n\n            void attachToContext(\n                    cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) override;\n\n            void configurePlugin(const DynamicPluginTensorDesc* in, int nbInputs, const DynamicPluginTensorDesc* out, int nbOutputs) override;\n\n            void detachFromContext() override;\n\n        private:\n            void forwardGpu(const float *const * inputs,float * output, cudaStream_t stream,int batchSize = 1);\n            int mClassCount;\n            int mKernelCount;\n            std::vector<Yolo::YoloKernel> mYoloKernel;\n            int mThreadCount = 256;\n    
        void** mAnchor;\n            const char* mPluginNamespace;\n    };\n\n    class YoloPluginCreator : public IPluginCreator\n    {\n        public:\n            YoloPluginCreator();\n\n            ~YoloPluginCreator() override = default;\n\n            const char* getPluginName() const override;\n\n            const char* getPluginVersion() const override;\n\n            const PluginFieldCollection* getFieldNames() override;\n\n            IPluginV2DynamicExt* createPlugin(const char* name, const PluginFieldCollection* fc) override;\n\n            IPluginV2DynamicExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;\n\n            void setPluginNamespace(const char* libNamespace) override\n            {\n                mNamespace = libNamespace;\n            }\n\n            const char* getPluginNamespace() const override\n            {\n                return mNamespace.c_str();\n            }\n\n        private:\n            std::string mNamespace;\n            static PluginFieldCollection mFC;\n            static std::vector<PluginField> mPluginAttributes;\n    };\n    REGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n};\n\n#endif \n"
  },
  {
    "path": "yolov3-spp/yolov3-spp.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <opencv2/opencv.hpp>\n#include <opencv2/dnn/dnn.hpp>\n#include <dirent.h>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include \"yololayer.h\"\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n#define USE_FP16  // comment out this if want to use FP32\n#define DEVICE 0  // GPU id\n#define NMS_THRESH 0.4\n#define BBOX_CONF_THRESH 0.5\n\nusing namespace nvinfer1;\n\n// stuff we know about the network and the input/output blobs\nstatic const int MAX_INPUT_SIZE = 608;\nstatic const int MIN_INPUT_SIZE = 128;\nstatic const int OPT_INPUT_W = 608;\nstatic const int OPT_INPUT_H = 608;\nstatic const int DET_LEN = sizeof(Yolo::Detection) / sizeof(float);\nstatic const int OUTPUT_SIZE = Yolo::MAX_OUTPUT_BBOX_COUNT * DET_LEN + 1;  // we limit the yololayer to output no more than MAX_OUTPUT_BBOX_COUNT bboxes\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nstatic Logger gLogger;\n\ncv::Mat letterbox(cv::Mat& img) {\n    float r = std::min(MAX_INPUT_SIZE / (img.cols*1.0), MAX_INPUT_SIZE / (img.rows*1.0));\n    r = std::min(r, 1.0f);\n    int unpad_w = r * img.cols;\n    int unpad_h = r * img.rows;\n    int dw = (MAX_INPUT_SIZE - unpad_w) % 32;\n    int dh = (MAX_INPUT_SIZE - unpad_h) % 32;\n    cv::Mat re(unpad_h, unpad_w, CV_8UC3);\n    cv::resize(img, re, re.size());\n    cv::Mat out(unpad_h + dh, unpad_w + dw, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(dw / 2, dh / 2, re.cols, re.rows)));\n    return out;\n}\n\ncv::Rect get_rect(cv::Size src_shape, cv::Size pre_shape, float bbox[4]) {\n    float ra = std::min(MAX_INPUT_SIZE / (src_shape.width * 1.0), 
MAX_INPUT_SIZE / (src_shape.height * 1.0));\n    ra = std::min(ra, 1.0f);\n    int unpad_w = ra * src_shape.width;\n    int unpad_h = ra * src_shape.height;\n    int dw = (MAX_INPUT_SIZE - unpad_w) % 32;\n    int dh = (MAX_INPUT_SIZE - unpad_h) % 32;\n\n    int l = bbox[0] - bbox[2]/2.f - dw / 2;\n    int r = bbox[0] + bbox[2]/2.f - dw / 2;\n    int t = bbox[1] - bbox[3]/2.f - dh / 2;\n    int b = bbox[1] + bbox[3]/2.f - dh / 2;\n    l /= ra;\n    r /= ra;\n    t /= ra;\n    b /= ra;\n    return cv::Rect(l, t, r-l, b-t);\n}\n\nfloat iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n        std::max(lbox[0] - lbox[2]/2.f , rbox[0] - rbox[2]/2.f), //left\n        std::min(lbox[0] + lbox[2]/2.f , rbox[0] + rbox[2]/2.f), //right\n        std::max(lbox[1] - lbox[3]/2.f , rbox[1] - rbox[3]/2.f), //top\n        std::min(lbox[1] + lbox[3]/2.f , rbox[1] + rbox[3]/2.f), //bottom\n    };\n\n    if(interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS =(interBox[1]-interBox[0])*(interBox[3]-interBox[2]);\n    return interBoxS/(lbox[2]*lbox[3] + rbox[2]*rbox[3] -interBoxS);\n}\n\nbool cmp(const Yolo::Detection& a, const Yolo::Detection& b) {\n    return a.det_confidence > b.det_confidence;\n}\n\nvoid nms(std::vector<Yolo::Detection>& res, float *output, float nms_thresh = NMS_THRESH) {\n    std::map<float, std::vector<Yolo::Detection>> m;\n    for (int i = 0; i < output[0] && i < Yolo::MAX_OUTPUT_BBOX_COUNT; i++) {\n        if (output[1 + DET_LEN * i + 4] <= BBOX_CONF_THRESH) continue;\n        Yolo::Detection det;\n        memcpy(&det, &output[1 + DET_LEN * i], DET_LEN * sizeof(float));\n        if (m.count(det.class_id) == 0) m.emplace(det.class_id, std::vector<Yolo::Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        //std::cout << it->second[0].class_id << \" --- \" << std::endl;\n        auto& dets = it->second;\n        
std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin()+n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\n// TensorRT weight files have a simple space delimited format:\n// [name] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob (each value is a 32-bit word holding the float's bit pattern)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int 
len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* convBnLeaky(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,  int outch, int ksize, int s, int p, int linx) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[\"module_list.\" + std::to_string(linx) + \".Conv2d.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"module_list.\" + std::to_string(linx) + \".BatchNorm2d\", 1e-5);\n\n    auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kLEAKY_RELU);\n    lr->setAlpha(0.1);\n\n    return lr;\n}\n\n// Create the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    const auto explicitBatch = 1U << 
static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);\n    auto network = builder->createNetworkV2(explicitBatch);\n\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims4{1, 3, -1, -1});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../yolov3-spp_ultralytics68.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    // Yeah I am stupid, I just want to expand the complete arch of darknet..\n    auto lr0 = convBnLeaky(network, weightMap, *data, 32, 3, 1, 1, 0);\n    auto lr1 = convBnLeaky(network, weightMap, *lr0->getOutput(0), 64, 3, 2, 1, 1);\n    auto lr2 = convBnLeaky(network, weightMap, *lr1->getOutput(0), 32, 1, 1, 0, 2);\n    auto lr3 = convBnLeaky(network, weightMap, *lr2->getOutput(0), 64, 3, 1, 1, 3);\n    auto ew4 = network->addElementWise(*lr3->getOutput(0), *lr1->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr5 = convBnLeaky(network, weightMap, *ew4->getOutput(0), 128, 3, 2, 1, 5);\n    auto lr6 = convBnLeaky(network, weightMap, *lr5->getOutput(0), 64, 1, 1, 0, 6);\n    auto lr7 = convBnLeaky(network, weightMap, *lr6->getOutput(0), 128, 3, 1, 1, 7);\n    auto ew8 = network->addElementWise(*lr7->getOutput(0), *lr5->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr9 = convBnLeaky(network, weightMap, *ew8->getOutput(0), 64, 1, 1, 0, 9);\n    auto lr10 = convBnLeaky(network, weightMap, *lr9->getOutput(0), 128, 3, 1, 1, 10);\n    auto ew11 = network->addElementWise(*lr10->getOutput(0), *ew8->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr12 = convBnLeaky(network, weightMap, *ew11->getOutput(0), 256, 3, 2, 1, 12);\n    auto lr13 = convBnLeaky(network, weightMap, *lr12->getOutput(0), 128, 1, 1, 0, 13);\n    auto lr14 = convBnLeaky(network, weightMap, *lr13->getOutput(0), 256, 3, 1, 1, 14);\n    auto ew15 = network->addElementWise(*lr14->getOutput(0), *lr12->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr16 = convBnLeaky(network, weightMap, 
*ew15->getOutput(0), 128, 1, 1, 0, 16);\n    auto lr17 = convBnLeaky(network, weightMap, *lr16->getOutput(0), 256, 3, 1, 1, 17);\n    auto ew18 = network->addElementWise(*lr17->getOutput(0), *ew15->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr19 = convBnLeaky(network, weightMap, *ew18->getOutput(0), 128, 1, 1, 0, 19);\n    auto lr20 = convBnLeaky(network, weightMap, *lr19->getOutput(0), 256, 3, 1, 1, 20);\n    auto ew21 = network->addElementWise(*lr20->getOutput(0), *ew18->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr22 = convBnLeaky(network, weightMap, *ew21->getOutput(0), 128, 1, 1, 0, 22);\n    auto lr23 = convBnLeaky(network, weightMap, *lr22->getOutput(0), 256, 3, 1, 1, 23);\n    auto ew24 = network->addElementWise(*lr23->getOutput(0), *ew21->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr25 = convBnLeaky(network, weightMap, *ew24->getOutput(0), 128, 1, 1, 0, 25);\n    auto lr26 = convBnLeaky(network, weightMap, *lr25->getOutput(0), 256, 3, 1, 1, 26);\n    auto ew27 = network->addElementWise(*lr26->getOutput(0), *ew24->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr28 = convBnLeaky(network, weightMap, *ew27->getOutput(0), 128, 1, 1, 0, 28);\n    auto lr29 = convBnLeaky(network, weightMap, *lr28->getOutput(0), 256, 3, 1, 1, 29);\n    auto ew30 = network->addElementWise(*lr29->getOutput(0), *ew27->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr31 = convBnLeaky(network, weightMap, *ew30->getOutput(0), 128, 1, 1, 0, 31);\n    auto lr32 = convBnLeaky(network, weightMap, *lr31->getOutput(0), 256, 3, 1, 1, 32);\n    auto ew33 = network->addElementWise(*lr32->getOutput(0), *ew30->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr34 = convBnLeaky(network, weightMap, *ew33->getOutput(0), 128, 1, 1, 0, 34);\n    auto lr35 = convBnLeaky(network, weightMap, *lr34->getOutput(0), 256, 3, 1, 1, 35);\n    auto ew36 = network->addElementWise(*lr35->getOutput(0), *ew33->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr37 = 
convBnLeaky(network, weightMap, *ew36->getOutput(0), 512, 3, 2, 1, 37);\n    auto lr38 = convBnLeaky(network, weightMap, *lr37->getOutput(0), 256, 1, 1, 0, 38);\n    auto lr39 = convBnLeaky(network, weightMap, *lr38->getOutput(0), 512, 3, 1, 1, 39);\n    auto ew40 = network->addElementWise(*lr39->getOutput(0), *lr37->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr41 = convBnLeaky(network, weightMap, *ew40->getOutput(0), 256, 1, 1, 0, 41);\n    auto lr42 = convBnLeaky(network, weightMap, *lr41->getOutput(0), 512, 3, 1, 1, 42);\n    auto ew43 = network->addElementWise(*lr42->getOutput(0), *ew40->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr44 = convBnLeaky(network, weightMap, *ew43->getOutput(0), 256, 1, 1, 0, 44);\n    auto lr45 = convBnLeaky(network, weightMap, *lr44->getOutput(0), 512, 3, 1, 1, 45);\n    auto ew46 = network->addElementWise(*lr45->getOutput(0), *ew43->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr47 = convBnLeaky(network, weightMap, *ew46->getOutput(0), 256, 1, 1, 0, 47);\n    auto lr48 = convBnLeaky(network, weightMap, *lr47->getOutput(0), 512, 3, 1, 1, 48);\n    auto ew49 = network->addElementWise(*lr48->getOutput(0), *ew46->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr50 = convBnLeaky(network, weightMap, *ew49->getOutput(0), 256, 1, 1, 0, 50);\n    auto lr51 = convBnLeaky(network, weightMap, *lr50->getOutput(0), 512, 3, 1, 1, 51);\n    auto ew52 = network->addElementWise(*lr51->getOutput(0), *ew49->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr53 = convBnLeaky(network, weightMap, *ew52->getOutput(0), 256, 1, 1, 0, 53);\n    auto lr54 = convBnLeaky(network, weightMap, *lr53->getOutput(0), 512, 3, 1, 1, 54);\n    auto ew55 = network->addElementWise(*lr54->getOutput(0), *ew52->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr56 = convBnLeaky(network, weightMap, *ew55->getOutput(0), 256, 1, 1, 0, 56);\n    auto lr57 = convBnLeaky(network, weightMap, *lr56->getOutput(0), 512, 3, 1, 1, 57);\n    auto 
ew58 = network->addElementWise(*lr57->getOutput(0), *ew55->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr59 = convBnLeaky(network, weightMap, *ew58->getOutput(0), 256, 1, 1, 0, 59);\n    auto lr60 = convBnLeaky(network, weightMap, *lr59->getOutput(0), 512, 3, 1, 1, 60);\n    auto ew61 = network->addElementWise(*lr60->getOutput(0), *ew58->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr62 = convBnLeaky(network, weightMap, *ew61->getOutput(0), 1024, 3, 2, 1, 62);\n    auto lr63 = convBnLeaky(network, weightMap, *lr62->getOutput(0), 512, 1, 1, 0, 63);\n    auto lr64 = convBnLeaky(network, weightMap, *lr63->getOutput(0), 1024, 3, 1, 1, 64);\n    auto ew65 = network->addElementWise(*lr64->getOutput(0), *lr62->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr66 = convBnLeaky(network, weightMap, *ew65->getOutput(0), 512, 1, 1, 0, 66);\n    auto lr67 = convBnLeaky(network, weightMap, *lr66->getOutput(0), 1024, 3, 1, 1, 67);\n    auto ew68 = network->addElementWise(*lr67->getOutput(0), *ew65->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr69 = convBnLeaky(network, weightMap, *ew68->getOutput(0), 512, 1, 1, 0, 69);\n    auto lr70 = convBnLeaky(network, weightMap, *lr69->getOutput(0), 1024, 3, 1, 1, 70);\n    auto ew71 = network->addElementWise(*lr70->getOutput(0), *ew68->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr72 = convBnLeaky(network, weightMap, *ew71->getOutput(0), 512, 1, 1, 0, 72);\n    auto lr73 = convBnLeaky(network, weightMap, *lr72->getOutput(0), 1024, 3, 1, 1, 73);\n    auto ew74 = network->addElementWise(*lr73->getOutput(0), *ew71->getOutput(0), ElementWiseOperation::kSUM);\n    auto lr75 = convBnLeaky(network, weightMap, *ew74->getOutput(0), 512, 1, 1, 0, 75);\n    auto lr76 = convBnLeaky(network, weightMap, *lr75->getOutput(0), 1024, 3, 1, 1, 76);\n    auto lr77 = convBnLeaky(network, weightMap, *lr76->getOutput(0), 512, 1, 1, 0, 77);\n\n    auto pool78 = network->addPoolingNd(*lr77->getOutput(0), PoolingType::kMAX, 
DimsHW{5,5});\n    pool78->setPaddingNd(DimsHW{2, 2});\n    pool78->setStrideNd(DimsHW{1, 1});\n    auto pool80 = network->addPoolingNd(*lr77->getOutput(0), PoolingType::kMAX, DimsHW{9,9});\n    pool80->setPaddingNd(DimsHW{4, 4});\n    pool80->setStrideNd(DimsHW{1, 1});\n    auto pool82 = network->addPoolingNd(*lr77->getOutput(0), PoolingType::kMAX, DimsHW{13,13});\n    pool82->setPaddingNd(DimsHW{6, 6});\n    pool82->setStrideNd(DimsHW{1, 1});\n\n    ITensor* inputTensors83[] = {pool82->getOutput(0), pool80->getOutput(0), pool78->getOutput(0), lr77->getOutput(0)};\n    auto cat83 = network->addConcatenation(inputTensors83, 4);\n\n    auto lr84 = convBnLeaky(network, weightMap, *cat83->getOutput(0), 512, 1, 1, 0, 84);\n    auto lr85 = convBnLeaky(network, weightMap, *lr84->getOutput(0), 1024, 3, 1, 1, 85);\n    auto lr86 = convBnLeaky(network, weightMap, *lr85->getOutput(0), 512, 1, 1, 0, 86);\n    auto lr87 = convBnLeaky(network, weightMap, *lr86->getOutput(0), 1024, 3, 1, 1, 87);\n    IConvolutionLayer* conv88 = network->addConvolutionNd(*lr87->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.88.Conv2d.weight\"], weightMap[\"module_list.88.Conv2d.bias\"]);\n    assert(conv88);\n    auto lr91 = convBnLeaky(network, weightMap, *lr86->getOutput(0), 256, 1, 1, 0, 91);\n\n    float *deval = reinterpret_cast<float*>(malloc(sizeof(float) * 256 * 2 * 2));\n    for (int i = 0; i < 256 * 2 * 2; i++) {\n        deval[i] = 1.0;\n    }\n    Weights deconvwts92{DataType::kFLOAT, deval, 256 * 2 * 2};\n    IDeconvolutionLayer* deconv92 = network->addDeconvolutionNd(*lr91->getOutput(0), 256, DimsHW{2, 2}, deconvwts92, emptywts);\n    assert(deconv92);\n    deconv92->setStrideNd(DimsHW{2, 2});\n    deconv92->setNbGroups(256);\n    weightMap[\"deconv92\"] = deconvwts92;\n\n    ITensor* inputTensors[] = {deconv92->getOutput(0), ew61->getOutput(0)};\n    auto cat93 = network->addConcatenation(inputTensors, 2);\n    auto lr94 = convBnLeaky(network, 
weightMap, *cat93->getOutput(0), 256, 1, 1, 0, 94);\n    auto lr95 = convBnLeaky(network, weightMap, *lr94->getOutput(0), 512, 3, 1, 1, 95);\n    auto lr96 = convBnLeaky(network, weightMap, *lr95->getOutput(0), 256, 1, 1, 0, 96);\n    auto lr97 = convBnLeaky(network, weightMap, *lr96->getOutput(0), 512, 3, 1, 1, 97);\n    auto lr98 = convBnLeaky(network, weightMap, *lr97->getOutput(0), 256, 1, 1, 0, 98);\n    auto lr99 = convBnLeaky(network, weightMap, *lr98->getOutput(0), 512, 3, 1, 1, 99);\n    IConvolutionLayer* conv100 = network->addConvolutionNd(*lr99->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.100.Conv2d.weight\"], weightMap[\"module_list.100.Conv2d.bias\"]);\n    assert(conv100);\n    auto lr103 = convBnLeaky(network, weightMap, *lr98->getOutput(0), 128, 1, 1, 0, 103);\n    Weights deconvwts104{DataType::kFLOAT, deval, 128 * 2 * 2};\n    IDeconvolutionLayer* deconv104 = network->addDeconvolutionNd(*lr103->getOutput(0), 128, DimsHW{2, 2}, deconvwts104, emptywts);\n    assert(deconv104);\n    deconv104->setStrideNd(DimsHW{2, 2});\n    deconv104->setNbGroups(128);\n    ITensor* inputTensors1[] = {deconv104->getOutput(0), ew36->getOutput(0)};\n    auto cat105 = network->addConcatenation(inputTensors1, 2);\n    auto lr106 = convBnLeaky(network, weightMap, *cat105->getOutput(0), 128, 1, 1, 0, 106);\n    auto lr107 = convBnLeaky(network, weightMap, *lr106->getOutput(0), 256, 3, 1, 1, 107);\n    auto lr108 = convBnLeaky(network, weightMap, *lr107->getOutput(0), 128, 1, 1, 0, 108);\n    auto lr109 = convBnLeaky(network, weightMap, *lr108->getOutput(0), 256, 3, 1, 1, 109);\n    auto lr110 = convBnLeaky(network, weightMap, *lr109->getOutput(0), 128, 1, 1, 0, 110);\n    auto lr111 = convBnLeaky(network, weightMap, *lr110->getOutput(0), 256, 3, 1, 1, 111);\n    IConvolutionLayer* conv112 = network->addConvolutionNd(*lr111->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.112.Conv2d.weight\"], 
weightMap[\"module_list.112.Conv2d.bias\"]);\n    assert(conv112);\n\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    const PluginFieldCollection* pluginData = creator->getFieldNames();\n    IPluginV2 *pluginObj = creator->createPlugin(\"yololayer\", pluginData);\n    ITensor* inputTensors_yolo[] = {conv88->getOutput(0), conv100->getOutput(0), conv112->getOutput(0)};\n    auto yolo = network->addPluginV2(inputTensors_yolo, 3, *pluginObj);\n\n    auto dim = yolo->getOutput(0)->getDimensions();\n    std::cout << \"yololayer output shape: \";\n    for (int i = 0; i < dim.nbDims; i++) {\n        std::cout << dim.d[i] << \" \";\n    }\n    std::cout << std::endl;\n    yolo->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*yolo->getOutput(0));\n\n    IOptimizationProfile* profile = builder->createOptimizationProfile();\n    profile->setDimensions(INPUT_BLOB_NAME, OptProfileSelector::kMIN, Dims4(1, 3, MIN_INPUT_SIZE, MIN_INPUT_SIZE));\n    profile->setDimensions(INPUT_BLOB_NAME, OptProfileSelector::kOPT, Dims4(1, 3, OPT_INPUT_H, OPT_INPUT_W));\n    profile->setDimensions(INPUT_BLOB_NAME, OptProfileSelector::kMAX, Dims4(1, 3, MAX_INPUT_SIZE, MAX_INPUT_SIZE));\n    config->addOptimizationProfile(profile);\n\n    // Build engine\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#ifdef USE_FP16\n    config->setFlag(BuilderFlag::kFP16);\n#endif\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = 
createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, cv::Size input_shape) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n    context.setBindingDimensions(inputIndex, Dims4(1, 3, input_shape.height, input_shape.width));\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], 3 * input_shape.height * input_shape.width * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, 3 * input_shape.height * input_shape.width * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueueV2(buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    
cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n                strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(\"yolov3-spp.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    } else if (argc == 3 && std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"yolov3-spp.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n 
           file.close();\n        }\n    } else {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./yolov3-spp -s  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./yolov3-spp -d ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(argv[2], file_names) < 0) {\n        std::cout << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    static float prob[OUTPUT_SIZE];\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n    context->setOptimizationProfile(0);\n\n    int fcount = 0;\n    for (auto f: file_names) {\n        fcount++;\n        std::cout << fcount << \"  \" << f << std::endl;\n        cv::Mat img = cv::imread(std::string(argv[2]) + \"/\" + f);\n        if (img.empty()) continue;\n        cv::Mat pr_img = letterbox(img);\n        std::cout << \"letterbox shape: \" << pr_img.cols << \", \" << pr_img.rows << std::endl;\n        if (pr_img.cols < MIN_INPUT_SIZE || pr_img.rows < MIN_INPUT_SIZE) continue;\n        cv::Mat blob = cv::dnn::blobFromImage(pr_img, 1.0 / 255.0, pr_img.size(), cv::Scalar(0, 0, 0), true, false);\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, blob.ptr<float>(0), prob, pr_img.size());\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n        std::vector<Yolo::Detection> res;\n        nms(res, prob);\n        std::cout << \"num of bbox: \" << res.size() << std::endl;\n        
for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect(img.size(), pr_img.size(), res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n        cv::imwrite(\"_\" + f, img);\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n    return 0;\n}\n"
  },
  {
    "path": "yolov3-tiny/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(yolov3-tiny)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\nif (CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n    message(\"embed_platform on\")\n    include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n    link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n    message(\"embed_platform off\")\n    include_directories(/usr/local/cuda/include)\n    link_directories(/usr/local/cuda/lib64)\nendif()\n\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\n#cuda_add_library(leaky ${PROJECT_SOURCE_DIR}/leaky.cu)\ncuda_add_library(yololayer SHARED ${PROJECT_SOURCE_DIR}/yololayer.cu)\ntarget_link_libraries(yololayer nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(yolov3-tiny ${PROJECT_SOURCE_DIR}/yolov3-tiny.cpp)\ntarget_link_libraries(yolov3-tiny nvinfer)\ntarget_link_libraries(yolov3-tiny cudart)\ntarget_link_libraries(yolov3-tiny yololayer)\ntarget_link_libraries(yolov3-tiny ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "yolov3-tiny/README.md",
    "content": "# yolov3-tiny\n\nThe Pytorch implementation is [ultralytics/yolov3 archive branch](https://github.com/ultralytics/yolov3/tree/archive).\n\n## Excute:\n\n```\n1. generate yolov3-tiny.wts from pytorch implementation with yolov3-tiny.cfg and yolov3-tiny.weights, or download .wts from model zoo\n\ngit clone -b archive https://github.com/ultralytics/yolov3.git\n// download its weights 'yolov3-tiny.pt' or 'yolov3-tiny.weights'\n// put tensorrtx/yolov3-tiny/gen_wts.py into ultralytics/yolov3 and run\npython gen_wts.py yolov3-tiny.weights\n// a file 'yolov3-tiny.wts' will be generated.\n\n2. put yolov3-tiny.wts into tensorrtx/yolov3-tiny, build and run\n\n// go to tensorrtx/yolov3-tiny\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./yolov3-tiny -s             // serialize model to plan file i.e. 'yolov3-tiny.engine'\nsudo ./yolov3-tiny -d  ../../yolov3-spp/samples // deserialize plan file and run inference, the images in samples will be processed.\n\n3. check the images generated, as follows. _zidane.jpg and _bus.jpg\n```\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/78247927-4d9fac00-751e-11ea-8b1b-704a0aeb3fcf.jpg\">\n</p>\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/78247970-60b27c00-751e-11ea-88df-41473fed4823.jpg\">\n</p>\n\n## Config\n\n- Input shape defined in yololayer.h\n- Number of classes defined in yololayer.h\n- FP16/FP32 can be selected by the macro in yolov3-tiny.cpp\n- GPU id can be selected by the macro in yolov3-tiny.cpp\n- NMS thresh in yolov3-tiny.cpp\n- BBox confidence thresh in yolov3-tiny.cpp\n\n## More Information\n\nSee the readme in [home page.](https://github.com/wang-xinyu/tensorrtx)\n\n"
  },
  {
    "path": "yolov3-tiny/gen_wts.py",
    "content": "import struct\nimport sys\nimport torch\nfrom models import *  # noqa: F403\nfrom utils.utils import *  # noqa: F403\n\nmodel = Darknet('cfg/yolov3-tiny.cfg', (608, 608))  # noqa: F405\nweights = sys.argv[1]\ndevice = torch_utils.select_device('0')  # noqa: F405\nif weights.endswith('.pt'):  # pytorch format\n    model.load_state_dict(torch.load(weights, map_location=device, weights_only=False)['model'])\nelse:  # darknet format\n    load_darknet_weights(model, weights)  # noqa: F405\nmodel = model.eval()\n\nwith open('yolov3-tiny.wts', 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolov3-tiny/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the 
stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov3-tiny/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H"
  },
  {
    "path": "yolov3-tiny/utils.h",
    "content": "#ifndef __TRT_UTILS_H_\n#define __TRT_UTILS_H_\n\n#include <algorithm>\n#include <cassert>\n#include <cstdio>\n#include <iostream>\n#include <string>\n#include <vector>\n#include <cudnn.h>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\n#ifndef CUDA_CHECK\n\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n\n#endif\n\nnamespace Tn\n{\n    class Profiler : public nvinfer1::IProfiler\n    {\n    public:\n        // Print the per-layer time averaged over iterationTimes runs, then the averaged total\n        void printLayerTimes(int iterationTimes)\n        {\n            float totalTime = 0;\n            for (size_t i = 0; i < mProfile.size(); i++)\n            {\n                printf(\"%-40.40s %4.3fms\\n\", mProfile[i].first.c_str(), mProfile[i].second / iterationTimes);\n                totalTime += mProfile[i].second;\n            }\n            printf(\"Time over all layers: %4.3f\\n\", totalTime / iterationTimes);\n        }\n    private:\n        typedef std::pair<std::string, float> Record;\n        std::vector<Record> mProfile;\n\n        // Accumulate the reported time per layer name across calls\n        virtual void reportLayerTime(const char* layerName, float ms) TRT_NOEXCEPT\n        {\n            auto record = std::find_if(mProfile.begin(), mProfile.end(), [&](const Record& r){ return r.first == layerName; });\n            if (record == mProfile.end())\n                mProfile.push_back(std::make_pair(layerName, ms));\n            else\n                record->second += ms;\n        }\n    };\n\n    //Logger for TensorRT 
info/warning/errors\n    class Logger : public nvinfer1::ILogger\n    {\n    public:\n\n        Logger(): Logger(Severity::kWARNING) {}\n\n        Logger(Severity severity): reportableSeverity(severity) {}\n\n        void log(Severity severity, const char* msg) TRT_NOEXCEPT override\n        {\n            // suppress messages with severity enum value greater than the reportable\n            if (severity > reportableSeverity) return;\n\n            switch (severity)\n            {\n                case Severity::kINTERNAL_ERROR: std::cerr << \"INTERNAL_ERROR: \"; break;\n                case Severity::kERROR: std::cerr << \"ERROR: \"; break;\n                case Severity::kWARNING: std::cerr << \"WARNING: \"; break;\n                case Severity::kINFO: std::cerr << \"INFO: \"; break;\n                default: std::cerr << \"UNKNOWN: \"; break;\n            }\n            std::cerr << msg << std::endl;\n        }\n\n        Severity reportableSeverity{Severity::kWARNING};\n    };\n\n    template<typename T> \n    void write(char*& buffer, const T& val)\n    {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> \n    void read(const char*& buffer, T& val)\n    {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n}\n\n#endif"
  },
  {
    "path": "yolov3-tiny/yololayer.cu",
    "content": "#include <assert.h>\n#include \"yololayer.h\"\n#include \"utils.h\"\n\nusing namespace Yolo;\n\nnamespace nvinfer1\n{\n    YoloLayerPlugin::YoloLayerPlugin()\n    {\n        mClassCount = CLASS_NUM;\n        mYoloKernel.clear();\n        mYoloKernel.push_back(yolo1);\n        mYoloKernel.push_back(yolo2);\n\n        mKernelCount = mYoloKernel.size();\n    }\n    \n    YoloLayerPlugin::~YoloLayerPlugin()\n    {\n    }\n\n    // create the plugin at runtime from a byte stream\n    YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length)\n    {\n        using namespace Tn;\n        const char *d = reinterpret_cast<const char *>(data), *a = d;\n        read(d, mClassCount);\n        read(d, mThreadCount);\n        read(d, mKernelCount);\n        mYoloKernel.resize(mKernelCount);\n        auto kernelSize = mKernelCount*sizeof(YoloKernel);\n        memcpy(mYoloKernel.data(),d,kernelSize);\n        d += kernelSize;\n\n        assert(d == a + length);\n    }\n\n    void YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT\n    {\n        using namespace Tn;\n        char* d = static_cast<char*>(buffer), *a = d;\n        write(d, mClassCount);\n        write(d, mThreadCount);\n        write(d, mKernelCount);\n        auto kernelSize = mKernelCount*sizeof(YoloKernel);\n        memcpy(d,mYoloKernel.data(),kernelSize);\n        d += kernelSize;\n\n        assert(d == a + getSerializationSize());\n    }\n    \n    size_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT\n    {  \n        return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount)  + sizeof(Yolo::YoloKernel) * mYoloKernel.size();\n    }\n\n    int YoloLayerPlugin::initialize() TRT_NOEXCEPT\n    { \n        return 0;\n    }\n    \n    Dims YoloLayerPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT\n    {\n        //output the result to channel\n        int totalsize = MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / 
sizeof(float);\n\n        return Dims3(totalsize + 1, 1, 1);\n    }\n\n    // Set plugin namespace\n    void YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT\n    {\n        mPluginNamespace = pluginNamespace;\n    }\n\n    const char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT\n    {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    void YoloLayerPlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT\n    {\n    }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\n    const char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    void YoloLayerPlugin::destroy() TRT_NOEXCEPT\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    
IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT\n    {\n        YoloLayerPlugin *p = new YoloLayerPlugin();\n        p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float Logist(float data){ return 1.0f / (1.0f + expf(-data)); };\n\n    __global__ void CalDetection(const float *input, float *output,int noElements, \n            int yoloWidth,int yoloHeight,const float anchors[CHECK_COUNT*2],int classes,int outputElem) {\n \n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= noElements) return;\n\n        int total_grid = yoloWidth * yoloHeight;\n        int bnIdx = idx / total_grid;\n        idx = idx - total_grid*bnIdx;\n        int info_len_i = 5 + classes;\n        const float* curInput = input + bnIdx * (info_len_i * total_grid * CHECK_COUNT);\n\n        for (int k = 0; k < 3; ++k) {\n            int class_id = 0;\n            float max_cls_prob = 0.0;\n            for (int i = 5; i < info_len_i; ++i) {\n                float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);\n                if (p > max_cls_prob) {\n                    max_cls_prob = p;\n                    class_id = i - 5;\n                }\n            }\n            float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);\n            if (max_cls_prob < IGNORE_THRESH || box_prob < IGNORE_THRESH) continue;\n\n            float *res_count = output + bnIdx*outputElem;\n            int count = (int)atomicAdd(res_count, 1);\n            if (count >= MAX_OUTPUT_BBOX_COUNT) return;\n            char* data = (char * )res_count + sizeof(float) + count*sizeof(Detection);\n            Detection* det =  (Detection*)(data);\n\n            int row = idx / yoloWidth;\n            int col = idx % yoloWidth;\n\n            //Location\n            det->bbox[0] = (col + Logist(curInput[idx + k * info_len_i * total_grid + 0 * total_grid])) * INPUT_W / yoloWidth;\n            
det->bbox[1] = (row + Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid])) * INPUT_H / yoloHeight;\n            det->bbox[2] = expf(curInput[idx + k * info_len_i * total_grid + 2 * total_grid]) * anchors[2*k];\n            det->bbox[3] = expf(curInput[idx + k * info_len_i * total_grid + 3 * total_grid]) * anchors[2*k + 1];\n            det->det_confidence = box_prob;\n            det->class_id = class_id;\n            det->class_confidence = max_cls_prob;\n        }\n    }\n\n    void YoloLayerPlugin::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize) {\n        void* devAnchor;\n        size_t AnchorLen = sizeof(float)* CHECK_COUNT*2;\n        CUDA_CHECK(cudaMalloc(&devAnchor,AnchorLen));\n\n        int outputElem = 1 + MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / sizeof(float);\n\n        for(int idx = 0 ; idx < batchSize; ++idx) {\n            CUDA_CHECK(cudaMemset(output + idx*outputElem, 0, sizeof(float)));\n        }\n        int numElem = 0;\n        for (unsigned int i = 0;i< mYoloKernel.size();++i)\n        {\n            const auto& yolo = mYoloKernel[i];\n            numElem = yolo.width*yolo.height*batchSize;\n            if (numElem < mThreadCount)\n                mThreadCount = numElem;\n            CUDA_CHECK(cudaMemcpy(devAnchor, yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n            CalDetection<<< (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount>>>\n                (inputs[i],output, numElem, yolo.width, yolo.height, (float *)devAnchor, mClassCount ,outputElem);\n        }\n\n        CUDA_CHECK(cudaFree(devAnchor));\n    }\n\n\n    int YoloLayerPlugin::enqueue(int batchSize, const void*const * inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT\n    {\n        //assert(batchSize == 1);\n        //GPU\n        //CUDA_CHECK(cudaStreamSynchronize(stream));\n        forwardGpu((const float *const *)inputs, 
(float*)outputs[0], stream, batchSize);\n\n        return 0;\n    }\n\n    PluginFieldCollection YoloPluginCreator::mFC{};\n    std::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\n    YoloPluginCreator::YoloPluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    const PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT\n    {\n        return &mFC;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT\n    {\n        YoloLayerPlugin* obj = new YoloLayerPlugin();\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call YoloLayerPlugin::destroy()\n        YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n}\n"
  },
  {
    "path": "yolov3-tiny/yololayer.h",
    "content": "#ifndef _YOLO_LAYER_H\n#define _YOLO_LAYER_H\n\n#include <vector>\n#include <string>\n#include \"NvInfer.h\"\n#include \"macros.h\"\n\n\nnamespace Yolo\n{\n    static constexpr int CHECK_COUNT = 3;\n    static constexpr float IGNORE_THRESH = 0.1f;\n    static constexpr int MAX_OUTPUT_BBOX_COUNT = 1000;\n    static constexpr int CLASS_NUM = 80;\n    static constexpr int INPUT_H = 608;\n    static constexpr int INPUT_W = 608;\n\n    struct YoloKernel\n    {\n        int width;\n        int height;\n        float anchors[CHECK_COUNT*2];\n    };\n\n    static constexpr YoloKernel yolo1 = {\n        INPUT_W / 32,\n        INPUT_H / 32,\n        {81,82, 135,169, 344,319}\n    };\n    static constexpr YoloKernel yolo2 = {\n        INPUT_W / 16,\n        INPUT_H / 16,\n        {23,27, 37,58, 81,82}\n    };\n\n    static constexpr int LOCATIONS = 4;\n    struct alignas(float) Detection{\n        //x y w h\n        float bbox[LOCATIONS];\n        float det_confidence;\n        float class_id;\n        float class_confidence;\n    };\n}\n\n\nnamespace nvinfer1\n{\n    class YoloLayerPlugin: public IPluginV2IOExt\n    {\n        public:\n            explicit YoloLayerPlugin();\n            YoloLayerPlugin(const void* data, size_t length);\n\n            ~YoloLayerPlugin();\n\n            int getNbOutputs() const TRT_NOEXCEPT override\n            {\n                return 1;\n            }\n\n            Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n            int initialize() TRT_NOEXCEPT override;\n\n            virtual void terminate() TRT_NOEXCEPT override {};\n\n            virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0;}\n\n            virtual int enqueue(int batchSize, const void*const * inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;\n\n            virtual size_t getSerializationSize() const TRT_NOEXCEPT 
override;\n\n            virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {\n                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n            }\n\n            const char* getPluginType() const TRT_NOEXCEPT override;\n\n            const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n            void destroy() TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n            void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n            const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override;\n\n            bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;\n\n            bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n            void attachToContext(\n                    cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n            void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT override;\n\n            void detachFromContext() TRT_NOEXCEPT override;\n\n        private:\n            void forwardGpu(const float *const * inputs,float * output, cudaStream_t stream,int batchSize = 1);\n            int mClassCount;\n            int mKernelCount;\n            std::vector<Yolo::YoloKernel> mYoloKernel;\n            int mThreadCount = 256;\n            const char* mPluginNamespace;\n    };\n\n    class YoloPluginCreator : public IPluginCreator\n    {\n        public:\n            YoloPluginCreator();\n\n         
   ~YoloPluginCreator() override = default;\n\n            const char* getPluginName() const TRT_NOEXCEPT override;\n\n            const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n            const PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT override;\n\n            void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override\n            {\n                mNamespace = libNamespace;\n            }\n\n            const char* getPluginNamespace() const TRT_NOEXCEPT override\n            {\n                return mNamespace.c_str();\n            }\n\n        private:\n            std::string mNamespace;\n            static PluginFieldCollection mFC;\n            static std::vector<PluginField> mPluginAttributes;\n    };\n    REGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n};\n\n#endif \n"
  },
  {
    "path": "yolov3-tiny/yolov3-tiny.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <opencv2/opencv.hpp>\n#include <dirent.h>\n#include \"NvInfer.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include \"yololayer.h\"\n\n#define CHECK(status) \\\n    do\\\n    {\\\n        auto ret = (status);\\\n        if (ret != 0)\\\n        {\\\n            std::cerr << \"Cuda failure: \" << ret << std::endl;\\\n            abort();\\\n        }\\\n    } while (0)\n\n#define USE_FP16  // comment out this if want to use FP32\n#define DEVICE 0  // GPU id\n#define NMS_THRESH 0.5\n#define BBOX_CONF_THRESH 0.4\n\nusing namespace nvinfer1;\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = Yolo::INPUT_H;\nstatic const int INPUT_W = Yolo::INPUT_W;\nstatic const int OUTPUT_SIZE = 1000 * 7 + 1;  // we assume the yololayer outputs no more than 1000 boxes that conf >= 0.1\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nstatic Logger gLogger;\n\ncv::Mat preprocess_img(cv::Mat& img) {\n    int w, h, x, y;\n    float r_w = INPUT_W / (img.cols*1.0);\n    float r_h = INPUT_H / (img.rows*1.0);\n    if (r_h > r_w) {\n        w = INPUT_W;\n        h = r_w * img.rows;\n        x = 0;\n        y = (INPUT_H - h) / 2;\n    } else {\n        w = r_h* img.cols;\n        h = INPUT_H;\n        x = (INPUT_W - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_CUBIC);\n    cv::Mat out(INPUT_H, INPUT_W, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    int l, r, t, b;\n    float r_w = INPUT_W / (img.cols * 1.0);\n    float r_h = INPUT_H / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] - bbox[2]/2.f;\n        r = bbox[0] + bbox[2]/2.f;\n        t = bbox[1] - 
bbox[3]/2.f - (INPUT_H - r_w * img.rows) / 2;\n        b = bbox[1] + bbox[3]/2.f - (INPUT_H - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - bbox[2]/2.f - (INPUT_W - r_h * img.cols) / 2;\n        r = bbox[0] + bbox[2]/2.f - (INPUT_W - r_h * img.cols) / 2;\n        t = bbox[1] - bbox[3]/2.f;\n        b = bbox[1] + bbox[3]/2.f;\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    return cv::Rect(l, t, r-l, b-t);\n}\n\nfloat iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n        std::max(lbox[0] - lbox[2]/2.f , rbox[0] - rbox[2]/2.f), //left\n        std::min(lbox[0] + lbox[2]/2.f , rbox[0] + rbox[2]/2.f), //right\n        std::max(lbox[1] - lbox[3]/2.f , rbox[1] - rbox[3]/2.f), //top\n        std::min(lbox[1] + lbox[3]/2.f , rbox[1] + rbox[3]/2.f), //bottom\n    };\n\n    if(interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS =(interBox[1]-interBox[0])*(interBox[3]-interBox[2]);\n    return interBoxS/(lbox[2]*lbox[3] + rbox[2]*rbox[3] -interBoxS);\n}\n\nbool cmp(const Yolo::Detection& a, const Yolo::Detection& b) {\n    return a.det_confidence > b.det_confidence;\n}\n\nvoid nms(std::vector<Yolo::Detection>& res, float *output, float nms_thresh = NMS_THRESH) {\n    std::map<float, std::vector<Yolo::Detection>> m;\n    for (int i = 0; i < output[0] && i < 1000; i++) {\n        if (output[1 + 7 * i + 4] <= BBOX_CONF_THRESH) continue;\n        Yolo::Detection det;\n        memcpy(&det, &output[1 + 7 * i], 7 * sizeof(float));\n        if (m.count(det.class_id) == 0) m.emplace(det.class_id, std::vector<Yolo::Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        //std::cout << it->second[0].class_id << \" --- \" << std::endl;\n        auto& dets = it->second;\n        std::sort(dets.begin(), 
dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin()+n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        \n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + 
\".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* convBnLeaky(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,  int outch, int ksize, int s, int p, int linx) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[\"module_list.\" + std::to_string(linx) + \".Conv2d.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"module_list.\" + std::to_string(linx) + \".BatchNorm2d\", 1e-4);\n\n    auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kLEAKY_RELU);\n    lr->setAlpha(0.1);\n\n    return lr;\n}\n\n// Creat the engine using only the API and not any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input 
tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../yolov3-tiny.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    auto lr0 = convBnLeaky(network, weightMap, *data, 16, 3, 1, 1, 0);\n    auto pool1 = network->addPoolingNd(*lr0->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    pool1->setStrideNd(DimsHW{2, 2});\n    auto lr2 = convBnLeaky(network, weightMap, *pool1->getOutput(0), 32, 3, 1, 1, 2);\n    auto pool3 = network->addPoolingNd(*lr2->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    pool3->setStrideNd(DimsHW{2, 2});\n    auto lr4 = convBnLeaky(network, weightMap, *pool3->getOutput(0), 64, 3, 1, 1, 4);\n    auto pool5 = network->addPoolingNd(*lr4->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    pool5->setStrideNd(DimsHW{2, 2});\n    auto lr6 = convBnLeaky(network, weightMap, *pool5->getOutput(0), 128, 3, 1, 1, 6);\n    auto pool7 = network->addPoolingNd(*lr6->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    pool7->setStrideNd(DimsHW{2, 2});\n    auto lr8 = convBnLeaky(network, weightMap, *pool7->getOutput(0), 256, 3, 1, 1, 8);\n    auto pool9 = network->addPoolingNd(*lr8->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    pool9->setStrideNd(DimsHW{2, 2});\n    auto lr10 = convBnLeaky(network, weightMap, *pool9->getOutput(0), 512, 3, 1, 1, 10);\n    auto pad11 = network->addPaddingNd(*lr10->getOutput(0), DimsHW{0, 0}, DimsHW{1, 1});\n    auto pool11 = network->addPoolingNd(*pad11->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});\n    pool11->setStrideNd(DimsHW{1, 1});\n    auto lr12 = convBnLeaky(network, weightMap, *pool11->getOutput(0), 1024, 3, 1, 1, 12);\n    auto lr13 = convBnLeaky(network, weightMap, *lr12->getOutput(0), 256, 1, 1, 0, 13);\n    auto lr14 = convBnLeaky(network, weightMap, *lr13->getOutput(0), 512, 3, 1, 1, 14);\n    
IConvolutionLayer* conv15 = network->addConvolutionNd(*lr14->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.15.Conv2d.weight\"], weightMap[\"module_list.15.Conv2d.bias\"]);\n    // 16 is yolo\n    auto l17 = lr13;\n    auto lr18 = convBnLeaky(network, weightMap, *l17->getOutput(0), 128, 1, 1, 0, 18);\n\n    float *deval = reinterpret_cast<float*>(malloc(sizeof(float) * 128 * 2 * 2));\n    for (int i = 0; i < 128 * 2 * 2; i++) {\n        deval[i] = 1.0;\n    }\n    Weights deconvwts19{DataType::kFLOAT, deval, 128 * 2 * 2};\n    IDeconvolutionLayer* deconv19 = network->addDeconvolutionNd(*lr18->getOutput(0), 128, DimsHW{2, 2}, deconvwts19, emptywts);\n    assert(deconv19);\n    deconv19->setStrideNd(DimsHW{2, 2});\n    deconv19->setNbGroups(128);\n    weightMap[\"deconv19\"] = deconvwts19;\n\n    ITensor* inputTensors[] = {deconv19->getOutput(0), lr8->getOutput(0)};\n    auto cat20 = network->addConcatenation(inputTensors, 2);\n    auto lr21 = convBnLeaky(network, weightMap, *cat20->getOutput(0), 256, 3, 1, 1, 21);\n    IConvolutionLayer* conv22 = network->addConvolutionNd(*lr21->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.22.Conv2d.weight\"], weightMap[\"module_list.22.Conv2d.bias\"]);\n    // 22 is yolo\n\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    const PluginFieldCollection* pluginData = creator->getFieldNames();\n    IPluginV2 *pluginObj = creator->createPlugin(\"yololayer\", pluginData);\n    ITensor* inputTensors_yolo[] = {conv15->getOutput(0), conv22->getOutput(0)};\n    auto yolo = network->addPluginV2(inputTensors_yolo, 2, *pluginObj);\n\n    yolo->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*yolo->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#ifdef USE_FP16\n    config->setFlag(BuilderFlag::kFP16);\n#endif\n    std::cout << 
\"Building engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CHECK(cudaStreamCreate(&stream));\n\n 
   // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CHECK(cudaFree(buffers[inputIndex]));\n    CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n                strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(1, &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(\"yolov3-tiny.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n            return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        
modelStream->destroy();\n        return 0;\n    } else if (argc == 3 && std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"yolov3-tiny.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./yolov3-tiny -s  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./yolov3-tiny -d ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(argv[2], file_names) < 0) {\n        std::cout << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // prepare input data ---------------------------\n    static float data[3 * INPUT_H * INPUT_W];\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n    //    data[i] = 1.0;\n    static float prob[OUTPUT_SIZE];\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    int fcount = 0;\n    for (auto f: file_names) {\n        fcount++;\n        std::cout << fcount << \"  \" << f << std::endl;\n        cv::Mat img = cv::imread(std::string(argv[2]) + \"/\" + f);\n        if (img.empty()) continue;\n        cv::Mat pr_img = preprocess_img(img);\n        for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n            data[i] = pr_img.at<cv::Vec3b>(i)[2] / 255.0;\n            data[i + INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[1] 
/ 255.0;\n            data[i + 2 * INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[0] / 255.0;\n        }\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, 1);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n        std::vector<Yolo::Detection> res;\n        nms(res, prob);\n        for (int i=0; i<20; i++) {\n            std::cout << prob[i] << \",\";\n        }\n        std::cout << res.size() << std::endl;\n        for (size_t j = 0; j < res.size(); j++) {\n            float *p = (float*)&res[j];\n            for (size_t k = 0; k < 7; k++) {\n                std::cout << p[k] << \", \";\n            }\n            std::cout << std::endl;\n            cv::Rect r = get_rect(img, res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n        cv::imwrite(\"_\" + f, img);\n    }\n\n    // Destroy the engine\n    context->destroy();\n    engine->destroy();\n    runtime->destroy();\n    return 0;\n}\n"
  },
  {
    "path": "yolov4/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 2.6)\n\nproject(yolov4)\n\nadd_definitions(-std=c++11)\n\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\ninclude_directories(/usr/include/x86_64-linux-gnu/)\nlink_directories(/usr/lib/x86_64-linux-gnu/)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\ncuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/yololayer.cu ${PROJECT_SOURCE_DIR}/mish.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(yolov4 ${PROJECT_SOURCE_DIR}/yolov4.cpp)\ntarget_link_libraries(yolov4 nvinfer)\ntarget_link_libraries(yolov4 cudart)\ntarget_link_libraries(yolov4 myplugins)\ntarget_link_libraries(yolov4 ${OpenCV_LIBS})\n\nadd_definitions(-O2 -pthread)\n\n"
  },
  {
    "path": "yolov4/README.md",
    "content": "# yolov4\n\nThe Pytorch implementation is from [ultralytics/yolov3 archive branch](https://github.com/ultralytics/yolov3/tree/archive). It can load yolov4.cfg and yolov4.weights(from AlexeyAB/darknet).\n\n## Config\n\n- Input shape `INPUT_H`, `INPUT_W` defined in yololayer.h\n- Number of classes `CLASS_NUM` defined in yololayer.h\n- FP16/FP32 can be selected by the macro `USE_FP16` in yolov4.cpp\n- GPU id can be selected by the macro `DEVICE` in yolov4.cpp\n- NMS thresh `NMS_THRESH` in yolov4.cpp\n- bbox confidence threshold `BBOX_CONF_THRESH` in yolov4.cpp\n- `BATCH_SIZE` in yolov4.cpp\n\n## How to run\n\n1. generate yolov4.wts from pytorch implementation with yolov4.cfg and yolov4.weights, or download .wts from model zoo\n\n```\ngit clone https://github.com/wang-xinyu/tensorrtx.git\ngit clone -b archive https://github.com/ultralytics/yolov3.git\n// download yolov4.weights from https://github.com/AlexeyAB/darknet#pre-trained-models\ncp {tensorrtx}/yolov4/gen_wts.py {ultralytics/yolov3/}\ncd {ultralytics/yolov3/}\npython gen_wts.py yolov4.weights\n// a file 'yolov4.wts' will be generated.\n// the master branch of yolov3 should work, if not, you can checkout be87b41aa2fe59be8e62f4b488052b24ad0bd450\n```\n\n2. put yolov4.wts into {tensorrtx}/yolov4, build and run\n\n```\nmv yolov4.wts {tensorrtx}/yolov4/\ncd {tensorrtx}/yolov4\nmkdir build\ncd build\ncmake ..\nmake\nsudo ./yolov4 -s                          // serialize model to plan file i.e. 'yolov4.engine'\nsudo ./yolov4 -d ../../yolov3-spp/samples // deserialize plan file and run inference, the images in samples will be processed.\n```\n\n3. check the images generated, as follows. 
_zidane.jpg and _bus.jpg\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/80863728-cbd3a780-8cb0-11ea-8640-7983bb41c354.jpg\">\n</p>\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/80863730-cfffc500-8cb0-11ea-810e-94d693e71d80.jpg\">\n</p>\n\n## More Information\n\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx).\n"
  },
  {
    "path": "yolov4/gen_wts.py",
    "content": "import struct\nimport sys\nimport torch\nfrom models import *  # noqa: F403\nfrom utils.utils import *  # noqa: F403\n\nmodel = Darknet('cfg/yolov4.cfg', (608, 608))  # noqa: F405\nweights = sys.argv[1]\ndevice = torch_utils.select_device('0')  # noqa: F405\nif weights.endswith('.pt'):  # pytorch format\n    model.load_state_dict(torch.load(weights, map_location=device, weights_only=False)['model'])\nelse:  # darknet format\n    load_darknet_weights(model, weights)  # noqa: F405\n\nwith open('yolov4.wts', 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolov4/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // 
resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer1::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) override\n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! 
with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! \\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! 
\\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//!        (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov4/mish.cu",
    "content": "#include <cmath>\n#include <stdio.h>\n#include <cassert>\n#include <iostream>\n#include \"mish.h\"\n\nnamespace nvinfer1\n{\n    MishPlugin::MishPlugin()\n    {\n    }\n\n    MishPlugin::~MishPlugin()\n    {\n    }\n\n    // create the plugin at runtime from a byte stream\n    MishPlugin::MishPlugin(const void* data, size_t length)\n    {\n        assert(length == sizeof(input_size_));\n        input_size_ = *reinterpret_cast<const int*>(data);\n    }\n\n    void MishPlugin::serialize(void* buffer) const\n    {\n        *reinterpret_cast<int*>(buffer) = input_size_;\n    }\n\n    size_t MishPlugin::getSerializationSize() const\n    {  \n        return sizeof(input_size_);\n    }\n\n    int MishPlugin::initialize()\n    { \n        return 0;\n    }\n\n    Dims MishPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims)\n    {\n        assert(nbInputDims == 1);\n        assert(index == 0);\n        input_size_ = inputs[0].d[0] * inputs[0].d[1] * inputs[0].d[2];\n        // Output dimensions\n        return Dims3(inputs[0].d[0], inputs[0].d[1], inputs[0].d[2]);\n    }\n\n    // Set plugin namespace\n    void MishPlugin::setPluginNamespace(const char* pluginNamespace)\n    {\n        mPluginNamespace = pluginNamespace;\n    }\n\n    const char* MishPlugin::getPluginNamespace() const\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType MishPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const\n    {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool MishPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const\n    {\n        return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool MishPlugin::canBroadcastInputAcrossBatch(int 
inputIndex) const\n    {\n        return false;\n    }\n\n    void MishPlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput)\n    {\n    }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void MishPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator)\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void MishPlugin::detachFromContext() {}\n\n    const char* MishPlugin::getPluginType() const\n    {\n        return \"Mish_TRT\";\n    }\n\n    const char* MishPlugin::getPluginVersion() const\n    {\n        return \"1\";\n    }\n\n    void MishPlugin::destroy()\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    IPluginV2IOExt* MishPlugin::clone() const\n    {\n        MishPlugin *p = new MishPlugin();\n        p->input_size_ = input_size_;\n        p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float tanh_activate_kernel(float x){return (2/(1 + expf(-2*x)) - 1);}\n\n    __device__ float softplus_kernel(float x, float threshold = 20) {\n        if (x > threshold) return x;                // too large\n        else if (x < -threshold) return expf(x);    // too small\n        return logf(expf(x) + 1);\n    }\n\n    __global__ void mish_kernel(const float *input, float *output, int num_elem) {\n\n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= num_elem) return;\n\n        //float t = exp(input[idx]);\n        //if (input[idx] > 20.0) {\n        //    t *= t;\n        //    output[idx] = (t - 1.0) / (t + 1.0);\n        //} else {\n        //    float tt = t * t;\n        //    output[idx] = (tt + 2.0 * t) / (tt + 2.0 * t + 2.0);\n        //}\n        //output[idx] *= input[idx];\n        output[idx] = input[idx] * tanh_activate_kernel(softplus_kernel(input[idx]));\n    }\n\n    
void MishPlugin::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize) {\n        int block_size = thread_count_;\n        int grid_size = (input_size_ * batchSize + block_size - 1) / block_size;\n        // launch on the stream passed to enqueue() rather than the default stream\n        mish_kernel<<<grid_size, block_size, 0, stream>>>(inputs[0], output, input_size_ * batchSize);\n    }\n\n    int MishPlugin::enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream)\n    {\n        //assert(batchSize == 1);\n        //GPU\n        //CUDA_CHECK(cudaStreamSynchronize(stream));\n        forwardGpu((const float *const *)inputs, (float*)outputs[0], stream, batchSize);\n        return 0;\n    }\n\n    PluginFieldCollection MishPluginCreator::mFC{};\n    std::vector<PluginField> MishPluginCreator::mPluginAttributes;\n\n    MishPluginCreator::MishPluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* MishPluginCreator::getPluginName() const\n    {\n        return \"Mish_TRT\";\n    }\n\n    const char* MishPluginCreator::getPluginVersion() const\n    {\n        return \"1\";\n    }\n\n    const PluginFieldCollection* MishPluginCreator::getFieldNames()\n    {\n        return &mFC;\n    }\n\n    IPluginV2IOExt* MishPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc)\n    {\n        MishPlugin* obj = new MishPlugin();\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* MishPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength)\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call MishPlugin::destroy()\n        MishPlugin* obj = new MishPlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n}\n\n"
  },
  {
    "path": "yolov4/mish.h",
    "content": "#ifndef _MISH_PLUGIN_H\n#define _MISH_PLUGIN_H\n\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n\nnamespace nvinfer1\n{\n    class MishPlugin: public IPluginV2IOExt\n    {\n        public:\n            explicit MishPlugin();\n            MishPlugin(const void* data, size_t length);\n\n            ~MishPlugin();\n\n            int getNbOutputs() const override\n            {\n                return 1;\n            }\n\n            Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override;\n\n            int initialize() override;\n\n            virtual void terminate() override {};\n\n            virtual size_t getWorkspaceSize(int maxBatchSize) const override { return 0;}\n\n            virtual int enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream) override;\n\n            virtual size_t getSerializationSize() const override;\n\n            virtual void serialize(void* buffer) const override;\n\n            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const override {\n                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n            }\n\n            const char* getPluginType() const override;\n\n            const char* getPluginVersion() const override;\n\n            void destroy() override;\n\n            IPluginV2IOExt* clone() const override;\n\n            void setPluginNamespace(const char* pluginNamespace) override;\n\n            const char* getPluginNamespace() const override;\n\n            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const override;\n\n            bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const override;\n\n            bool canBroadcastInputAcrossBatch(int inputIndex) const override;\n\n            void attachToContext(\n     
               cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) override;\n\n            void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) override;\n\n            void detachFromContext() override;\n\n            int input_size_;\n        private:\n            void forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize = 1);\n            int thread_count_ = 256;\n            const char* mPluginNamespace;\n    };\n\n    class MishPluginCreator : public IPluginCreator\n    {\n        public:\n            MishPluginCreator();\n\n            ~MishPluginCreator() override = default;\n\n            const char* getPluginName() const override;\n\n            const char* getPluginVersion() const override;\n\n            const PluginFieldCollection* getFieldNames() override;\n\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) override;\n\n            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;\n\n            void setPluginNamespace(const char* libNamespace) override\n            {\n                mNamespace = libNamespace;\n            }\n\n            const char* getPluginNamespace() const override\n            {\n                return mNamespace.c_str();\n            }\n\n        private:\n            std::string mNamespace;\n            static PluginFieldCollection mFC;\n            static std::vector<PluginField> mPluginAttributes;\n    };\n    REGISTER_TENSORRT_PLUGIN(MishPluginCreator);\n};\n#endif \n"
  },
  {
    "path": "yolov4/utils.h",
    "content": "#ifndef __TRT_UTILS_H_\n#define __TRT_UTILS_H_\n\n#include <cassert>\n#include <iostream>\n#include <vector>\n#include <algorithm>\n#include <cudnn.h>\n\n#ifndef CUDA_CHECK\n\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n\n#endif\n\nnamespace Tn\n{\n    // Copy val into the byte buffer and advance the cursor past it\n    template<typename T> \n    void write(char*& buffer, const T& val)\n    {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    // Read a T from the byte buffer into val and advance the cursor past it\n    template<typename T> \n    void read(const char*& buffer, T& val)\n    {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n}\n\n#endif\n"
  },
  {
    "path": "yolov4/yololayer.cu",
    "content": "#include <assert.h>\n#include \"yololayer.h\"\n#include \"utils.h\"\n\nusing namespace Yolo;\n\nnamespace nvinfer1\n{\n    YoloLayerPlugin::YoloLayerPlugin()\n    {\n        mClassCount = CLASS_NUM;\n        mYoloKernel.clear();\n        mYoloKernel.push_back(yolo1);\n        mYoloKernel.push_back(yolo2);\n        mYoloKernel.push_back(yolo3);\n\n        mKernelCount = mYoloKernel.size();\n\n        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n        size_t AnchorLen = sizeof(float)* CHECK_COUNT*2;\n        for(int ii = 0; ii < mKernelCount; ii ++)\n        {\n            CUDA_CHECK(cudaMalloc(&mAnchor[ii],AnchorLen));\n            const auto& yolo = mYoloKernel[ii];\n            CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n        }\n    }\n    \n    YoloLayerPlugin::~YoloLayerPlugin()\n    {\n    }\n\n    // create the plugin at runtime from a byte stream\n    YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length)\n    {\n        using namespace Tn;\n        const char *d = reinterpret_cast<const char *>(data), *a = d;\n        read(d, mClassCount);\n        read(d, mThreadCount);\n        read(d, mKernelCount);\n        mYoloKernel.resize(mKernelCount);\n        auto kernelSize = mKernelCount*sizeof(YoloKernel);\n        memcpy(mYoloKernel.data(),d,kernelSize);\n        d += kernelSize;\n\n        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n        size_t AnchorLen = sizeof(float)* CHECK_COUNT*2;\n        for(int ii = 0; ii < mKernelCount; ii ++)\n        {\n            CUDA_CHECK(cudaMalloc(&mAnchor[ii],AnchorLen));\n            const auto& yolo = mYoloKernel[ii];\n            CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n        }\n\n        assert(d == a + length);\n    }\n\n    void YoloLayerPlugin::serialize(void* buffer) const\n    {\n        using namespace Tn;\n        char* d = 
static_cast<char*>(buffer), *a = d;\n        write(d, mClassCount);\n        write(d, mThreadCount);\n        write(d, mKernelCount);\n        auto kernelSize = mKernelCount*sizeof(YoloKernel);\n        memcpy(d,mYoloKernel.data(),kernelSize);\n        d += kernelSize;\n\n        assert(d == a + getSerializationSize());\n    }\n    \n    size_t YoloLayerPlugin::getSerializationSize() const\n    {  \n        return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount)  + sizeof(Yolo::YoloKernel) * mYoloKernel.size();\n    }\n\n    int YoloLayerPlugin::initialize()\n    { \n        return 0;\n    }\n    \n    Dims YoloLayerPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims)\n    {\n        //output the result to channel\n        int totalsize = MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / sizeof(float);\n\n        return Dims3(totalsize + 1, 1, 1);\n    }\n\n    // Set plugin namespace\n    void YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace)\n    {\n        mPluginNamespace = pluginNamespace;\n    }\n\n    const char* YoloLayerPlugin::getPluginNamespace() const\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const\n    {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const\n    {\n        return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const\n    {\n        return false;\n    }\n\n    void YoloLayerPlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput)\n    {\n   
 }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator)\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void YoloLayerPlugin::detachFromContext() {}\n\n    const char* YoloLayerPlugin::getPluginType() const\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloLayerPlugin::getPluginVersion() const\n    {\n        return \"1\";\n    }\n\n    void YoloLayerPlugin::destroy()\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    IPluginV2IOExt* YoloLayerPlugin::clone() const\n    {\n        YoloLayerPlugin *p = new YoloLayerPlugin();\n        p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float Logist(float data){ return 1./(1. + exp(-data)); };\n\n    __global__ void CalDetection(const float *input, float *output,int noElements, \n            int yoloWidth,int yoloHeight,const float anchors[CHECK_COUNT*2],int classes,int outputElem) {\n \n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= noElements) return;\n\n        int total_grid = yoloWidth * yoloHeight;\n        int bnIdx = idx / total_grid;\n        idx = idx - total_grid*bnIdx;\n        int info_len_i = 5 + classes;\n        const float* curInput = input + bnIdx * (info_len_i * total_grid * CHECK_COUNT);\n\n        for (int k = 0; k < 3; ++k) {\n            int class_id = 0;\n            float max_cls_prob = 0.0;\n            for (int i = 5; i < info_len_i; ++i) {\n                float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);\n                if (p > max_cls_prob) {\n                    max_cls_prob = p;\n                    class_id = i - 5;\n                }\n            }\n            float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 
* total_grid]);\n            if (max_cls_prob < IGNORE_THRESH || box_prob < IGNORE_THRESH) continue;\n\n            float *res_count = output + bnIdx*outputElem;\n            int count = (int)atomicAdd(res_count, 1);\n            if (count >= MAX_OUTPUT_BBOX_COUNT) return;\n            char* data = (char * )res_count + sizeof(float) + count*sizeof(Detection);\n            Detection* det =  (Detection*)(data);\n\n            int row = idx / yoloWidth;\n            int col = idx % yoloWidth;\n\n            //Location\n            det->bbox[0] = (col + Logist(curInput[idx + k * info_len_i * total_grid + 0 * total_grid])) * INPUT_W / yoloWidth;\n            det->bbox[1] = (row + Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid])) * INPUT_H / yoloHeight;\n            det->bbox[2] = exp(curInput[idx + k * info_len_i * total_grid + 2 * total_grid]) * anchors[2*k];\n            det->bbox[3] = exp(curInput[idx + k * info_len_i * total_grid + 3 * total_grid]) * anchors[2*k + 1];\n            det->det_confidence = box_prob;\n            det->class_id = class_id;\n            det->class_confidence = max_cls_prob;\n        }\n    }\n\n    void YoloLayerPlugin::forwardGpu(const float *const * inputs, float* output, cudaStream_t stream, int batchSize) {\n\n        int outputElem = 1 + MAX_OUTPUT_BBOX_COUNT * sizeof(Detection) / sizeof(float);\n\n        for(int idx = 0 ; idx < batchSize; ++idx) {\n            CUDA_CHECK(cudaMemset(output + idx*outputElem, 0, sizeof(float)));\n        }\n        int numElem = 0;\n        for (unsigned int i = 0;i< mYoloKernel.size();++i)\n        {\n            const auto& yolo = mYoloKernel[i];\n            numElem = yolo.width*yolo.height*batchSize;\n            if (numElem < mThreadCount)\n                mThreadCount = numElem;\n            CalDetection<<< (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount>>>\n                (inputs[i],output, numElem, yolo.width, yolo.height, (float 
*)mAnchor[i], mClassCount ,outputElem);\n        }\n\n    }\n\n\n    int YoloLayerPlugin::enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream)\n    {\n        //assert(batchSize == 1);\n        //GPU\n        //CUDA_CHECK(cudaStreamSynchronize(stream));\n        forwardGpu((const float *const *)inputs, (float*)outputs[0], stream, batchSize);\n\n        return 0;\n    }\n\n    PluginFieldCollection YoloPluginCreator::mFC{};\n    std::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\n    YoloPluginCreator::YoloPluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* YoloPluginCreator::getPluginName() const\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloPluginCreator::getPluginVersion() const\n    {\n        return \"1\";\n    }\n\n    const PluginFieldCollection* YoloPluginCreator::getFieldNames()\n    {\n        return &mFC;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc)\n    {\n        YoloLayerPlugin* obj = new YoloLayerPlugin();\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength)\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call YoloLayerPlugin::destroy()\n        YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n}\n"
  },
  {
    "path": "yolov4/yololayer.h",
    "content": "#ifndef _YOLO_LAYER_H\n#define _YOLO_LAYER_H\n\n#include <iostream>\n#include <vector>\n#include \"NvInfer.h\"\n\nnamespace Yolo\n{\n    static constexpr int CHECK_COUNT = 3;\n    static constexpr float IGNORE_THRESH = 0.1f;\n    static constexpr int MAX_OUTPUT_BBOX_COUNT = 1000;\n    static constexpr int CLASS_NUM = 80;\n    static constexpr int INPUT_H = 608;\n    static constexpr int INPUT_W = 608;\n\n    struct YoloKernel\n    {\n        int width;\n        int height;\n        float anchors[CHECK_COUNT*2];\n    };\n\n    static constexpr YoloKernel yolo1 = {\n        INPUT_W / 8,\n        INPUT_H / 8,\n        {12,16, 19,36, 40,28}\n    };\n    static constexpr YoloKernel yolo2 = {\n        INPUT_W / 16,\n        INPUT_H / 16,\n        {36,75, 76,55, 72,146}\n    };\n    static constexpr YoloKernel yolo3 = {\n        INPUT_W / 32,\n        INPUT_H / 32,\n        {142,110, 192,243, 459,401}\n    };\n\n    static constexpr int LOCATIONS = 4;\n    struct alignas(float) Detection{\n        //x y w h\n        float bbox[LOCATIONS];\n        float det_confidence;\n        float class_id;\n        float class_confidence;\n    };\n}\n\n\nnamespace nvinfer1\n{\n    class YoloLayerPlugin: public IPluginV2IOExt\n    {\n        public:\n            explicit YoloLayerPlugin();\n            YoloLayerPlugin(const void* data, size_t length);\n\n            ~YoloLayerPlugin();\n\n            int getNbOutputs() const override\n            {\n                return 1;\n            }\n\n            Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override;\n\n            int initialize() override;\n\n            virtual void terminate() override {};\n\n            virtual size_t getWorkspaceSize(int maxBatchSize) const override { return 0;}\n\n            virtual int enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream) override;\n\n            virtual size_t getSerializationSize() const 
override;\n\n            virtual void serialize(void* buffer) const override;\n\n            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const override {\n                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n            }\n\n            const char* getPluginType() const override;\n\n            const char* getPluginVersion() const override;\n\n            void destroy() override;\n\n            IPluginV2IOExt* clone() const override;\n\n            void setPluginNamespace(const char* pluginNamespace) override;\n\n            const char* getPluginNamespace() const override;\n\n            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const override;\n\n            bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const override;\n\n            bool canBroadcastInputAcrossBatch(int inputIndex) const override;\n\n            void attachToContext(\n                    cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) override;\n\n            void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) override;\n\n            void detachFromContext() override;\n\n        private:\n            void forwardGpu(const float *const * inputs,float * output, cudaStream_t stream,int batchSize = 1);\n            int mClassCount;\n            int mKernelCount;\n            std::vector<Yolo::YoloKernel> mYoloKernel;\n            int mThreadCount = 256;\n            void** mAnchor;\n            const char* mPluginNamespace;\n    };\n\n    class YoloPluginCreator : public IPluginCreator\n    {\n        public:\n            YoloPluginCreator();\n\n            ~YoloPluginCreator() override = default;\n\n            const char* getPluginName() const override;\n\n            const char* getPluginVersion() 
const override;\n\n            const PluginFieldCollection* getFieldNames() override;\n\n            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) override;\n\n            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;\n\n            void setPluginNamespace(const char* libNamespace) override\n            {\n                mNamespace = libNamespace;\n            }\n\n            const char* getPluginNamespace() const override\n            {\n                return mNamespace.c_str();\n            }\n\n        private:\n            std::string mNamespace;\n            static PluginFieldCollection mFC;\n            static std::vector<PluginField> mPluginAttributes;\n    };\n    REGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n};\n\n#endif \n"
  },
  {
    "path": "yolov4/yolov4.cpp",
    "content": "#include <fstream>\n#include <iostream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <chrono>\n#include <opencv2/opencv.hpp>\n#include <dirent.h>\n#include \"NvInfer.h\"\n#include \"utils.h\"\n#include \"cuda_runtime_api.h\"\n#include \"logging.h\"\n#include \"yololayer.h\"\n#include \"mish.h\"\n\n#define USE_FP16  // comment out this if want to use FP32\n#define DEVICE 0  // GPU id\n#define NMS_THRESH 0.4\n#define BBOX_CONF_THRESH 0.5\n#define BATCH_SIZE 1\n\nusing namespace nvinfer1;\n\n// stuff we know about the network and the input/output blobs\nstatic const int INPUT_H = Yolo::INPUT_H;\nstatic const int INPUT_W = Yolo::INPUT_W;\nstatic const int DETECTION_SIZE = sizeof(Yolo::Detection) / sizeof(float);\nstatic const int OUTPUT_SIZE = Yolo::MAX_OUTPUT_BBOX_COUNT * DETECTION_SIZE + 1;  // we assume the yololayer outputs no more than MAX_OUTPUT_BBOX_COUNT boxes that conf >= 0.1\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\nstatic Logger gLogger;\n\ncv::Mat preprocess_img(cv::Mat& img) {\n    int w, h, x, y;\n    float r_w = INPUT_W / (img.cols*1.0);\n    float r_h = INPUT_H / (img.rows*1.0);\n    if (r_h > r_w) {\n        w = INPUT_W;\n        h = r_w * img.rows;\n        x = 0;\n        y = (INPUT_H - h) / 2;\n    } else {\n        w = r_h* img.cols;\n        h = INPUT_H;\n        x = (INPUT_W - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size());\n    cv::Mat out(INPUT_H, INPUT_W, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    int l, r, t, b;\n    float r_w = INPUT_W / (img.cols * 1.0);\n    float r_h = INPUT_H / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] - bbox[2]/2.f;\n        r = bbox[0] + bbox[2]/2.f;\n        t = bbox[1] - bbox[3]/2.f - (INPUT_H - r_w * img.rows) / 2;\n        b = bbox[1] + 
bbox[3]/2.f - (INPUT_H - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - bbox[2]/2.f - (INPUT_W - r_h * img.cols) / 2;\n        r = bbox[0] + bbox[2]/2.f - (INPUT_W - r_h * img.cols) / 2;\n        t = bbox[1] - bbox[3]/2.f;\n        b = bbox[1] + bbox[3]/2.f;\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    return cv::Rect(l, t, r-l, b-t);\n}\n\nfloat iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n        std::max(lbox[0] - lbox[2]/2.f , rbox[0] - rbox[2]/2.f), //left\n        std::min(lbox[0] + lbox[2]/2.f , rbox[0] + rbox[2]/2.f), //right\n        std::max(lbox[1] - lbox[3]/2.f , rbox[1] - rbox[3]/2.f), //top\n        std::min(lbox[1] + lbox[3]/2.f , rbox[1] + rbox[3]/2.f), //bottom\n    };\n\n    if(interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS =(interBox[1]-interBox[0])*(interBox[3]-interBox[2]);\n    return interBoxS/(lbox[2]*lbox[3] + rbox[2]*rbox[3] -interBoxS);\n}\n\nbool cmp(const Yolo::Detection& a, const Yolo::Detection& b) {\n    return a.det_confidence > b.det_confidence;\n}\n\nvoid nms(std::vector<Yolo::Detection>& res, float *output, float nms_thresh = NMS_THRESH) {\n    std::map<float, std::vector<Yolo::Detection>> m;\n    for (int i = 0; i < output[0] && i < Yolo::MAX_OUTPUT_BBOX_COUNT; i++) {\n        if (output[1 + DETECTION_SIZE * i + 4] <= BBOX_CONF_THRESH) continue;\n        Yolo::Detection det;\n        memcpy(&det, &output[1 + DETECTION_SIZE * i], DETECTION_SIZE * sizeof(float));\n        if (m.count(det.class_id) == 0) m.emplace(det.class_id, std::vector<Yolo::Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        //std::cout << it->second[0].class_id << \" --- \" << std::endl;\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), 
cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin()+n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\n// TensorRT weight files have a simple space delimited format:\n// [name] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file.\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob (one 32-bit word per weight, so size the buffer by element, not by pointer)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nIScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + 
\".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{DataType::kFLOAT, scval, len};\n    \n    float *shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{DataType::kFLOAT, shval, len};\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{DataType::kFLOAT, pval, len};\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nILayer* convBnMish(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int p, int linx) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[\"module_list.\" + std::to_string(linx) + \".Conv2d.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"module_list.\" + std::to_string(linx) + \".BatchNorm2d\", 1e-4);\n\n    auto creator = getPluginRegistry()->getPluginCreator(\"Mish_TRT\", \"1\");\n    const PluginFieldCollection* pluginData = creator->getFieldNames();\n    IPluginV2 *pluginObj = creator->createPlugin((\"mish\" + std::to_string(linx)).c_str(), pluginData);\n    ITensor* inputTensors[] = {bn1->getOutput(0)};\n    auto mish = network->addPluginV2(&inputTensors[0], 1, *pluginObj);\n    return 
mish;\n}\n\nILayer* convBnLeaky(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int p, int linx) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ksize, ksize}, weightMap[\"module_list.\" + std::to_string(linx) + \".Conv2d.weight\"], emptywts);\n    assert(conv1);\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), \"module_list.\" + std::to_string(linx) + \".BatchNorm2d\", 1e-4);\n\n    auto lr = network->addActivation(*bn1->getOutput(0), ActivationType::kLEAKY_RELU);\n    lr->setAlpha(0.1);\n\n    return lr;\n}\n\n// Create the engine using only the API, without any parser.\nICudaEngine* createEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME\n    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});\n    assert(data);\n\n    std::map<std::string, Weights> weightMap = loadWeights(\"../yolov4.wts\");\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n\n    // Define each layer.\n    auto l0 = convBnMish(network, weightMap, *data, 32, 3, 1, 1, 0);\n    auto l1 = convBnMish(network, weightMap, *l0->getOutput(0), 64, 3, 2, 1, 1);\n    auto l2 = convBnMish(network, weightMap, *l1->getOutput(0), 64, 1, 1, 0, 2);\n    auto l3 = l1;\n    auto l4 = convBnMish(network, weightMap, *l3->getOutput(0), 64, 1, 1, 0, 4);\n    auto l5 = convBnMish(network, weightMap, *l4->getOutput(0), 32, 1, 1, 0, 5);\n    auto l6 = convBnMish(network, weightMap, *l5->getOutput(0), 64, 3, 1, 1, 6);\n    auto ew7 = network->addElementWise(*l6->getOutput(0), *l4->getOutput(0), ElementWiseOperation::kSUM);\n    auto l8 
= convBnMish(network, weightMap, *ew7->getOutput(0), 64, 1, 1, 0, 8);\n\n    ITensor* inputTensors9[] = {l8->getOutput(0), l2->getOutput(0)};\n    auto cat9 = network->addConcatenation(inputTensors9, 2);\n\n    auto l10 = convBnMish(network, weightMap, *cat9->getOutput(0), 64, 1, 1, 0, 10);\n    auto l11 = convBnMish(network, weightMap, *l10->getOutput(0), 128, 3, 2, 1, 11);\n    auto l12 = convBnMish(network, weightMap, *l11->getOutput(0), 64, 1, 1, 0, 12);\n    auto l13 = l11;\n    auto l14 = convBnMish(network, weightMap, *l13->getOutput(0), 64, 1, 1, 0, 14);\n    auto l15 = convBnMish(network, weightMap, *l14->getOutput(0), 64, 1, 1, 0, 15);\n    auto l16 = convBnMish(network, weightMap, *l15->getOutput(0), 64, 3, 1, 1, 16);\n    auto ew17 = network->addElementWise(*l16->getOutput(0), *l14->getOutput(0), ElementWiseOperation::kSUM);\n    auto l18 = convBnMish(network, weightMap, *ew17->getOutput(0), 64, 1, 1, 0, 18);\n    auto l19 = convBnMish(network, weightMap, *l18->getOutput(0), 64, 3, 1, 1, 19);\n    auto ew20 = network->addElementWise(*l19->getOutput(0), *ew17->getOutput(0), ElementWiseOperation::kSUM);\n    auto l21 = convBnMish(network, weightMap, *ew20->getOutput(0), 64, 1, 1, 0, 21);\n\n    ITensor* inputTensors22[] = {l21->getOutput(0), l12->getOutput(0)};\n    auto cat22 = network->addConcatenation(inputTensors22, 2);\n\n    auto l23 = convBnMish(network, weightMap, *cat22->getOutput(0), 128, 1, 1, 0, 23);\n    auto l24 = convBnMish(network, weightMap, *l23->getOutput(0), 256, 3, 2, 1, 24);\n    auto l25 = convBnMish(network, weightMap, *l24->getOutput(0), 128, 1, 1, 0, 25);\n    auto l26 = l24;\n    auto l27 = convBnMish(network, weightMap, *l26->getOutput(0), 128, 1, 1, 0, 27);\n    auto l28 = convBnMish(network, weightMap, *l27->getOutput(0), 128, 1, 1, 0, 28);\n    auto l29 = convBnMish(network, weightMap, *l28->getOutput(0), 128, 3, 1, 1, 29);\n    auto ew30 = network->addElementWise(*l29->getOutput(0), *l27->getOutput(0), 
ElementWiseOperation::kSUM);\n    auto l31 = convBnMish(network, weightMap, *ew30->getOutput(0), 128, 1, 1, 0, 31);\n    auto l32 = convBnMish(network, weightMap, *l31->getOutput(0), 128, 3, 1, 1, 32);\n    auto ew33 = network->addElementWise(*l32->getOutput(0), *ew30->getOutput(0), ElementWiseOperation::kSUM);\n    auto l34 = convBnMish(network, weightMap, *ew33->getOutput(0), 128, 1, 1, 0, 34);\n    auto l35 = convBnMish(network, weightMap, *l34->getOutput(0), 128, 3, 1, 1, 35);\n    auto ew36 = network->addElementWise(*l35->getOutput(0), *ew33->getOutput(0), ElementWiseOperation::kSUM);\n    auto l37 = convBnMish(network, weightMap, *ew36->getOutput(0), 128, 1, 1, 0, 37);\n    auto l38 = convBnMish(network, weightMap, *l37->getOutput(0), 128, 3, 1, 1, 38);\n    auto ew39 = network->addElementWise(*l38->getOutput(0), *ew36->getOutput(0), ElementWiseOperation::kSUM);\n    auto l40 = convBnMish(network, weightMap, *ew39->getOutput(0), 128, 1, 1, 0, 40);\n    auto l41 = convBnMish(network, weightMap, *l40->getOutput(0), 128, 3, 1, 1, 41);\n    auto ew42 = network->addElementWise(*l41->getOutput(0), *ew39->getOutput(0), ElementWiseOperation::kSUM);\n    auto l43 = convBnMish(network, weightMap, *ew42->getOutput(0), 128, 1, 1, 0, 43);\n    auto l44 = convBnMish(network, weightMap, *l43->getOutput(0), 128, 3, 1, 1, 44);\n    auto ew45 = network->addElementWise(*l44->getOutput(0), *ew42->getOutput(0), ElementWiseOperation::kSUM);\n    auto l46 = convBnMish(network, weightMap, *ew45->getOutput(0), 128, 1, 1, 0, 46);\n    auto l47 = convBnMish(network, weightMap, *l46->getOutput(0), 128, 3, 1, 1, 47);\n    auto ew48 = network->addElementWise(*l47->getOutput(0), *ew45->getOutput(0), ElementWiseOperation::kSUM);\n    auto l49 = convBnMish(network, weightMap, *ew48->getOutput(0), 128, 1, 1, 0, 49);\n    auto l50 = convBnMish(network, weightMap, *l49->getOutput(0), 128, 3, 1, 1, 50);\n    auto ew51 = network->addElementWise(*l50->getOutput(0), *ew48->getOutput(0), 
ElementWiseOperation::kSUM);\n    auto l52 = convBnMish(network, weightMap, *ew51->getOutput(0), 128, 1, 1, 0, 52);\n\n    ITensor* inputTensors53[] = {l52->getOutput(0), l25->getOutput(0)};\n    auto cat53 = network->addConcatenation(inputTensors53, 2);\n\n    auto l54 = convBnMish(network, weightMap, *cat53->getOutput(0), 256, 1, 1, 0, 54);\n    auto l55 = convBnMish(network, weightMap, *l54->getOutput(0), 512, 3, 2, 1, 55);\n    auto l56 = convBnMish(network, weightMap, *l55->getOutput(0), 256, 1, 1, 0, 56);\n    auto l57 = l55;\n    auto l58 = convBnMish(network, weightMap, *l57->getOutput(0), 256, 1, 1, 0, 58);\n    auto l59 = convBnMish(network, weightMap, *l58->getOutput(0), 256, 1, 1, 0, 59);\n    auto l60 = convBnMish(network, weightMap, *l59->getOutput(0), 256, 3, 1, 1, 60);\n    auto ew61 = network->addElementWise(*l60->getOutput(0), *l58->getOutput(0), ElementWiseOperation::kSUM);\n    auto l62 = convBnMish(network, weightMap, *ew61->getOutput(0), 256, 1, 1, 0, 62);\n    auto l63 = convBnMish(network, weightMap, *l62->getOutput(0), 256, 3, 1, 1, 63);\n    auto ew64 = network->addElementWise(*l63->getOutput(0), *ew61->getOutput(0), ElementWiseOperation::kSUM);\n    auto l65 = convBnMish(network, weightMap, *ew64->getOutput(0), 256, 1, 1, 0, 65);\n    auto l66 = convBnMish(network, weightMap, *l65->getOutput(0), 256, 3, 1, 1, 66);\n    auto ew67 = network->addElementWise(*l66->getOutput(0), *ew64->getOutput(0), ElementWiseOperation::kSUM);\n    auto l68 = convBnMish(network, weightMap, *ew67->getOutput(0), 256, 1, 1, 0, 68);\n    auto l69 = convBnMish(network, weightMap, *l68->getOutput(0), 256, 3, 1, 1, 69);\n    auto ew70 = network->addElementWise(*l69->getOutput(0), *ew67->getOutput(0), ElementWiseOperation::kSUM);\n    auto l71 = convBnMish(network, weightMap, *ew70->getOutput(0), 256, 1, 1, 0, 71);\n    auto l72 = convBnMish(network, weightMap, *l71->getOutput(0), 256, 3, 1, 1, 72);\n    auto ew73 = network->addElementWise(*l72->getOutput(0), 
*ew70->getOutput(0), ElementWiseOperation::kSUM);\n    auto l74 = convBnMish(network, weightMap, *ew73->getOutput(0), 256, 1, 1, 0, 74);\n    auto l75 = convBnMish(network, weightMap, *l74->getOutput(0), 256, 3, 1, 1, 75);\n    auto ew76 = network->addElementWise(*l75->getOutput(0), *ew73->getOutput(0), ElementWiseOperation::kSUM);\n    auto l77 = convBnMish(network, weightMap, *ew76->getOutput(0), 256, 1, 1, 0, 77);\n    auto l78 = convBnMish(network, weightMap, *l77->getOutput(0), 256, 3, 1, 1, 78);\n    auto ew79 = network->addElementWise(*l78->getOutput(0), *ew76->getOutput(0), ElementWiseOperation::kSUM);\n    auto l80 = convBnMish(network, weightMap, *ew79->getOutput(0), 256, 1, 1, 0, 80);\n    auto l81 = convBnMish(network, weightMap, *l80->getOutput(0), 256, 3, 1, 1, 81);\n    auto ew82 = network->addElementWise(*l81->getOutput(0), *ew79->getOutput(0), ElementWiseOperation::kSUM);\n    auto l83 = convBnMish(network, weightMap, *ew82->getOutput(0), 256, 1, 1, 0, 83);\n\n    ITensor* inputTensors84[] = {l83->getOutput(0), l56->getOutput(0)};\n    auto cat84 = network->addConcatenation(inputTensors84, 2);\n\n    auto l85 = convBnMish(network, weightMap, *cat84->getOutput(0), 512, 1, 1, 0, 85);\n    auto l86 = convBnMish(network, weightMap, *l85->getOutput(0), 1024, 3, 2, 1, 86);\n    auto l87 = convBnMish(network, weightMap, *l86->getOutput(0), 512, 1, 1, 0, 87);\n    auto l88 = l86;\n    auto l89 = convBnMish(network, weightMap, *l88->getOutput(0), 512, 1, 1, 0, 89);\n    auto l90 = convBnMish(network, weightMap, *l89->getOutput(0), 512, 1, 1, 0, 90);\n    auto l91 = convBnMish(network, weightMap, *l90->getOutput(0), 512, 3, 1, 1, 91);\n    auto ew92 = network->addElementWise(*l91->getOutput(0), *l89->getOutput(0), ElementWiseOperation::kSUM);\n    auto l93 = convBnMish(network, weightMap, *ew92->getOutput(0), 512, 1, 1, 0, 93);\n    auto l94 = convBnMish(network, weightMap, *l93->getOutput(0), 512, 3, 1, 1, 94);\n    auto ew95 = 
network->addElementWise(*l94->getOutput(0), *ew92->getOutput(0), ElementWiseOperation::kSUM);\n    auto l96 = convBnMish(network, weightMap, *ew95->getOutput(0), 512, 1, 1, 0, 96);\n    auto l97 = convBnMish(network, weightMap, *l96->getOutput(0), 512, 3, 1, 1, 97);\n    auto ew98 = network->addElementWise(*l97->getOutput(0), *ew95->getOutput(0), ElementWiseOperation::kSUM);\n    auto l99 = convBnMish(network, weightMap, *ew98->getOutput(0), 512, 1, 1, 0, 99);\n    auto l100 = convBnMish(network, weightMap, *l99->getOutput(0), 512, 3, 1, 1, 100);\n    auto ew101 = network->addElementWise(*l100->getOutput(0), *ew98->getOutput(0), ElementWiseOperation::kSUM);\n    auto l102 = convBnMish(network, weightMap, *ew101->getOutput(0), 512, 1, 1, 0, 102);\n\n    ITensor* inputTensors103[] = {l102->getOutput(0), l87->getOutput(0)};\n    auto cat103 = network->addConcatenation(inputTensors103, 2);\n\n    auto l104 = convBnMish(network, weightMap, *cat103->getOutput(0), 1024, 1, 1, 0, 104);\n\n    // ---------\n    auto l105 = convBnLeaky(network, weightMap, *l104->getOutput(0), 512, 1, 1, 0, 105);\n    auto l106 = convBnLeaky(network, weightMap, *l105->getOutput(0), 1024, 3, 1, 1, 106);\n    auto l107 = convBnLeaky(network, weightMap, *l106->getOutput(0), 512, 1, 1, 0, 107);\n\n    auto pool108 = network->addPoolingNd(*l107->getOutput(0), PoolingType::kMAX, DimsHW{5, 5});\n    pool108->setPaddingNd(DimsHW{2, 2});\n    pool108->setStrideNd(DimsHW{1, 1});\n\n    auto l109 = l107;\n\n    auto pool110 = network->addPoolingNd(*l109->getOutput(0), PoolingType::kMAX, DimsHW{9, 9});\n    pool110->setPaddingNd(DimsHW{4, 4});\n    pool110->setStrideNd(DimsHW{1, 1});\n\n    auto l111 = l107;\n\n    auto pool112 = network->addPoolingNd(*l111->getOutput(0), PoolingType::kMAX, DimsHW{13, 13});\n    pool112->setPaddingNd(DimsHW{6, 6});\n    pool112->setStrideNd(DimsHW{1, 1});\n\n    ITensor* inputTensors113[] = {pool112->getOutput(0), pool110->getOutput(0), pool108->getOutput(0), 
l107->getOutput(0)};\n    auto cat113 = network->addConcatenation(inputTensors113, 4);\n\n    auto l114 = convBnLeaky(network, weightMap, *cat113->getOutput(0), 512, 1, 1, 0, 114);\n    auto l115 = convBnLeaky(network, weightMap, *l114->getOutput(0), 1024, 3, 1, 1, 115);\n    auto l116 = convBnLeaky(network, weightMap, *l115->getOutput(0), 512, 1, 1, 0, 116);\n    auto l117 = convBnLeaky(network, weightMap, *l116->getOutput(0), 256, 1, 1, 0, 117);\n\n    float *deval = reinterpret_cast<float*>(malloc(sizeof(float) * 256 * 2 * 2));\n    for (int i = 0; i < 256 * 2 * 2; i++) {\n        deval[i] = 1.0;\n    }\n    Weights deconvwts118{DataType::kFLOAT, deval, 256 * 2 * 2};\n    IDeconvolutionLayer* deconv118 = network->addDeconvolutionNd(*l117->getOutput(0), 256, DimsHW{2, 2}, deconvwts118, emptywts);\n    assert(deconv118);\n    deconv118->setStrideNd(DimsHW{2, 2});\n    deconv118->setNbGroups(256);\n    weightMap[\"deconv118\"] = deconvwts118;\n\n    auto l119 = l85;\n    auto l120 = convBnLeaky(network, weightMap, *l119->getOutput(0), 256, 1, 1, 0, 120);\n\n    ITensor* inputTensors121[] = {l120->getOutput(0), deconv118->getOutput(0)};\n    auto cat121 = network->addConcatenation(inputTensors121, 2);\n\n    auto l122 = convBnLeaky(network, weightMap, *cat121->getOutput(0), 256, 1, 1, 0, 122);\n    auto l123 = convBnLeaky(network, weightMap, *l122->getOutput(0), 512, 3, 1, 1, 123);\n    auto l124 = convBnLeaky(network, weightMap, *l123->getOutput(0), 256, 1, 1, 0, 124);\n    auto l125 = convBnLeaky(network, weightMap, *l124->getOutput(0), 512, 3, 1, 1, 125);\n    auto l126 = convBnLeaky(network, weightMap, *l125->getOutput(0), 256, 1, 1, 0, 126);\n    auto l127 = convBnLeaky(network, weightMap, *l126->getOutput(0), 128, 1, 1, 0, 127);\n\n    Weights deconvwts128{DataType::kFLOAT, deval, 128 * 2 * 2};\n    IDeconvolutionLayer* deconv128 = network->addDeconvolutionNd(*l127->getOutput(0), 128, DimsHW{2, 2}, deconvwts128, emptywts);\n    assert(deconv128);\n    
deconv128->setStrideNd(DimsHW{2, 2});\n    deconv128->setNbGroups(128);\n\n    auto l129 = l54;\n    auto l130 = convBnLeaky(network, weightMap, *l129->getOutput(0), 128, 1, 1, 0, 130);\n\n    ITensor* inputTensors131[] = {l130->getOutput(0), deconv128->getOutput(0)};\n    auto cat131 = network->addConcatenation(inputTensors131, 2);\n\n    auto l132 = convBnLeaky(network, weightMap, *cat131->getOutput(0), 128, 1, 1, 0, 132);\n    auto l133 = convBnLeaky(network, weightMap, *l132->getOutput(0), 256, 3, 1, 1, 133);\n    auto l134 = convBnLeaky(network, weightMap, *l133->getOutput(0), 128, 1, 1, 0, 134);\n    auto l135 = convBnLeaky(network, weightMap, *l134->getOutput(0), 256, 3, 1, 1, 135);\n    auto l136 = convBnLeaky(network, weightMap, *l135->getOutput(0), 128, 1, 1, 0, 136);\n    auto l137 = convBnLeaky(network, weightMap, *l136->getOutput(0), 256, 3, 1, 1, 137);\n    IConvolutionLayer* conv138 = network->addConvolutionNd(*l137->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.138.Conv2d.weight\"], weightMap[\"module_list.138.Conv2d.bias\"]);\n    assert(conv138);\n    // 139 is yolo layer\n\n    auto l140 = l136;\n    auto l141 = convBnLeaky(network, weightMap, *l140->getOutput(0), 256, 3, 2, 1, 141);\n\n    ITensor* inputTensors142[] = {l141->getOutput(0), l126->getOutput(0)};\n    auto cat142 = network->addConcatenation(inputTensors142, 2);\n\n    auto l143 = convBnLeaky(network, weightMap, *cat142->getOutput(0), 256, 1, 1, 0, 143);\n    auto l144 = convBnLeaky(network, weightMap, *l143->getOutput(0), 512, 3, 1, 1, 144);\n    auto l145 = convBnLeaky(network, weightMap, *l144->getOutput(0), 256, 1, 1, 0, 145);\n    auto l146 = convBnLeaky(network, weightMap, *l145->getOutput(0), 512, 3, 1, 1, 146);\n    auto l147 = convBnLeaky(network, weightMap, *l146->getOutput(0), 256, 1, 1, 0, 147);\n    auto l148 = convBnLeaky(network, weightMap, *l147->getOutput(0), 512, 3, 1, 1, 148);\n    IConvolutionLayer* conv149 = 
network->addConvolutionNd(*l148->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.149.Conv2d.weight\"], weightMap[\"module_list.149.Conv2d.bias\"]);\n    assert(conv149);\n    // 150 is yolo layer\n\n    auto l151 = l147;\n    auto l152 = convBnLeaky(network, weightMap, *l151->getOutput(0), 512, 3, 2, 1, 152);\n\n    ITensor* inputTensors153[] = {l152->getOutput(0), l116->getOutput(0)};\n    auto cat153 = network->addConcatenation(inputTensors153, 2);\n\n    auto l154 = convBnLeaky(network, weightMap, *cat153->getOutput(0), 512, 1, 1, 0, 154);\n    auto l155 = convBnLeaky(network, weightMap, *l154->getOutput(0), 1024, 3, 1, 1, 155);\n    auto l156 = convBnLeaky(network, weightMap, *l155->getOutput(0), 512, 1, 1, 0, 156);\n    auto l157 = convBnLeaky(network, weightMap, *l156->getOutput(0), 1024, 3, 1, 1, 157);\n    auto l158 = convBnLeaky(network, weightMap, *l157->getOutput(0), 512, 1, 1, 0, 158);\n    auto l159 = convBnLeaky(network, weightMap, *l158->getOutput(0), 1024, 3, 1, 1, 159);\n    IConvolutionLayer* conv160 = network->addConvolutionNd(*l159->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap[\"module_list.160.Conv2d.weight\"], weightMap[\"module_list.160.Conv2d.bias\"]);\n    assert(conv160);\n    // 161 is yolo layer\n\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    const PluginFieldCollection* pluginData = creator->getFieldNames();\n    IPluginV2 *pluginObj = creator->createPlugin(\"yololayer\", pluginData);\n    ITensor* inputTensors_yolo[] = {conv138->getOutput(0), conv149->getOutput(0), conv160->getOutput(0)};\n    auto yolo = network->addPluginV2(inputTensors_yolo, 3, *pluginObj);\n\n    yolo->getOutput(0)->setName(OUTPUT_BLOB_NAME);\n    network->markOutput(*yolo->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#ifdef USE_FP16\n    
config->setFlag(BuilderFlag::kFP16);\n#endif\n    std::cout << \"Building tensorrt engine, please wait for a while...\" << std::endl;\n    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    network->destroy();\n\n    // Release host memory\n    for (auto& mem : weightMap)\n    {\n        free((void*) (mem.second.values));\n    }\n\n    return engine;\n}\n\nvoid APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    ICudaEngine* engine = createEngine(maxBatchSize, builder, config, DataType::kFLOAT);\n    assert(engine != nullptr);\n\n    // Serialize the engine\n    (*modelStream) = engine->serialize();\n\n    // Close everything down\n    engine->destroy();\n    builder->destroy();\n    config->destroy();\n}\n\nvoid doInference(IExecutionContext& context, float* input, float* output, int batchSize) {\n    const ICudaEngine& engine = context.getEngine();\n\n    // Pointers to input and output device buffers to pass to engine.\n    // Engine requires exactly IEngine::getNbBindings() number of buffers.\n    assert(engine.getNbBindings() == 2);\n    void* buffers[2];\n\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);\n    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);\n\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));\n    CUDA_CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * 
sizeof(float)));\n\n    // Create stream\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\n    CUDA_CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\n    cudaStreamSynchronize(stream);\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(buffers[inputIndex]));\n    CUDA_CHECK(cudaFree(buffers[outputIndex]));\n}\n\nint read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n                strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(DEVICE);\n    // create a model using the API directly and serialize it to a stream\n    char *trtModelStream{nullptr};\n    size_t size{0};\n\n    if (argc == 2 && std::string(argv[1]) == \"-s\") {\n        IHostMemory* modelStream{nullptr};\n        APIToModel(BATCH_SIZE, &modelStream);\n        assert(modelStream != nullptr);\n        std::ofstream p(\"yolov4.engine\", std::ios::binary);\n        if (!p) {\n            std::cerr << \"could not open plan output file\" << std::endl;\n  
          return -1;\n        }\n        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());\n        modelStream->destroy();\n        return 0;\n    } else if (argc == 3 && std::string(argv[1]) == \"-d\") {\n        std::ifstream file(\"yolov4.engine\", std::ios::binary);\n        if (file.good()) {\n            file.seekg(0, file.end);\n            size = file.tellg();\n            file.seekg(0, file.beg);\n            trtModelStream = new char[size];\n            assert(trtModelStream);\n            file.read(trtModelStream, size);\n            file.close();\n        }\n    } else {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./yolov4 -s  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./yolov4 -d ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(argv[2], file_names) < 0) {\n        std::cout << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // prepare input data ---------------------------\n    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\n    //    data[i] = 1.0;\n    static float prob[BATCH_SIZE * OUTPUT_SIZE];\n    IRuntime* runtime = createInferRuntime(gLogger);\n    assert(runtime != nullptr);\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\n    assert(engine != nullptr);\n    IExecutionContext* context = engine->createExecutionContext();\n    assert(context != nullptr);\n    delete[] trtModelStream;\n\n    int fcount = 0;\n    for (int f = 0; f < (int)file_names.size(); f++) {\n        fcount++;\n        if (fcount < BATCH_SIZE && f + 1 != (int)file_names.size()) continue;\n        for (int b = 0; b < fcount; b++) {\n            cv::Mat img = cv::imread(std::string(argv[2]) + \"/\" + file_names[f - fcount + 1 + b]);\n  
          if (img.empty()) continue;\n            cv::Mat pr_img = preprocess_img(img);\n            for (int i = 0; i < INPUT_H * INPUT_W; i++) {\n                data[b * 3 * INPUT_H * INPUT_W + i] = pr_img.at<cv::Vec3b>(i)[2] / 255.0;\n                data[b * 3 * INPUT_H * INPUT_W + i + INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[1] / 255.0;\n                data[b * 3 * INPUT_H * INPUT_W + i + 2 * INPUT_H * INPUT_W] = pr_img.at<cv::Vec3b>(i)[0] / 255.0;\n            }\n        }\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        doInference(*context, data, prob, BATCH_SIZE);\n        auto end = std::chrono::system_clock::now();\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n        std::vector<std::vector<Yolo::Detection>> batch_res(fcount);\n        for (int b = 0; b < fcount; b++) {\n            auto& res = batch_res[b];\n            nms(res, &prob[b * OUTPUT_SIZE]);\n        }\n        for (int b = 0; b < fcount; b++) {\n            auto& res = batch_res[b];\n            //std::cout << res.size() << std::endl;\n            cv::Mat img = cv::imread(std::string(argv[2]) + \"/\" + file_names[f - fcount + 1 + b]);\n            for (size_t j = 0; j < res.size(); j++) {\n                //float *p = (float*)&res[j];\n                //for (size_t k = 0; k < 7; k++) {\n                //    std::cout << p[k] << \", \";\n                //}\n                //std::cout << std::endl;\n                cv::Rect r = get_rect(img, res[j].bbox);\n                cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n                cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n            }\n            cv::imwrite(\"_\" + file_names[f - fcount + 1 + b], img);\n        }\n        fcount = 0;\n    }\n\n    // Destroy the engine\n    context->destroy();\n    
engine->destroy();\n    runtime->destroy();\n\n    //Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << i / 10 << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov5/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(yolov5)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\n# TODO(Call for PR): make cmake compatible with Windows\nset(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)\nenable_language(CUDA)\n\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\n# TODO(Call for PR): make TRT path configurable from command line\ninclude_directories(/home/nvidia/TensorRT-8.2.5.1/include/)\nlink_directories(/home/nvidia/TensorRT-8.2.5.1/lib/)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/src/)\ninclude_directories(${PROJECT_SOURCE_DIR}/plugin/)\nfile(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/src/*.cpp ${PROJECT_SOURCE_DIR}/src/*.cu)\nfile(GLOB_RECURSE PLUGIN_SRCS ${PROJECT_SOURCE_DIR}/plugin/*.cu)\n\nadd_library(myplugins SHARED ${PLUGIN_SRCS})\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nadd_executable(yolov5_det yolov5_det.cpp ${SRCS})\ntarget_link_libraries(yolov5_det nvinfer)\ntarget_link_libraries(yolov5_det cudart)\ntarget_link_libraries(yolov5_det myplugins)\ntarget_link_libraries(yolov5_det ${OpenCV_LIBS})\n\nadd_executable(yolov5_cls yolov5_cls.cpp ${SRCS})\ntarget_link_libraries(yolov5_cls nvinfer)\ntarget_link_libraries(yolov5_cls cudart)\ntarget_link_libraries(yolov5_cls myplugins)\ntarget_link_libraries(yolov5_cls ${OpenCV_LIBS})\n\nadd_executable(yolov5_seg yolov5_seg.cpp ${SRCS})\ntarget_link_libraries(yolov5_seg nvinfer)\ntarget_link_libraries(yolov5_seg cudart)\ntarget_link_libraries(yolov5_seg myplugins)\ntarget_link_libraries(yolov5_seg ${OpenCV_LIBS})\n\n"
  },
  {
    "path": "yolov5/README.md",
    "content": "# YOLOv5\n\nTensorRTx inference code base for [ultralytics/yolov5](https://github.com/ultralytics/yolov5).\n\n## Contributors\n\n<a href=\"https://github.com/wang-xinyu\"><img src=\"https://avatars.githubusercontent.com/u/15235574?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/BaofengZan\"><img src=\"https://avatars.githubusercontent.com/u/20653176?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/upczww\"><img src=\"https://avatars.githubusercontent.com/u/16224249?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/cesarandreslopez\"><img src=\"https://avatars.githubusercontent.com/u/14029177?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/makaveli10\"><img src=\"https://avatars.githubusercontent.com/u/39617050?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/priteshgohil\"><img src=\"https://avatars.githubusercontent.com/u/43172056?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/rymzt\"><img src=\"https://avatars.githubusercontent.com/u/3270954?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/AsakusaRinne\"><img src=\"https://avatars.githubusercontent.com/u/47343601?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/freedenS\"><img src=\"https://avatars.githubusercontent.com/u/26213470?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/smarttowel\"><img src=\"https://avatars.githubusercontent.com/u/1128528?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/wwqgtxx\"><img src=\"https://avatars.githubusercontent.com/u/582584?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/adujardin\"><img src=\"https://avatars.githubusercontent.com/u/12609780?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/jow905\"><img src=\"https://avatars.githubusercontent.com/u/19189198?s=48&v=4\" 
width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/CristiFati\"><img src=\"https://avatars.githubusercontent.com/u/29705787?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/HaiyangPeng\"><img src=\"https://avatars.githubusercontent.com/u/46739135?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/Armassarion\"><img src=\"https://avatars.githubusercontent.com/u/33727511?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/xupengao\"><img src=\"https://avatars.githubusercontent.com/u/51817015?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/liuqi123123\"><img src=\"https://avatars.githubusercontent.com/u/46275888?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/ASONG0506\"><img src=\"https://avatars.githubusercontent.com/u/26050577?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/bobo0810\"><img src=\"https://avatars.githubusercontent.com/u/26057879?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/Silmeria112\"><img src=\"https://avatars.githubusercontent.com/u/16464837?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/LW-SCU\"><img src=\"https://avatars.githubusercontent.com/u/28128257?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/AdanWang\"><img src=\"https://avatars.githubusercontent.com/u/32757980?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/triple-Mu\"><img src=\"https://avatars.githubusercontent.com/u/92794867?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/xiang-wuu\"><img src=\"https://avatars.githubusercontent.com/u/107029401?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/uyolo1314\"><img src=\"https://avatars.githubusercontent.com/u/101853326?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/Rex-LK\"><img 
src=\"https://avatars.githubusercontent.com/u/74702576?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/PrinceP\"><img src=\"https://avatars.githubusercontent.com/u/10251537?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/hky3535\"><img src=\"https://avatars.githubusercontent.com/u/126926285?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/CharlesHuan\"><img src=\"https://avatars.githubusercontent.com/u/47875698?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n\n## Different versions of yolov5\n\nCurrently, we support yolov5 v1.0, v2.0, v3.0, v3.1, v4.0, v5.0, v6.0, v6.2, v7.0\n\n- For yolov5 v7.0, download .pt from [yolov5 release v7.0](https://github.com/ultralytics/yolov5/releases/tag/v7.0), `git clone -b v7.0 https://github.com/ultralytics/yolov5.git` and `git clone -b yolov5-v7.0 https://github.com/wang-xinyu/tensorrtx.git`, then follow how-to-run in [tensorrtx/yolov5-v7.0](https://github.com/wang-xinyu/tensorrtx/tree/yolov5-v7.0/yolov5)\n- For yolov5 v6.2, download .pt from [yolov5 release v6.2](https://github.com/ultralytics/yolov5/releases/tag/v6.2), `git clone -b v6.2 https://github.com/ultralytics/yolov5.git` and `git clone -b yolov5-v6.2 https://github.com/wang-xinyu/tensorrtx.git`, then follow how-to-run in [tensorrtx/yolov5-v6.2](https://github.com/wang-xinyu/tensorrtx/tree/yolov5-v6.2/yolov5)\n- For yolov5 v6.0, download .pt from [yolov5 release v6.0](https://github.com/ultralytics/yolov5/releases/tag/v6.0), `git clone -b v6.0 https://github.com/ultralytics/yolov5.git` and `git clone -b yolov5-v6.0 https://github.com/wang-xinyu/tensorrtx.git`, then follow how-to-run in [tensorrtx/yolov5-v6.0](https://github.com/wang-xinyu/tensorrtx/tree/yolov5-v6.0/yolov5).\n- For yolov5 v5.0, download .pt from [yolov5 release v5.0](https://github.com/ultralytics/yolov5/releases/tag/v5.0), `git clone -b v5.0 https://github.com/ultralytics/yolov5.git` and `git clone -b yolov5-v5.0 
https://github.com/wang-xinyu/tensorrtx.git`, then follow how-to-run in [tensorrtx/yolov5-v5.0](https://github.com/wang-xinyu/tensorrtx/tree/yolov5-v5.0/yolov5).\n- For yolov5 v4.0, download .pt from [yolov5 release v4.0](https://github.com/ultralytics/yolov5/releases/tag/v4.0), `git clone -b v4.0 https://github.com/ultralytics/yolov5.git` and `git clone -b yolov5-v4.0 https://github.com/wang-xinyu/tensorrtx.git`, then follow how-to-run in [tensorrtx/yolov5-v4.0](https://github.com/wang-xinyu/tensorrtx/tree/yolov5-v4.0/yolov5).\n- For yolov5 v3.1, download .pt from [yolov5 release v3.1](https://github.com/ultralytics/yolov5/releases/tag/v3.1), `git clone -b v3.1 https://github.com/ultralytics/yolov5.git` and `git clone -b yolov5-v3.1 https://github.com/wang-xinyu/tensorrtx.git`, then follow how-to-run in [tensorrtx/yolov5-v3.1](https://github.com/wang-xinyu/tensorrtx/tree/yolov5-v3.1/yolov5).\n- For yolov5 v3.0, download .pt from [yolov5 release v3.0](https://github.com/ultralytics/yolov5/releases/tag/v3.0), `git clone -b v3.0 https://github.com/ultralytics/yolov5.git` and `git clone -b yolov5-v3.0 https://github.com/wang-xinyu/tensorrtx.git`, then follow how-to-run in [tensorrtx/yolov5-v3.0](https://github.com/wang-xinyu/tensorrtx/tree/yolov5-v3.0/yolov5).\n- For yolov5 v2.0, download .pt from [yolov5 release v2.0](https://github.com/ultralytics/yolov5/releases/tag/v2.0), `git clone -b v2.0 https://github.com/ultralytics/yolov5.git` and `git clone -b yolov5-v2.0 https://github.com/wang-xinyu/tensorrtx.git`, then follow how-to-run in [tensorrtx/yolov5-v2.0](https://github.com/wang-xinyu/tensorrtx/tree/yolov5-v2.0/yolov5).\n- For yolov5 v1.0, download .pt from [yolov5 release v1.0](https://github.com/ultralytics/yolov5/releases/tag/v1.0), `git clone -b v1.0 https://github.com/ultralytics/yolov5.git` and `git clone -b yolov5-v1.0 https://github.com/wang-xinyu/tensorrtx.git`, then follow how-to-run in 
[tensorrtx/yolov5-v1.0](https://github.com/wang-xinyu/tensorrtx/tree/yolov5-v1.0/yolov5).\n\n## Config\n\n- Choose the YOLOv5 sub-model n/s/m/l/x/n6/s6/m6/l6/x6 from command line arguments.\n- For other configs, see [src/config.h](src/config.h).\n\n## Build and Run\n\n### Detection\n\n1. Generate .wts from the PyTorch .pt file, or download .wts from the model zoo\n\n```\ngit clone -b v7.0 https://github.com/ultralytics/yolov5.git\ngit clone -b yolov5-v7.0 https://github.com/wang-xinyu/tensorrtx.git\ncd yolov5/\nwget https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5s.pt\ncp [PATH-TO-TENSORRTX]/yolov5/gen_wts.py .\npython gen_wts.py -w yolov5s.pt -o yolov5s.wts\n# A file 'yolov5s.wts' will be generated.\n```\n\n2. Build tensorrtx/yolov5 and run\n\n```\ncd [PATH-TO-TENSORRTX]/yolov5/\n# Update kNumClass in src/config.h if your model is trained on a custom dataset\nmkdir build\ncd build\ncp [PATH-TO-ultralytics-yolov5]/yolov5s.wts .\ncmake ..\nmake\n\n./yolov5_det -s [.wts] [.engine] [n/s/m/l/x/n6/s6/m6/l6/x6 or c/c6 gd gw]  // serialize model to plan file\n./yolov5_det -d [.engine] [image folder]  // deserialize and run inference, the images in [image folder] will be processed.\n\n# For example, yolov5s\n./yolov5_det -s yolov5s.wts yolov5s.engine s\n./yolov5_det -d yolov5s.engine ../images\n\n# For example, a custom model with depth_multiple=0.17, width_multiple=0.25 in yolov5.yaml\n./yolov5_det -s yolov5_custom.wts yolov5.engine c 0.17 0.25\n./yolov5_det -d yolov5.engine ../images\n```\n\n3. Check the generated images, _zidane.jpg and _bus.jpg\n\n4. 
Optional: load and run the TensorRT model in Python\n\n```\n# Install python-tensorrt, pycuda, etc.\n# Ensure yolov5s.engine and libmyplugins.so have been built\npython yolov5_det_trt.py\n\n# Another version of the Python script, which uses CUDA Python instead of pycuda.\npython yolov5_det_trt_cuda_python.py\n```\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/78247927-4d9fac00-751e-11ea-8b1b-704a0aeb3fcf.jpg\" height=\"360px;\">\n</p>\n\n### Classification\n\n```\n# Download ImageNet labels\nwget https://github.com/joannzhang00/ImageNet-dataset-classes-labels/blob/main/imagenet_classes.txt\n\n# Build and serialize TensorRT engine\n./yolov5_cls -s yolov5s-cls.wts yolov5s-cls.engine s\n\n# Run inference\n./yolov5_cls -d yolov5s-cls.engine ../images\n```\n\n### Instance Segmentation\n\n```\n# Build and serialize TensorRT engine\n./yolov5_seg -s yolov5s-seg.wts yolov5s-seg.engine s\n\n# Download the labels file\nwget -O coco.txt https://raw.githubusercontent.com/amikelive/coco-labels/master/coco-labels-2014_2017.txt\n\n# Run inference with labels file\n./yolov5_seg -d yolov5s-seg.engine ../images coco.txt\n```\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/10251537/211291625-1b912483-b6a6-4e92-80c1-434d165b6776.jpg\" height=\"360px;\">\n</p>\n\n## INT8 Quantization\n\n1. Prepare calibration images. You can randomly select roughly 1000 images from your training set. For COCO, you can also download my calibration images `coco_calib` from [GoogleDrive](https://drive.google.com/drive/folders/1s7jE9DtOngZMzJC1uL307J2MiaGwdRSI?usp=sharing) or [BaiduPan](https://pan.baidu.com/s/1GOm_-JobpyLMAqZWCDUhKg) (pwd: a9wh).\n\n2. Unzip it in yolov5/build.\n\n3. Set the macro `USE_INT8` in src/config.h and run make.\n\n4. Serialize the model and test.\n\n\n## More Information\n\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx).\n\n"
  },
  {
    "path": "yolov5/gen_wts.py",
    "content": "import argparse\nimport os\nimport struct\nimport torch\nfrom utils.torch_utils import select_device\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\n    parser.add_argument('-w', '--weights', required=True,\n                        help='Input weights (.pt) file path (required)')\n    parser.add_argument(\n        '-o', '--output', help='Output (.wts) file path (optional)')\n    parser.add_argument(\n        '-t', '--type', type=str, default='detect', choices=['detect', 'cls', 'seg'],\n        help='model type: detection, classification, or segmentation')\n    args = parser.parse_args()\n    if not os.path.isfile(args.weights):\n        raise SystemExit('Invalid input file')\n    if not args.output:\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\n    elif os.path.isdir(args.output):\n        args.output = os.path.join(\n            args.output,\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\n    return args.weights, args.output, args.type\n\n\npt_file, wts_file, m_type = parse_args()\nprint(f'Generating .wts for {m_type} model')\n\n# Load model\nprint(f'Loading {pt_file}')\ndevice = select_device('cpu')\nmodel = torch.load(pt_file, map_location=device, weights_only=False)  # Load FP32 weights\nmodel = model['ema' if model.get('ema') else 'model'].float()\n\nif m_type in ['detect', 'seg']:\n    # update anchor_grid info\n    anchor_grid = model.model[-1].anchors * model.model[-1].stride[..., None, None]\n    # model.model[-1].anchor_grid = anchor_grid\n    delattr(model.model[-1], 'anchor_grid')  # model.model[-1] is the detect layer\n    # Buffers registered through \"register_buffer\" are stored in the state_dict (an OrderedDict), so they are written to the .wts file below.\n    model.model[-1].register_buffer(\"anchor_grid\", anchor_grid)\n    model.model[-1].register_buffer(\"strides\", model.model[-1].stride)\n\nmodel.to(device).eval()\n\nprint(f'Writing into {wts_file}')\nwith 
open(wts_file, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolov5/plugin/yololayer.cu",
    "content": "#include \"yololayer.h\"\n#include \"cuda_utils.h\"\n\n#include <cassert>\n#include <vector>\n#include <iostream>\n\nnamespace Tn {\ntemplate<typename T> \nvoid write(char*& buffer, const T& val) {\n  *reinterpret_cast<T*>(buffer) = val;\n  buffer += sizeof(T);\n}\n\ntemplate<typename T> \nvoid read(const char*& buffer, T& val) {\n  val = *reinterpret_cast<const T*>(buffer);\n  buffer += sizeof(T);\n}\n}\n\nnamespace nvinfer1 {\nYoloLayerPlugin::YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, bool is_segmentation, const std::vector<YoloKernel>& vYoloKernel) {\n  mClassCount = classCount;\n  mYoloV5NetWidth = netWidth;\n  mYoloV5NetHeight = netHeight;\n  mMaxOutObject = maxOut;\n  is_segmentation_ = is_segmentation;\n  mYoloKernel = vYoloKernel;\n  mKernelCount = vYoloKernel.size();\n\n  CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n  size_t AnchorLen = sizeof(float)* kNumAnchor * 2;\n  for (int ii = 0; ii < mKernelCount; ii++) {\n    CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));\n    const auto& yolo = mYoloKernel[ii];\n    CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n  }\n}\n\nYoloLayerPlugin::~YoloLayerPlugin() {\n  for (int ii = 0; ii < mKernelCount; ii++) {\n    CUDA_CHECK(cudaFree(mAnchor[ii]));\n  }\n  CUDA_CHECK(cudaFreeHost(mAnchor));\n}\n\n// create the plugin at runtime from a byte stream\nYoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length) {\n  using namespace Tn;\n  const char *d = reinterpret_cast<const char *>(data), *a = d;\n  read(d, mClassCount);\n  read(d, mThreadCount);\n  read(d, mKernelCount);\n  read(d, mYoloV5NetWidth);\n  read(d, mYoloV5NetHeight);\n  read(d, mMaxOutObject);\n  read(d, is_segmentation_);\n  mYoloKernel.resize(mKernelCount);\n  auto kernelSize = mKernelCount * sizeof(YoloKernel);\n  memcpy(mYoloKernel.data(), d, kernelSize);\n  d += kernelSize;\n  CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * 
sizeof(void*)));\n  size_t AnchorLen = sizeof(float)* kNumAnchor * 2;\n  for (int ii = 0; ii < mKernelCount; ii++) {\n    CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));\n    const auto& yolo = mYoloKernel[ii];\n    CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n  }\n  assert(d == a + length);\n}\n\nvoid YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT {\n  using namespace Tn;\n  char* d = static_cast<char*>(buffer), *a = d;\n  write(d, mClassCount);\n  write(d, mThreadCount);\n  write(d, mKernelCount);\n  write(d, mYoloV5NetWidth);\n  write(d, mYoloV5NetHeight);\n  write(d, mMaxOutObject);\n  write(d, is_segmentation_);\n  auto kernelSize = mKernelCount * sizeof(YoloKernel);\n  memcpy(d, mYoloKernel.data(), kernelSize);\n  d += kernelSize;\n\n  assert(d == a + getSerializationSize());\n}\n\nsize_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT {\n  size_t s = sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount);\n  s += sizeof(YoloKernel) * mYoloKernel.size();\n  s += sizeof(mYoloV5NetWidth) + sizeof(mYoloV5NetHeight);\n  s += sizeof(mMaxOutObject) + sizeof(is_segmentation_);\n  return s;\n}\n\nint YoloLayerPlugin::initialize() TRT_NOEXCEPT {\n  return 0;\n}\n\nDims YoloLayerPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT {\n  //output the result to channel\n  int totalsize = mMaxOutObject * sizeof(Detection) / sizeof(float);\n  return Dims3(totalsize + 1, 1, 1);\n}\n\n// Set plugin namespace\nvoid YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT {\n  mPluginNamespace = pluginNamespace;\n}\n\nconst char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT {\n  return mPluginNamespace;\n}\n\n// Return the DataType of the plugin output at the requested index\nDataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT {\n  return 
DataType::kFLOAT;\n}\n\n// Return true if output tensor is broadcast across a batch.\nbool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT {\n  return false;\n}\n\n// Return true if plugin can use input that is broadcast across batch without replication.\nbool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT {\n  return false;\n}\n\nvoid YoloLayerPlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT {}\n\n// Attach the plugin object to an execution context and grant the plugin the access to some context resource.\nvoid YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT {}\n\n// Detach the plugin object from its execution context.\nvoid YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\nconst char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT {\n  return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT {\n  return \"1\";\n}\n\nvoid YoloLayerPlugin::destroy() TRT_NOEXCEPT {\n  delete this;\n}\n\n// Clone the plugin\nIPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT {\n  YoloLayerPlugin* p = new YoloLayerPlugin(mClassCount, mYoloV5NetWidth, mYoloV5NetHeight, mMaxOutObject, is_segmentation_, mYoloKernel);\n  p->setPluginNamespace(mPluginNamespace);\n  return p;\n}\n\n__device__ float Logist(float data) { return 1.0f / (1.0f + expf(-data)); };\n\n__global__ void CalDetection(const float *input, float *output, int noElements,\n    const int netwidth, const int netheight, int maxoutobject, int yoloWidth,\n    int yoloHeight, const float anchors[kNumAnchor * 2], int classes, int outputElem, bool is_segmentation) {\n\n  int idx = threadIdx.x + blockDim.x * blockIdx.x;\n  if (idx >= noElements) return;\n\n  int total_grid = yoloWidth * yoloHeight;\n  int 
bnIdx = idx / total_grid;\n  idx = idx - total_grid * bnIdx;\n  int info_len_i = 5 + classes;\n  if (is_segmentation) info_len_i += 32;\n  const float* curInput = input + bnIdx * (info_len_i * total_grid * kNumAnchor);\n\n  for (int k = 0; k < kNumAnchor; ++k) {\n    float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);\n    if (box_prob < kIgnoreThresh) continue;\n    int class_id = 0;\n    float max_cls_prob = 0.0;\n    for (int i = 5; i < 5 + classes; ++i) {\n      float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);\n      if (p > max_cls_prob) {\n        max_cls_prob = p;\n        class_id = i - 5;\n      }\n    }\n    float *res_count = output + bnIdx * outputElem;\n    int count = (int)atomicAdd(res_count, 1);\n    if (count >= maxoutobject) return;\n    char *data = (char*)res_count + sizeof(float) + count * sizeof(Detection);\n    Detection *det = (Detection*)(data);\n\n    int row = idx / yoloWidth;\n    int col = idx % yoloWidth;\n\n    det->bbox[0] = (col - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 0 * total_grid])) * netwidth / yoloWidth;\n    det->bbox[1] = (row - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid])) * netheight / yoloHeight;\n\n    det->bbox[2] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 2 * total_grid]);\n    det->bbox[2] = det->bbox[2] * det->bbox[2] * anchors[2 * k];\n    det->bbox[3] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 3 * total_grid]);\n    det->bbox[3] = det->bbox[3] * det->bbox[3] * anchors[2 * k + 1];\n    det->conf = box_prob * max_cls_prob;\n    det->class_id = class_id;\n\n    for (int i = 0; is_segmentation && i < 32; i++) {\n      det->mask[i] = curInput[idx + k * info_len_i * total_grid + (i + 5 + classes) * total_grid];\n    }\n  }\n}\n\nvoid YoloLayerPlugin::forwardGpu(const float* const* inputs, float *output, cudaStream_t stream, int batchSize) {\n  int 
outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n  for (int idx = 0; idx < batchSize; ++idx) {\n    CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n  }\n  int numElem = 0;\n  for (unsigned int i = 0; i < mYoloKernel.size(); ++i) {\n    const auto& yolo = mYoloKernel[i];\n    numElem = yolo.width * yolo.height * batchSize;\n    if (numElem < mThreadCount) mThreadCount = numElem;\n\n    CalDetection << < (numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream >> >\n      (inputs[i], output, numElem, mYoloV5NetWidth, mYoloV5NetHeight, mMaxOutObject, yolo.width, yolo.height, (float*)mAnchor[i], mClassCount, outputElem, is_segmentation_);\n  }\n}\n\n\nint YoloLayerPlugin::enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT {\n  forwardGpu((const float* const*)inputs, (float*)outputs[0], stream, batchSize);\n  return 0;\n}\n\nPluginFieldCollection YoloPluginCreator::mFC{};\nstd::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\nYoloPluginCreator::YoloPluginCreator() {\n  mPluginAttributes.clear();\n  mFC.nbFields = mPluginAttributes.size();\n  mFC.fields = mPluginAttributes.data();\n}\n\nconst char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT {\n  return \"YoloLayer_TRT\";\n}\n\nconst char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT {\n  return \"1\";\n}\n\nconst PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT {\n  return &mFC;\n}\n\nIPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT {\n  assert(fc->nbFields == 2);\n  assert(strcmp(fc->fields[0].name, \"netinfo\") == 0);\n  assert(strcmp(fc->fields[1].name, \"kernels\") == 0);\n  int *p_netinfo = (int*)(fc->fields[0].data);\n  int class_count = p_netinfo[0];\n  int input_w = p_netinfo[1];\n  int input_h = p_netinfo[2];\n  int max_output_object_count = 
p_netinfo[3];\n  bool is_segmentation = (bool)p_netinfo[4];\n  std::vector<YoloKernel> kernels(fc->fields[1].length);\n  memcpy(&kernels[0], fc->fields[1].data, kernels.size() * sizeof(YoloKernel));\n  YoloLayerPlugin* obj = new YoloLayerPlugin(class_count, input_w, input_h, max_output_object_count, is_segmentation, kernels);\n  obj->setPluginNamespace(mNamespace.c_str());\n  return obj;\n}\n\nIPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT {\n  // This object will be deleted when the network is destroyed, which will\n  // call YoloLayerPlugin::destroy()\n  YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n  obj->setPluginNamespace(mNamespace.c_str());\n  return obj;\n}\n}\n\n"
  },
  {
    "path": "yolov5/plugin/yololayer.h",
    "content": "#pragma once\n\n#include \"types.h\"\n#include \"macros.h\"\n\n#include <vector>\n#include <string>\n\nnamespace nvinfer1 {\nclass API YoloLayerPlugin : public IPluginV2IOExt {\npublic:\n  YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, bool is_segmentation, const std::vector<YoloKernel>& vYoloKernel);\n  YoloLayerPlugin(const void* data, size_t length);\n  ~YoloLayerPlugin();\n\n  int getNbOutputs() const TRT_NOEXCEPT override { return 1; }\n\n  Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n  int initialize() TRT_NOEXCEPT override;\n\n  virtual void terminate() TRT_NOEXCEPT override {};\n\n  virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n  virtual int enqueue(int batchSize, const void* const* inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;\n\n  virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n  virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n  bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {\n    return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n  }\n\n  const char* getPluginType() const TRT_NOEXCEPT override;\n\n  const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n  void destroy() TRT_NOEXCEPT override;\n\n  IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n  void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n  const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n  DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override;\n\n  bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;\n\n  bool canBroadcastInputAcrossBatch(int 
inputIndex) const TRT_NOEXCEPT override;\n\n  void attachToContext(\n      cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n  void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT override;\n\n  void detachFromContext() TRT_NOEXCEPT override;\n\n private:\n  void forwardGpu(const float* const* inputs, float *output, cudaStream_t stream, int batchSize = 1);\n  int mThreadCount = 256;\n  const char* mPluginNamespace;\n  int mKernelCount;\n  int mClassCount;\n  int mYoloV5NetWidth;\n  int mYoloV5NetHeight;\n  int mMaxOutObject;\n  bool is_segmentation_;\n  std::vector<YoloKernel> mYoloKernel;\n  void** mAnchor;\n};\n\nclass API YoloPluginCreator : public IPluginCreator {\n public:\n  YoloPluginCreator();\n\n  ~YoloPluginCreator() override = default;\n\n  const char* getPluginName() const TRT_NOEXCEPT override;\n\n  const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n  const PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n  IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n  IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT override;\n\n  void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override {\n    mNamespace = libNamespace;\n  }\n\n  const char* getPluginNamespace() const TRT_NOEXCEPT override {\n    return mNamespace.c_str();\n  }\n\n private:\n  std::string mNamespace;\n  static PluginFieldCollection mFC;\n  static std::vector<PluginField> mPluginAttributes;\n};\nREGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n};\n\n"
  },
  {
    "path": "yolov5/src/calibrator.cpp",
    "content": "#include \"calibrator.h\"\n#include \"cuda_utils.h\"\n#include \"utils.h\"\n\n#include <iostream>\n#include <iterator>\n#include <fstream>\n#include <opencv2/opencv.hpp>\n#include <opencv2/dnn/dnn.hpp>\n\ncv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n  int w, h, x, y;\n  float r_w = input_w / (img.cols * 1.0);\n  float r_h = input_h / (img.rows * 1.0);\n  if (r_h > r_w) {\n    w = input_w;\n    h = r_w * img.rows;\n    x = 0;\n    y = (input_h - h) / 2;\n  } else {\n    w = r_h * img.cols;\n    h = input_h;\n    x = (input_w - w) / 2;\n    y = 0;\n  }\n  cv::Mat re(h, w, CV_8UC3);\n  cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n  cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n  re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n  return out;\n}\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache)\n    : batchsize_(batchsize),\n      input_w_(input_w),\n      input_h_(input_h),\n      img_idx_(0),\n      img_dir_(img_dir),\n      calib_table_name_(calib_table_name),\n      input_blob_name_(input_blob_name),\n      read_cache_(read_cache) {\n  input_count_ = 3 * input_w * input_h * batchsize;\n  CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n  read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2() {\n  CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT {\n  return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT {\n  if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n    return false;\n  }\n\n  std::vector<cv::Mat> input_imgs_;\n  for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n    std::cout << img_files_[i] << \"  \" << i << std::endl;\n    cv::Mat temp = 
cv::imread(img_dir_ + img_files_[i]);\n    if (temp.empty()) {\n      std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n      return false;\n    }\n    cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n    input_imgs_.push_back(pr_img);\n  }\n  img_idx_ += batchsize_;\n  cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0), true, false);\n\n  CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n  assert(!strcmp(names[0], input_blob_name_));\n  bindings[0] = device_input_;\n  return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT {\n  std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n  calib_cache_.clear();\n  std::ifstream input(calib_table_name_, std::ios::binary);\n  input >> std::noskipws;\n  if (read_cache_ && input.good()) {\n    std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n  }\n  length = calib_cache_.size();\n  return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT {\n  std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n  std::ofstream output(calib_table_name_, std::ios::binary);\n  output.write(reinterpret_cast<const char*>(cache), length);\n}\n\n"
  },
  {
    "path": "yolov5/src/calibrator.h",
    "content": "#pragma once\n\n#include \"macros.h\"\n#include <string>\n#include <vector>\n#include <opencv2/opencv.hpp>\n\ncv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h);\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2 {\n public:\n  Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache = true);\n\n  virtual ~Int8EntropyCalibrator2();\n  int getBatchSize() const TRT_NOEXCEPT override;\n  bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n  const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n  void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\n private:\n  int batchsize_;\n  int input_w_;\n  int input_h_;\n  int img_idx_;\n  std::string img_dir_;\n  std::vector<std::string> img_files_;\n  size_t input_count_;\n  std::string calib_table_name_;\n  const char* input_blob_name_;\n  bool read_cache_;\n  void* device_input_;\n  std::vector<char> calib_cache_;\n};\n\n"
  },
  {
    "path": "yolov5/src/config.h",
    "content": "#pragma once\n\n/* --------------------------------------------------------\n * These configs are related to tensorrt model, if these are changed,\n * please re-compile and re-serialize the tensorrt model.\n * --------------------------------------------------------*/\n\n// For INT8, you need prepare the calibration dataset, please refer to\n// https://github.com/wang-xinyu/tensorrtx/tree/master/yolov5#int8-quantization\n#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32\n\n// These are used to define input/output tensor names,\n// you can set them to whatever you want.\nconst static char* kInputTensorName = \"data\";\nconst static char* kOutputTensorName = \"prob\";\n\n// Detection model and Segmentation model' number of classes\nconstexpr static int kNumClass = 80;\n\n// Classfication model's number of classes\nconstexpr static int kClsNumClass = 1000;\n\nconstexpr static int kBatchSize = 1;\n\n// Yolo's input width and height must by divisible by 32\nconstexpr static int kInputH = 640;\nconstexpr static int kInputW = 640;\n\n// Classfication model's input shape\nconstexpr static int kClsInputH = 224;\nconstexpr static int kClsInputW = 224;\n\n// Maximum number of output bounding boxes from yololayer plugin.\n// That is maximum number of output bounding boxes before NMS.\nconstexpr static int kMaxNumOutputBbox = 1000;\n\nconstexpr static int kNumAnchor = 3;\n\n// The bboxes whose confidence is lower than kIgnoreThresh will be ignored in yololayer plugin.\nconstexpr static float kIgnoreThresh = 0.1f;\n\n/* --------------------------------------------------------\n * These configs are NOT related to tensorrt model, if these are changed,\n * please re-compile, but no need to re-serialize the tensorrt model.\n * --------------------------------------------------------*/\n\n// NMS overlapping thresh and final detection confidence thresh\nconst static float kNmsThresh = 0.45f;\nconst static float kConfThresh = 0.5f;\n\nconst static int kGpuId = 
0;\n\n// If your image size is larger than 4096 * 3112, please increase this value\nconst static int kMaxInputImageSize = 4096 * 3112;\n\n"
  },
  {
    "path": "yolov5/src/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)\\\n    {\\\n        cudaError_t error_code = callstr;\\\n        if (error_code != cudaSuccess) {\\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__;\\\n            assert(0);\\\n        }\\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n\n"
  },
  {
    "path": "yolov5/src/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the 
stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override \n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov5/src/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "yolov5/src/model.cpp",
    "content": "#include \"model.h\"\n#include \"calibrator.h\"\n#include \"config.h\"\n#include \"yololayer.h\"\n\n#include <iostream>\n#include <fstream>\n#include <map>\n#include <cassert>\n#include <cmath>\n#include <cstring>\n\nusing namespace nvinfer1;\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstatic std::map<std::string, Weights> loadWeights(const std::string file) {\n  std::cout << \"Loading weights: \" << file << std::endl;\n  std::map<std::string, Weights> weightMap;\n\n  // Open weights file\n  std::ifstream input(file);\n  assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n  // Read number of weight blobs\n  int32_t count;\n  input >> count;\n  assert(count > 0 && \"Invalid weight map file.\");\n\n  while (count--) {\n    Weights wt{ DataType::kFLOAT, nullptr, 0 };\n    uint32_t size;\n\n    // Read name and type of blob\n    std::string name;\n    input >> name >> std::dec >> size;\n    wt.type = DataType::kFLOAT;\n\n    // Load blob\n    uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n    for (uint32_t x = 0, y = size; x < y; ++x) {\n      input >> std::hex >> val[x];\n    }\n    wt.values = val;\n\n    wt.count = size;\n    weightMap[name] = wt;\n  }\n\n  return weightMap;\n}\n\nstatic int get_width(int x, float gw, int divisor = 8) {\n  return int(ceil((x * gw) / divisor)) * divisor;\n}\n\nstatic int get_depth(int x, float gd) {\n  if (x == 1) return 1;\n  int r = round(x * gd);\n  if (x * gd - int(x * gd) == 0.5 && (int(x * gd) % 2) == 0) {\n    --r;\n  }\n  return std::max<int>(r, 1);\n}\n\nstatic IScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n  float* gamma = (float*)weightMap[lname + \".weight\"].values;\n  float* beta = (float*)weightMap[lname + \".bias\"].values;\n  float* mean = 
(float*)weightMap[lname + \".running_mean\"].values;\n  float* var = (float*)weightMap[lname + \".running_var\"].values;\n  int len = weightMap[lname + \".running_var\"].count;\n\n  float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n  for (int i = 0; i < len; i++) {\n    scval[i] = gamma[i] / sqrt(var[i] + eps);\n  }\n  Weights scale{ DataType::kFLOAT, scval, len };\n\n  float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n  for (int i = 0; i < len; i++) {\n    shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n  }\n  Weights shift{ DataType::kFLOAT, shval, len };\n\n  float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n  for (int i = 0; i < len; i++) {\n    pval[i] = 1.0;\n  }\n  Weights power{ DataType::kFLOAT, pval, len };\n\n  weightMap[lname + \".scale\"] = scale;\n  weightMap[lname + \".shift\"] = shift;\n  weightMap[lname + \".power\"] = power;\n  IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n  assert(scale_1);\n  return scale_1;\n}\n\nstatic ILayer* convBlock(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int g, std::string lname) {\n  Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n  int p = ksize / 3;\n  IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ ksize, ksize }, weightMap[lname + \".conv.weight\"], emptywts);\n  assert(conv1);\n  conv1->setStrideNd(DimsHW{ s, s });\n  conv1->setPaddingNd(DimsHW{ p, p });\n  conv1->setNbGroups(g);\n  conv1->setName((lname + \".conv\").c_str());\n  IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn\", 1e-3);\n\n  // silu = x * sigmoid\n  auto sig = network->addActivation(*bn1->getOutput(0), ActivationType::kSIGMOID);\n  assert(sig);\n  auto ew = network->addElementWise(*bn1->getOutput(0), *sig->getOutput(0), ElementWiseOperation::kPROD);\n  assert(ew);\n  return 
ew;\n}\n\nstatic ILayer* focus(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch, int outch, int ksize, std::string lname) {\n  ISliceLayer* s1 = network->addSlice(input, Dims3{ 0, 0, 0 }, Dims3{ inch, kInputH / 2, kInputW / 2 }, Dims3{ 1, 2, 2 });\n  ISliceLayer* s2 = network->addSlice(input, Dims3{ 0, 1, 0 }, Dims3{ inch, kInputH / 2, kInputW / 2 }, Dims3{ 1, 2, 2 });\n  ISliceLayer* s3 = network->addSlice(input, Dims3{ 0, 0, 1 }, Dims3{ inch, kInputH / 2, kInputW / 2 }, Dims3{ 1, 2, 2 });\n  ISliceLayer* s4 = network->addSlice(input, Dims3{ 0, 1, 1 }, Dims3{ inch, kInputH / 2, kInputW / 2 }, Dims3{ 1, 2, 2 });\n  ITensor* inputTensors[] = { s1->getOutput(0), s2->getOutput(0), s3->getOutput(0), s4->getOutput(0) };\n  auto cat = network->addConcatenation(inputTensors, 4);\n  auto conv = convBlock(network, weightMap, *cat->getOutput(0), outch, ksize, 1, 1, lname + \".conv\");\n  return conv;\n}\n\nstatic ILayer* bottleneck(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, bool shortcut, int g, float e, std::string lname) {\n  auto cv1 = convBlock(network, weightMap, input, (int)((float)c2 * e), 1, 1, 1, lname + \".cv1\");\n  auto cv2 = convBlock(network, weightMap, *cv1->getOutput(0), c2, 3, 1, g, lname + \".cv2\");\n  if (shortcut && c1 == c2) {\n    auto ew = network->addElementWise(input, *cv2->getOutput(0), ElementWiseOperation::kSUM);\n    return ew;\n  }\n  return cv2;\n}\n\nstatic ILayer* bottleneckCSP(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, int n, bool shortcut, int g, float e, std::string lname) {\n  Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n  int c_ = (int)((float)c2 * e);\n  auto cv1 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv1\");\n  auto cv2 = network->addConvolutionNd(input, c_, DimsHW{ 1, 1 }, weightMap[lname + \".cv2.weight\"], emptywts);\n  ITensor* y1 = 
cv1->getOutput(0);\n  for (int i = 0; i < n; i++) {\n    auto b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, g, 1.0, lname + \".m.\" + std::to_string(i));\n    y1 = b->getOutput(0);\n  }\n  auto cv3 = network->addConvolutionNd(*y1, c_, DimsHW{ 1, 1 }, weightMap[lname + \".cv3.weight\"], emptywts);\n\n  ITensor* inputTensors[] = { cv3->getOutput(0), cv2->getOutput(0) };\n  auto cat = network->addConcatenation(inputTensors, 2);\n\n  IScaleLayer* bn = addBatchNorm2d(network, weightMap, *cat->getOutput(0), lname + \".bn\", 1e-4);\n  auto lr = network->addActivation(*bn->getOutput(0), ActivationType::kLEAKY_RELU);\n  lr->setAlpha(0.1);\n\n  auto cv4 = convBlock(network, weightMap, *lr->getOutput(0), c2, 1, 1, 1, lname + \".cv4\");\n  return cv4;\n}\n\nstatic ILayer* C3(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, int n, bool shortcut, int g, float e, std::string lname) {\n  int c_ = (int)((float)c2 * e);\n  auto cv1 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv1\");\n  auto cv2 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv2\");\n  ITensor *y1 = cv1->getOutput(0);\n  for (int i = 0; i < n; i++) {\n    auto b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, g, 1.0, lname + \".m.\" + std::to_string(i));\n    y1 = b->getOutput(0);\n  }\n\n  ITensor* inputTensors[] = { y1, cv2->getOutput(0) };\n  auto cat = network->addConcatenation(inputTensors, 2);\n\n  auto cv3 = convBlock(network, weightMap, *cat->getOutput(0), c2, 1, 1, 1, lname + \".cv3\");\n  return cv3;\n}\n\nstatic ILayer* SPP(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, int k1, int k2, int k3, std::string lname) {\n  int c_ = c1 / 2;\n  auto cv1 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv1\");\n\n  auto pool1 = network->addPoolingNd(*cv1->getOutput(0), PoolingType::kMAX, DimsHW{ k1, k1 });\n  pool1->setPaddingNd(DimsHW{ 
k1 / 2, k1 / 2 });\n  pool1->setStrideNd(DimsHW{ 1, 1 });\n  auto pool2 = network->addPoolingNd(*cv1->getOutput(0), PoolingType::kMAX, DimsHW{ k2, k2 });\n  pool2->setPaddingNd(DimsHW{ k2 / 2, k2 / 2 });\n  pool2->setStrideNd(DimsHW{ 1, 1 });\n  auto pool3 = network->addPoolingNd(*cv1->getOutput(0), PoolingType::kMAX, DimsHW{ k3, k3 });\n  pool3->setPaddingNd(DimsHW{ k3 / 2, k3 / 2 });\n  pool3->setStrideNd(DimsHW{ 1, 1 });\n\n  ITensor* inputTensors[] = { cv1->getOutput(0), pool1->getOutput(0), pool2->getOutput(0), pool3->getOutput(0) };\n  auto cat = network->addConcatenation(inputTensors, 4);\n\n  auto cv2 = convBlock(network, weightMap, *cat->getOutput(0), c2, 1, 1, 1, lname + \".cv2\");\n  return cv2;\n}\n\nstatic ILayer* SPPF(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, int k, std::string lname) {\n  int c_ = c1 / 2;\n  auto cv1 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv1\");\n\n  auto pool1 = network->addPoolingNd(*cv1->getOutput(0), PoolingType::kMAX, DimsHW{ k, k });\n  pool1->setPaddingNd(DimsHW{ k / 2, k / 2 });\n  pool1->setStrideNd(DimsHW{ 1, 1 });\n  auto pool2 = network->addPoolingNd(*pool1->getOutput(0), PoolingType::kMAX, DimsHW{ k, k });\n  pool2->setPaddingNd(DimsHW{ k / 2, k / 2 });\n  pool2->setStrideNd(DimsHW{ 1, 1 });\n  auto pool3 = network->addPoolingNd(*pool2->getOutput(0), PoolingType::kMAX, DimsHW{ k, k });\n  pool3->setPaddingNd(DimsHW{ k / 2, k / 2 });\n  pool3->setStrideNd(DimsHW{ 1, 1 });\n  ITensor* inputTensors[] = { cv1->getOutput(0), pool1->getOutput(0), pool2->getOutput(0), pool3->getOutput(0) };\n  auto cat = network->addConcatenation(inputTensors, 4);\n  auto cv2 = convBlock(network, weightMap, *cat->getOutput(0), c2, 1, 1, 1, lname + \".cv2\");\n  return cv2;\n}\n\nstatic ILayer* Proto(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c_, int c2, std::string lname) {\n  auto cv1 = convBlock(network, 
weightMap, input, c_, 3, 1, 1, lname + \".cv1\");\n\n  auto upsample = network->addResize(*cv1->getOutput(0));\n  assert(upsample);\n  upsample->setResizeMode(ResizeMode::kNEAREST);\n  const float scales[] = {1, 2, 2};\n  upsample->setScales(scales, 3);\n\n  auto cv2 = convBlock(network, weightMap, *upsample->getOutput(0), c_, 3, 1, 1, lname + \".cv2\");\n  auto cv3 = convBlock(network, weightMap, *cv2->getOutput(0), c2, 1, 1, 1, lname + \".cv3\");\n  assert(cv3);\n  return cv3;\n}\n\nstatic std::vector<std::vector<float>> getAnchors(std::map<std::string, Weights>& weightMap, std::string lname) {\n  std::vector<std::vector<float>> anchors;\n  Weights wts = weightMap[lname + \".anchor_grid\"];\n  int anchor_len = kNumAnchor * 2;\n  for (int i = 0; i < wts.count / anchor_len; i++) {\n    auto *p = (const float*)wts.values + i * anchor_len;\n    std::vector<float> anchor(p, p + anchor_len);\n    anchors.push_back(anchor);\n  }\n  return anchors;\n}\n\nstatic IPluginV2Layer* addYoLoLayer(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, std::string lname, std::vector<IConvolutionLayer*> dets, bool is_segmentation = false) {\n  auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n  auto anchors = getAnchors(weightMap, lname);\n  PluginField plugin_fields[2];\n  int netinfo[5] = {kNumClass, kInputW, kInputH, kMaxNumOutputBbox, (int)is_segmentation};\n  plugin_fields[0].data = netinfo;\n  plugin_fields[0].length = 5;\n  plugin_fields[0].name = \"netinfo\";\n  plugin_fields[0].type = PluginFieldType::kFLOAT32;\n\n  //load strides from Detect layer\n  assert(weightMap.find(lname + \".strides\") != weightMap.end() && \"Not found `strides`, please check gen_wts.py!!!\");\n  Weights strides = weightMap[lname + \".strides\"];\n  auto *p = (const float*)(strides.values);\n  std::vector<int> scales(p, p + strides.count);\n\n  std::vector<YoloKernel> kernels;\n  for (size_t i = 0; i < anchors.size(); i++) {\n    YoloKernel 
kernel;\n    kernel.width = kInputW / scales[i];\n    kernel.height = kInputH / scales[i];\n    memcpy(kernel.anchors, &anchors[i][0], anchors[i].size() * sizeof(float));\n    kernels.push_back(kernel);\n  }\n  plugin_fields[1].data = &kernels[0];\n  plugin_fields[1].length = kernels.size();\n  plugin_fields[1].name = \"kernels\";\n  plugin_fields[1].type = PluginFieldType::kFLOAT32;\n  PluginFieldCollection plugin_data;\n  plugin_data.nbFields = 2;\n  plugin_data.fields = plugin_fields;\n  IPluginV2 *plugin_obj = creator->createPlugin(\"yololayer\", &plugin_data);\n  std::vector<ITensor*> input_tensors;\n  for (auto det: dets) {\n    input_tensors.push_back(det->getOutput(0));\n  }\n  auto yolo = network->addPluginV2(&input_tensors[0], input_tensors.size(), *plugin_obj);\n  return yolo;\n}\n\nICudaEngine* build_det_engine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, float& gd, float& gw, std::string& wts_name) {\n  INetworkDefinition* network = builder->createNetworkV2(0U);\n\n  // Create input tensor of shape {3, kInputH, kInputW}\n  ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kInputH, kInputW });\n  assert(data);\n  std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n  // Backbone\n  auto conv0 = convBlock(network, weightMap, *data,  get_width(64, gw), 6, 2, 1,  \"model.0\");\n  assert(conv0);\n  auto conv1 = convBlock(network, weightMap, *conv0->getOutput(0), get_width(128, gw), 3, 2, 1, \"model.1\");\n  auto bottleneck_CSP2 = C3(network, weightMap, *conv1->getOutput(0), get_width(128, gw), get_width(128, gw), get_depth(3, gd), true, 1, 0.5, \"model.2\");\n  auto conv3 = convBlock(network, weightMap, *bottleneck_CSP2->getOutput(0), get_width(256, gw), 3, 2, 1, \"model.3\");\n  auto bottleneck_csp4 = C3(network, weightMap, *conv3->getOutput(0), get_width(256, gw), get_width(256, gw), get_depth(6, gd), true, 1, 0.5, \"model.4\");\n  auto conv5 = convBlock(network, weightMap, 
*bottleneck_csp4->getOutput(0), get_width(512, gw), 3, 2, 1, \"model.5\");\n  auto bottleneck_csp6 = C3(network, weightMap, *conv5->getOutput(0), get_width(512, gw), get_width(512, gw), get_depth(9, gd), true, 1, 0.5, \"model.6\");\n  auto conv7 = convBlock(network, weightMap, *bottleneck_csp6->getOutput(0), get_width(1024, gw), 3, 2, 1, \"model.7\");\n  auto bottleneck_csp8 = C3(network, weightMap, *conv7->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), true, 1, 0.5, \"model.8\");\n  auto spp9 = SPPF(network, weightMap, *bottleneck_csp8->getOutput(0), get_width(1024, gw), get_width(1024, gw), 5, \"model.9\");\n\n  // Head\n  auto conv10 = convBlock(network, weightMap, *spp9->getOutput(0), get_width(512, gw), 1, 1, 1, \"model.10\");\n\n  auto upsample11 = network->addResize(*conv10->getOutput(0));\n  assert(upsample11);\n  upsample11->setResizeMode(ResizeMode::kNEAREST);\n  upsample11->setOutputDimensions(bottleneck_csp6->getOutput(0)->getDimensions());\n\n  ITensor* inputTensors12[] = { upsample11->getOutput(0), bottleneck_csp6->getOutput(0) };\n  auto cat12 = network->addConcatenation(inputTensors12, 2);\n  auto bottleneck_csp13 = C3(network, weightMap, *cat12->getOutput(0), get_width(1024, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, \"model.13\");\n  auto conv14 = convBlock(network, weightMap, *bottleneck_csp13->getOutput(0), get_width(256, gw), 1, 1, 1, \"model.14\");\n\n  auto upsample15 = network->addResize(*conv14->getOutput(0));\n  assert(upsample15);\n  upsample15->setResizeMode(ResizeMode::kNEAREST);\n  upsample15->setOutputDimensions(bottleneck_csp4->getOutput(0)->getDimensions());\n\n  ITensor* inputTensors16[] = { upsample15->getOutput(0), bottleneck_csp4->getOutput(0) };\n  auto cat16 = network->addConcatenation(inputTensors16, 2);\n\n  auto bottleneck_csp17 = C3(network, weightMap, *cat16->getOutput(0), get_width(512, gw), get_width(256, gw), get_depth(3, gd), false, 1, 0.5, \"model.17\");\n\n  // Detect\n  
IConvolutionLayer* det0 = network->addConvolutionNd(*bottleneck_csp17->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.24.m.0.weight\"], weightMap[\"model.24.m.0.bias\"]);\n  auto conv18 = convBlock(network, weightMap, *bottleneck_csp17->getOutput(0), get_width(256, gw), 3, 2, 1, \"model.18\");\n  ITensor* inputTensors19[] = { conv18->getOutput(0), conv14->getOutput(0) };\n  auto cat19 = network->addConcatenation(inputTensors19, 2);\n  auto bottleneck_csp20 = C3(network, weightMap, *cat19->getOutput(0), get_width(512, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, \"model.20\");\n  IConvolutionLayer* det1 = network->addConvolutionNd(*bottleneck_csp20->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.24.m.1.weight\"], weightMap[\"model.24.m.1.bias\"]);\n  auto conv21 = convBlock(network, weightMap, *bottleneck_csp20->getOutput(0), get_width(512, gw), 3, 2, 1, \"model.21\");\n  ITensor* inputTensors22[] = { conv21->getOutput(0), conv10->getOutput(0) };\n  auto cat22 = network->addConcatenation(inputTensors22, 2);\n  auto bottleneck_csp23 = C3(network, weightMap, *cat22->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), false, 1, 0.5, \"model.23\");\n  IConvolutionLayer* det2 = network->addConvolutionNd(*bottleneck_csp23->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.24.m.2.weight\"], weightMap[\"model.24.m.2.bias\"]);\n\n  auto yolo = addYoLoLayer(network, weightMap, \"model.24\", std::vector<IConvolutionLayer*>{det0, det1, det2});\n  yolo->getOutput(0)->setName(kOutputTensorName);\n  network->markOutput(*yolo->getOutput(0));\n\n  // Engine config\n  builder->setMaxBatchSize(maxBatchSize);\n  config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n  config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n  std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n  assert(builder->platformHasFastInt8());\n  config->setFlag(BuilderFlag::kINT8);\n  Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n  config->setInt8Calibrator(calibrator);\n#endif\n\n  std::cout << \"Building engine, please wait for a while...\" << std::endl;\n  ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n  std::cout << \"Build engine successfully!\" << std::endl;\n\n  // Don't need the network any more\n  network->destroy();\n\n  // Release host memory\n  for (auto& mem : weightMap) {\n    free((void*)(mem.second.values));\n  }\n\n  return engine;\n}\n\nICudaEngine* build_det_p6_engine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, float& gd, float& gw, std::string& wts_name) {\n  INetworkDefinition* network = builder->createNetworkV2(0U);\n\n  // Create input tensor of shape {3, kInputH, kInputW}\n  ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kInputH, kInputW });\n  assert(data);\n\n  std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n  // Backbone\n  auto conv0 = convBlock(network, weightMap, *data,  get_width(64, gw), 6, 2, 1,  \"model.0\");\n  auto conv1 = convBlock(network, weightMap, *conv0->getOutput(0), get_width(128, gw), 3, 2, 1, \"model.1\");\n  auto c3_2 = C3(network, weightMap, *conv1->getOutput(0), get_width(128, gw), get_width(128, gw), get_depth(3, gd), true, 1, 0.5, \"model.2\");\n  auto conv3 = convBlock(network, weightMap, *c3_2->getOutput(0), get_width(256, gw), 3, 2, 1, \"model.3\");\n  auto c3_4 = C3(network, weightMap, *conv3->getOutput(0), get_width(256, gw), get_width(256, gw), get_depth(6, gd), true, 1, 0.5, \"model.4\");\n  auto conv5 = convBlock(network, weightMap, *c3_4->getOutput(0), get_width(512, gw), 3, 2, 1, \"model.5\");\n  auto c3_6 = C3(network, weightMap, *conv5->getOutput(0), 
get_width(512, gw), get_width(512, gw), get_depth(9, gd), true, 1, 0.5, \"model.6\");\n  auto conv7 = convBlock(network, weightMap, *c3_6->getOutput(0), get_width(768, gw), 3, 2, 1, \"model.7\");\n  auto c3_8 = C3(network, weightMap, *conv7->getOutput(0), get_width(768, gw), get_width(768, gw), get_depth(3, gd), true, 1, 0.5, \"model.8\");\n  auto conv9 = convBlock(network, weightMap, *c3_8->getOutput(0), get_width(1024, gw), 3, 2, 1, \"model.9\");\n  auto c3_10 = C3(network, weightMap, *conv9->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), true, 1, 0.5, \"model.10\");\n  auto sppf11 = SPPF(network, weightMap, *c3_10->getOutput(0), get_width(1024, gw), get_width(1024, gw), 5, \"model.11\");\n\n  // Head\n  auto conv12 = convBlock(network, weightMap, *sppf11->getOutput(0), get_width(768, gw), 1, 1, 1, \"model.12\");\n  auto upsample13 = network->addResize(*conv12->getOutput(0));\n  assert(upsample13);\n  upsample13->setResizeMode(ResizeMode::kNEAREST);\n  upsample13->setOutputDimensions(c3_8->getOutput(0)->getDimensions());\n  ITensor* inputTensors14[] = { upsample13->getOutput(0), c3_8->getOutput(0) };\n  auto cat14 = network->addConcatenation(inputTensors14, 2);\n  auto c3_15 = C3(network, weightMap, *cat14->getOutput(0), get_width(1536, gw), get_width(768, gw), get_depth(3, gd), false, 1, 0.5, \"model.15\");\n\n  auto conv16 = convBlock(network, weightMap, *c3_15->getOutput(0), get_width(512, gw), 1, 1, 1, \"model.16\");\n  auto upsample17 = network->addResize(*conv16->getOutput(0));\n  assert(upsample17);\n  upsample17->setResizeMode(ResizeMode::kNEAREST);\n  upsample17->setOutputDimensions(c3_6->getOutput(0)->getDimensions());\n  ITensor* inputTensors18[] = { upsample17->getOutput(0), c3_6->getOutput(0) };\n  auto cat18 = network->addConcatenation(inputTensors18, 2);\n  auto c3_19 = C3(network, weightMap, *cat18->getOutput(0), get_width(1024, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, \"model.19\");\n\n  auto conv20 = 
convBlock(network, weightMap, *c3_19->getOutput(0), get_width(256, gw), 1, 1, 1, \"model.20\");\n  auto upsample21 = network->addResize(*conv20->getOutput(0));\n  assert(upsample21);\n  upsample21->setResizeMode(ResizeMode::kNEAREST);\n  upsample21->setOutputDimensions(c3_4->getOutput(0)->getDimensions());\n  ITensor* inputTensors21[] = { upsample21->getOutput(0), c3_4->getOutput(0) };\n  auto cat22 = network->addConcatenation(inputTensors21, 2);\n  auto c3_23 = C3(network, weightMap, *cat22->getOutput(0), get_width(512, gw), get_width(256, gw), get_depth(3, gd), false, 1, 0.5, \"model.23\");\n\n  auto conv24 = convBlock(network, weightMap, *c3_23->getOutput(0), get_width(256, gw), 3, 2, 1, \"model.24\");\n  ITensor* inputTensors25[] = { conv24->getOutput(0), conv20->getOutput(0) };\n  auto cat25 = network->addConcatenation(inputTensors25, 2);\n  auto c3_26 = C3(network, weightMap, *cat25->getOutput(0), get_width(1024, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, \"model.26\");\n\n  auto conv27 = convBlock(network, weightMap, *c3_26->getOutput(0), get_width(512, gw), 3, 2, 1, \"model.27\");\n  ITensor* inputTensors28[] = { conv27->getOutput(0), conv16->getOutput(0) };\n  auto cat28 = network->addConcatenation(inputTensors28, 2);\n  auto c3_29 = C3(network, weightMap, *cat28->getOutput(0), get_width(1536, gw), get_width(768, gw), get_depth(3, gd), false, 1, 0.5, \"model.29\");\n\n  auto conv30 = convBlock(network, weightMap, *c3_29->getOutput(0), get_width(768, gw), 3, 2, 1, \"model.30\");\n  ITensor* inputTensors31[] = { conv30->getOutput(0), conv12->getOutput(0) };\n  auto cat31 = network->addConcatenation(inputTensors31, 2);\n  auto c3_32 = C3(network, weightMap, *cat31->getOutput(0), get_width(2048, gw), get_width(1024, gw), get_depth(3, gd), false, 1, 0.5, \"model.32\");\n\n  // Detect\n  IConvolutionLayer* det0 = network->addConvolutionNd(*c3_23->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.33.m.0.weight\"], 
weightMap[\"model.33.m.0.bias\"]);\n  IConvolutionLayer* det1 = network->addConvolutionNd(*c3_26->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.33.m.1.weight\"], weightMap[\"model.33.m.1.bias\"]);\n  IConvolutionLayer* det2 = network->addConvolutionNd(*c3_29->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.33.m.2.weight\"], weightMap[\"model.33.m.2.bias\"]);\n  IConvolutionLayer* det3 = network->addConvolutionNd(*c3_32->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.33.m.3.weight\"], weightMap[\"model.33.m.3.bias\"]);\n\n  auto yolo = addYoLoLayer(network, weightMap, \"model.33\", std::vector<IConvolutionLayer*>{det0, det1, det2, det3});\n  yolo->getOutput(0)->setName(kOutputTensorName);\n  network->markOutput(*yolo->getOutput(0));\n\n  // Engine config\n  builder->setMaxBatchSize(maxBatchSize);\n  config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n  config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n  std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n  assert(builder->platformHasFastInt8());\n  config->setFlag(BuilderFlag::kINT8);\n  Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n  config->setInt8Calibrator(calibrator);\n#endif\n\n  std::cout << \"Building engine, please wait for a while...\" << std::endl;\n  ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n  std::cout << \"Build engine successfully!\" << std::endl;\n\n  // Don't need the network any more\n  network->destroy();\n\n  // Release host memory\n  for (auto& mem : weightMap) {\n    free((void*)(mem.second.values));\n  }\n\n  return engine;\n}\n\nICudaEngine* build_cls_engine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, float& gd, float& gw, std::string& wts_name) {\n  INetworkDefinition* network = builder->createNetworkV2(0U);\n\n  // Create input tensor\n  ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kClsInputH, kClsInputW });\n  assert(data);\n  std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n  // Backbone\n  auto conv0 = convBlock(network, weightMap, *data,  get_width(64, gw), 6, 2, 1,  \"model.0\");\n  assert(conv0);\n  auto conv1 = convBlock(network, weightMap, *conv0->getOutput(0), get_width(128, gw), 3, 2, 1, \"model.1\");\n  auto bottleneck_CSP2 = C3(network, weightMap, *conv1->getOutput(0), get_width(128, gw), get_width(128, gw), get_depth(3, gd), true, 1, 0.5, \"model.2\");\n  auto conv3 = convBlock(network, weightMap, *bottleneck_CSP2->getOutput(0), get_width(256, gw), 3, 2, 1, \"model.3\");\n  auto bottleneck_csp4 = C3(network, weightMap, *conv3->getOutput(0), get_width(256, gw), get_width(256, gw), get_depth(6, gd), true, 1, 0.5, \"model.4\");\n  auto conv5 = convBlock(network, weightMap, *bottleneck_csp4->getOutput(0), get_width(512, gw), 3, 2, 1, \"model.5\");\n  auto bottleneck_csp6 = C3(network, 
weightMap, *conv5->getOutput(0), get_width(512, gw), get_width(512, gw), get_depth(9, gd), true, 1, 0.5, \"model.6\");\n  auto conv7 = convBlock(network, weightMap, *bottleneck_csp6->getOutput(0), get_width(1024, gw), 3, 2, 1, \"model.7\");\n  auto bottleneck_csp8 = C3(network, weightMap, *conv7->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), true, 1, 0.5, \"model.8\");\n\n  // Head\n  auto conv_class = convBlock(network, weightMap, *bottleneck_csp8->getOutput(0), 1280, 1, 1, 1, \"model.9.conv\");\n  int k = kClsInputH / 32;\n  IPoolingLayer* pool2 = network->addPoolingNd(*conv_class->getOutput(0), PoolingType::kAVERAGE, DimsHW{ k, k });\n  assert(pool2);\n  IFullyConnectedLayer* yolo = network->addFullyConnected(*pool2->getOutput(0), kClsNumClass, weightMap[\"model.9.linear.weight\"], weightMap[\"model.9.linear.bias\"]);\n  assert(yolo);\n\n  yolo->getOutput(0)->setName(kOutputTensorName);\n  network->markOutput(*yolo->getOutput(0));\n\n  // Engine config\n  builder->setMaxBatchSize(maxBatchSize);\n  config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n\n#if defined(USE_FP16)\n  config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n  std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n  assert(builder->platformHasFastInt8());\n  config->setFlag(BuilderFlag::kINT8);\n  Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kClsInputW, kClsInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n  config->setInt8Calibrator(calibrator);\n#endif\n\n  std::cout << \"Building engine, please wait for a while...\" << std::endl;\n  ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n  std::cout << \"Build engine successfully!\" << std::endl;\n\n  // Don't need the network any more\n  network->destroy();\n\n  // Release host memory\n  for (auto& mem : weightMap) {\n    free((void*)(mem.second.values));\n  }\n\n  return engine;\n}\n\nICudaEngine* build_seg_engine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, float& gd, float& gw, std::string& wts_name) {\n  INetworkDefinition* network = builder->createNetworkV2(0U);\n  ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kInputH, kInputW });\n  assert(data);\n  std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n  // Backbone\n  auto conv0 = convBlock(network, weightMap, *data,  get_width(64, gw), 6, 2, 1,  \"model.0\");\n  assert(conv0);\n  auto conv1 = convBlock(network, weightMap, *conv0->getOutput(0), get_width(128, gw), 3, 2, 1, \"model.1\");\n  auto bottleneck_CSP2 = C3(network, weightMap, *conv1->getOutput(0), get_width(128, gw), get_width(128, gw), get_depth(3, gd), true, 1, 0.5, \"model.2\");\n  auto conv3 = convBlock(network, weightMap, *bottleneck_CSP2->getOutput(0), get_width(256, gw), 3, 2, 1, \"model.3\");\n  auto bottleneck_csp4 = C3(network, weightMap, *conv3->getOutput(0), get_width(256, gw), get_width(256, gw), get_depth(6, gd), true, 1, 0.5, \"model.4\");\n  auto conv5 = convBlock(network, weightMap, *bottleneck_csp4->getOutput(0), get_width(512, gw), 3, 2, 1, \"model.5\");\n  auto bottleneck_csp6 = C3(network, weightMap, 
*conv5->getOutput(0), get_width(512, gw), get_width(512, gw), get_depth(9, gd), true, 1, 0.5, \"model.6\");\n  auto conv7 = convBlock(network, weightMap, *bottleneck_csp6->getOutput(0), get_width(1024, gw), 3, 2, 1, \"model.7\");\n  auto bottleneck_csp8 = C3(network, weightMap, *conv7->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), true, 1, 0.5, \"model.8\");\n  auto spp9 = SPPF(network, weightMap, *bottleneck_csp8->getOutput(0), get_width(1024, gw), get_width(1024, gw), 5, \"model.9\");\n\n  // Head\n  auto conv10 = convBlock(network, weightMap, *spp9->getOutput(0), get_width(512, gw), 1, 1, 1, \"model.10\");\n\n  auto upsample11 = network->addResize(*conv10->getOutput(0));\n  assert(upsample11);\n  upsample11->setResizeMode(ResizeMode::kNEAREST);\n  upsample11->setOutputDimensions(bottleneck_csp6->getOutput(0)->getDimensions());\n\n  ITensor* inputTensors12[] = { upsample11->getOutput(0), bottleneck_csp6->getOutput(0) };\n  auto cat12 = network->addConcatenation(inputTensors12, 2);\n  auto bottleneck_csp13 = C3(network, weightMap, *cat12->getOutput(0), get_width(1024, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, \"model.13\");\n  auto conv14 = convBlock(network, weightMap, *bottleneck_csp13->getOutput(0), get_width(256, gw), 1, 1, 1, \"model.14\");\n\n  auto upsample15 = network->addResize(*conv14->getOutput(0));\n  assert(upsample15);\n  upsample15->setResizeMode(ResizeMode::kNEAREST);\n  upsample15->setOutputDimensions(bottleneck_csp4->getOutput(0)->getDimensions());\n\n  ITensor* inputTensors16[] = { upsample15->getOutput(0), bottleneck_csp4->getOutput(0) };\n  auto cat16 = network->addConcatenation(inputTensors16, 2);\n\n  auto bottleneck_csp17 = C3(network, weightMap, *cat16->getOutput(0), get_width(512, gw), get_width(256, gw), get_depth(3, gd), false, 1, 0.5, \"model.17\");\n\n  // Segmentation\n  IConvolutionLayer* det0 = network->addConvolutionNd(*bottleneck_csp17->getOutput(0), kNumAnchor * (32 + kNumClass + 5), 
DimsHW{ 1, 1 }, weightMap[\"model.24.m.0.weight\"], weightMap[\"model.24.m.0.bias\"]);\n  auto conv18 = convBlock(network, weightMap, *bottleneck_csp17->getOutput(0), get_width(256, gw), 3, 2, 1, \"model.18\");\n  ITensor* inputTensors19[] = { conv18->getOutput(0), conv14->getOutput(0) };\n  auto cat19 = network->addConcatenation(inputTensors19, 2);\n  auto bottleneck_csp20 = C3(network, weightMap, *cat19->getOutput(0), get_width(512, gw), get_width(512, gw), get_depth(3, gd), false, 1, 0.5, \"model.20\");\n  IConvolutionLayer* det1 = network->addConvolutionNd(*bottleneck_csp20->getOutput(0), kNumAnchor * (32 + kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.24.m.1.weight\"], weightMap[\"model.24.m.1.bias\"]);\n  auto conv21 = convBlock(network, weightMap, *bottleneck_csp20->getOutput(0), get_width(512, gw), 3, 2, 1, \"model.21\");\n  ITensor* inputTensors22[] = { conv21->getOutput(0), conv10->getOutput(0) };\n  auto cat22 = network->addConcatenation(inputTensors22, 2);\n  auto bottleneck_csp23 = C3(network, weightMap, *cat22->getOutput(0), get_width(1024, gw), get_width(1024, gw), get_depth(3, gd), false, 1, 0.5, \"model.23\");\n  IConvolutionLayer* det2 = network->addConvolutionNd(*bottleneck_csp23->getOutput(0), kNumAnchor * (32 + kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.24.m.2.weight\"], weightMap[\"model.24.m.2.bias\"]);\n\n  auto yolo = addYoLoLayer(network, weightMap, \"model.24\", std::vector<IConvolutionLayer*>{det0, det1, det2}, true);\n  yolo->getOutput(0)->setName(kOutputTensorName);\n  network->markOutput(*yolo->getOutput(0));\n\n  auto proto = Proto(network, weightMap, *bottleneck_csp17->getOutput(0), get_width(256, gw), 32, \"model.24.proto\");\n  proto->getOutput(0)->setName(\"proto\");\n  network->markOutput(*proto->getOutput(0));\n\n  // Engine config\n  builder->setMaxBatchSize(maxBatchSize);\n  config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n  config->setFlag(BuilderFlag::kFP16);\n#elif 
defined(USE_INT8)\n  std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n  assert(builder->platformHasFastInt8());\n  config->setFlag(BuilderFlag::kINT8);\n  Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n  config->setInt8Calibrator(calibrator);\n#endif\n\n  std::cout << \"Building engine, please wait for a while...\" << std::endl;\n  ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\n  // buildEngineWithConfig returns nullptr on failure, so check before reporting success\n  assert(engine != nullptr);\n  std::cout << \"Build engine successfully!\" << std::endl;\n\n  // Don't need the network any more\n  network->destroy();\n\n  // Release host memory\n  for (auto& mem : weightMap) {\n    free((void*)(mem.second.values));\n  }\n\n  return engine;\n}\n\n"
  },
  {
    "path": "yolov5/src/model.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n#include <string>\n\nnvinfer1::ICudaEngine* build_det_engine(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                        nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt,\n                                        float& gd, float& gw, std::string& wts_name);\n\nnvinfer1::ICudaEngine* build_det_p6_engine(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                           nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt,\n                                           float& gd, float& gw, std::string& wts_name);\n\nnvinfer1::ICudaEngine* build_cls_engine(unsigned int maxBatchSize, nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, float& gd, float& gw, std::string& wts_name);\n\nnvinfer1::ICudaEngine* build_seg_engine(unsigned int maxBatchSize, nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, float& gd, float& gw, std::string& wts_name);\n"
  },
  {
    "path": "yolov5/src/postprocess.cpp",
    "content": "#include \"postprocess.h\"\n#include \"utils.h\"\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n  float l, r, t, b;\n  float r_w = kInputW / (img.cols * 1.0);\n  float r_h = kInputH / (img.rows * 1.0);\n  if (r_h > r_w) {\n    l = bbox[0] - bbox[2] / 2.f;\n    r = bbox[0] + bbox[2] / 2.f;\n    t = bbox[1] - bbox[3] / 2.f - (kInputH - r_w * img.rows) / 2;\n    b = bbox[1] + bbox[3] / 2.f - (kInputH - r_w * img.rows) / 2;\n    l = l / r_w;\n    r = r / r_w;\n    t = t / r_w;\n    b = b / r_w;\n  } else {\n    l = bbox[0] - bbox[2] / 2.f - (kInputW - r_h * img.cols) / 2;\n    r = bbox[0] + bbox[2] / 2.f - (kInputW - r_h * img.cols) / 2;\n    t = bbox[1] - bbox[3] / 2.f;\n    b = bbox[1] + bbox[3] / 2.f;\n    l = l / r_h;\n    r = r / r_h;\n    t = t / r_h;\n    b = b / r_h;\n  }\n  return cv::Rect(round(l), round(t), round(r - l), round(b - t));\n}\n\nstatic float iou(float lbox[4], float rbox[4]) {\n  float interBox[] = {\n    (std::max)(lbox[0] - lbox[2] / 2.f , rbox[0] - rbox[2] / 2.f), //left\n    (std::min)(lbox[0] + lbox[2] / 2.f , rbox[0] + rbox[2] / 2.f), //right\n    (std::max)(lbox[1] - lbox[3] / 2.f , rbox[1] - rbox[3] / 2.f), //top\n    (std::min)(lbox[1] + lbox[3] / 2.f , rbox[1] + rbox[3] / 2.f), //bottom\n  };\n\n  if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n    return 0.0f;\n\n  float interBoxS = (interBox[1] - interBox[0])*(interBox[3] - interBox[2]);\n  return interBoxS / (lbox[2] * lbox[3] + rbox[2] * rbox[3] - interBoxS);\n}\n\nstatic bool cmp(const Detection& a, const Detection& b) {\n  return a.conf > b.conf;\n}\n\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n  int det_size = sizeof(Detection) / sizeof(float);\n  std::map<float, std::vector<Detection>> m;\n  for (int i = 0; i < output[0] && i < kMaxNumOutputBbox; i++) {\n    if (output[1 + det_size * i + 4] <= conf_thresh) continue;\n    Detection det;\n    memcpy(&det, &output[1 + det_size * i], det_size * 
sizeof(float));\n    if (m.count(det.class_id) == 0) m.emplace(det.class_id, std::vector<Detection>());\n    m[det.class_id].push_back(det);\n  }\n  for (auto it = m.begin(); it != m.end(); it++) {\n    auto& dets = it->second;\n    std::sort(dets.begin(), dets.end(), cmp);\n    for (size_t m = 0; m < dets.size(); ++m) {\n      auto& item = dets[m];\n      res.push_back(item);\n      for (size_t n = m + 1; n < dets.size(); ++n) {\n        if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n          dets.erase(dets.begin() + n);\n          --n;\n        }\n      }\n    }\n  }\n}\n\nvoid batch_nms(std::vector<std::vector<Detection>>& res_batch, float *output, int batch_size, int output_size, float conf_thresh, float nms_thresh) {\n  res_batch.resize(batch_size);\n  for (int i = 0; i < batch_size; i++) {\n    nms(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n  }\n}\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n  for (size_t i = 0; i < img_batch.size(); i++) {\n    auto& res = res_batch[i];\n    cv::Mat img = img_batch[i];\n    for (size_t j = 0; j < res.size(); j++) {\n      cv::Rect r = get_rect(img, res[j].bbox);\n      cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n      cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n    }\n  }\n}\n\nstatic cv::Rect get_downscale_rect(float bbox[4], float scale) {\n  float left = bbox[0] - bbox[2] / 2;\n  float top = bbox[1] - bbox[3] / 2;\n  float right = bbox[0] + bbox[2] / 2;\n  float bottom = bbox[1] + bbox[3] / 2;\n  left /= scale;\n  top /= scale;\n  right /= scale;\n  bottom /= scale;\n  return cv::Rect(round(left), round(top), round(right - left), round(bottom - top));\n}\n\nstd::vector<cv::Mat> process_mask(const float* proto, int proto_size, std::vector<Detection>& dets) {\n  std::vector<cv::Mat> masks;\n  for (size_t i = 0; i < dets.size(); 
i++) {\n    cv::Mat mask_mat = cv::Mat::zeros(kInputH / 4, kInputW / 4, CV_32FC1);\n    auto r = get_downscale_rect(dets[i].bbox, 4);\n    for (int x = r.x; x < r.x + r.width; x++) {\n      for (int y = r.y; y < r.y + r.height; y++) {\n        float e = 0.0f;\n        for (int j = 0; j < 32; j++) {\n          e += dets[i].mask[j] * proto[j * proto_size / 32 + y * mask_mat.cols + x];\n        }\n        e = 1.0f / (1.0f + expf(-e));\n        mask_mat.at<float>(y, x) = e;\n      }\n    }\n    cv::resize(mask_mat, mask_mat, cv::Size(kInputW, kInputH));\n    masks.push_back(mask_mat);\n  }\n  return masks;\n}\n\ncv::Mat scale_mask(cv::Mat mask, cv::Mat img) {\n  int x, y, w, h;\n  float r_w = kInputW / (img.cols * 1.0);\n  float r_h = kInputH / (img.rows * 1.0);\n  if (r_h > r_w) {\n    w = kInputW;\n    h = r_w * img.rows;\n    x = 0;\n    y = (kInputH - h) / 2;\n  } else {\n    w = r_h * img.cols;\n    h = kInputH;\n    x = (kInputW - w) / 2;\n    y = 0;\n  }\n  cv::Rect r(x, y, w, h);\n  cv::Mat res;\n  cv::resize(mask(r), res, img.size());\n  return res;\n}\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks, std::unordered_map<int, std::string>& labels_map) {\n  static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A,\n                                         0x92CC17, 0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF,\n                                         0x344593, 0x6473FF, 0x0018EC, 0x8438FF, 0x520085, 0xCB38FF,\n                                         0xFF95C8, 0xFF37C7};\n  for (size_t i = 0; i < dets.size(); i++) {\n    cv::Mat img_mask = scale_mask(masks[i], img);\n    auto color = colors[(int)dets[i].class_id % colors.size()];\n    auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n\n    cv::Rect r = get_rect(img, dets[i].bbox);\n    for (int x = r.x; x < r.x + r.width; x++) {\n      for (int y = r.y; y < r.y + r.height; y++) {\n        float 
val = img_mask.at<float>(y, x);\n        if (val <= 0.5) continue;\n        img.at<cv::Vec3b>(y, x)[0] = img.at<cv::Vec3b>(y, x)[0] / 2 + bgr[0] / 2;\n        img.at<cv::Vec3b>(y, x)[1] = img.at<cv::Vec3b>(y, x)[1] / 2 + bgr[1] / 2;\n        img.at<cv::Vec3b>(y, x)[2] = img.at<cv::Vec3b>(y, x)[2] / 2 + bgr[2] / 2;\n      }\n    }\n\n    cv::rectangle(img, r, bgr, 2);\n    \n    // Get the size of the text\n    cv::Size textSize = cv::getTextSize(labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf), cv::FONT_HERSHEY_PLAIN, 1.2, 2, NULL);\n    // Set the top left corner of the rectangle\n    cv::Point topLeft(r.x, r.y - textSize.height);\n\n    // Set the bottom right corner of the rectangle\n    cv::Point bottomRight(r.x + textSize.width, r.y + textSize.height);\n\n    // Set the thickness of the rectangle lines\n    int lineThickness = 2;\n\n    // Draw the rectangle on the image\n    cv::rectangle(img, topLeft, bottomRight, bgr, -1);\n\n    cv::putText(img, labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf), cv::Point(r.x, r.y + 4), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar::all(0xFF), 2);\n\n  }\n}\n\n"
  },
  {
    "path": "yolov5/src/postprocess.h",
    "content": "#pragma once\n\n#include \"types.h\"\n#include <opencv2/opencv.hpp>\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]);\n\nvoid nms(std::vector<Detection>& res, float *output, float conf_thresh, float nms_thresh = 0.5);\n\nvoid batch_nms(std::vector<std::vector<Detection>>& batch_res, float *output, int batch_size, int output_size, float conf_thresh, float nms_thresh = 0.5);\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nstd::vector<cv::Mat> process_mask(const float* proto, int proto_size, std::vector<Detection>& dets);\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks, std::unordered_map<int, std::string>& labels_map);\n"
  },
  {
    "path": "yolov5/src/preprocess.cu",
    "content": "#include \"preprocess.h\"\n#include \"cuda_utils.h\"\n\nstatic uint8_t* img_buffer_host = nullptr;\nstatic uint8_t* img_buffer_device = nullptr;\n\nstruct AffineMatrix {\n  float value[6];\n};\n\n__global__ void warpaffine_kernel(\n    uint8_t* src, int src_line_size, int src_width,\n    int src_height, float* dst, int dst_width,\n    int dst_height, uint8_t const_value_st,\n    AffineMatrix d2s, int edge) {\n  int position = blockDim.x * blockIdx.x + threadIdx.x;\n  if (position >= edge) return;\n\n  float m_x1 = d2s.value[0];\n  float m_y1 = d2s.value[1];\n  float m_z1 = d2s.value[2];\n  float m_x2 = d2s.value[3];\n  float m_y2 = d2s.value[4];\n  float m_z2 = d2s.value[5];\n\n  int dx = position % dst_width;\n  int dy = position / dst_width;\n  float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;\n  float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;\n  float c0, c1, c2;\n\n  if (src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height) {\n    // out of range\n    c0 = const_value_st;\n    c1 = const_value_st;\n    c2 = const_value_st;\n  } else {\n    int y_low = floorf(src_y);\n    int x_low = floorf(src_x);\n    int y_high = y_low + 1;\n    int x_high = x_low + 1;\n\n    uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};\n    float ly = src_y - y_low;\n    float lx = src_x - x_low;\n    float hy = 1 - ly;\n    float hx = 1 - lx;\n    float w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n    uint8_t* v1 = const_value;\n    uint8_t* v2 = const_value;\n    uint8_t* v3 = const_value;\n    uint8_t* v4 = const_value;\n\n    if (y_low >= 0) {\n      if (x_low >= 0)\n        v1 = src + y_low * src_line_size + x_low * 3;\n\n      if (x_high < src_width)\n        v2 = src + y_low * src_line_size + x_high * 3;\n    }\n\n    if (y_high < src_height) {\n      if (x_low >= 0)\n        v3 = src + y_high * src_line_size + x_low * 3;\n\n      if (x_high < src_width)\n        v4 = src + y_high * src_line_size + x_high 
* 3;\n    }\n\n    c0 = w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0];\n    c1 = w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1];\n    c2 = w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2];\n  }\n\n  // bgr to rgb \n  float t = c2;\n  c2 = c0;\n  c0 = t;\n\n  // normalization\n  c0 = c0 / 255.0f;\n  c1 = c1 / 255.0f;\n  c2 = c2 / 255.0f;\n\n  // rgbrgbrgb to rrrgggbbb\n  int area = dst_width * dst_height;\n  float* pdst_c0 = dst + dy * dst_width + dx;\n  float* pdst_c1 = pdst_c0 + area;\n  float* pdst_c2 = pdst_c1 + area;\n  *pdst_c0 = c0;\n  *pdst_c1 = c1;\n  *pdst_c2 = c2;\n}\n\nvoid cuda_preprocess(\n    uint8_t* src, int src_width, int src_height,\n    float* dst, int dst_width, int dst_height,\n    cudaStream_t stream) {\n\n  int img_size = src_width * src_height * 3;\n  // copy data to pinned memory\n  memcpy(img_buffer_host, src, img_size);\n  // copy data to device memory\n  CUDA_CHECK(cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size, cudaMemcpyHostToDevice, stream));\n\n  AffineMatrix s2d, d2s;\n  float scale = std::min(dst_height / (float)src_height, dst_width / (float)src_width);\n\n  s2d.value[0] = scale;\n  s2d.value[1] = 0;\n  s2d.value[2] = -scale * src_width  * 0.5  + dst_width * 0.5;\n  s2d.value[3] = 0;\n  s2d.value[4] = scale;\n  s2d.value[5] = -scale * src_height * 0.5 + dst_height * 0.5;\n\n  cv::Mat m2x3_s2d(2, 3, CV_32F, s2d.value);\n  cv::Mat m2x3_d2s(2, 3, CV_32F, d2s.value);\n  cv::invertAffineTransform(m2x3_s2d, m2x3_d2s);\n\n  memcpy(d2s.value, m2x3_d2s.ptr<float>(0), sizeof(d2s.value));\n\n  int jobs = dst_height * dst_width;\n  int threads = 256;\n  int blocks = ceil(jobs / (float)threads);\n\n  warpaffine_kernel<<<blocks, threads, 0, stream>>>(\n      img_buffer_device, src_width * 3, src_width,\n      src_height, dst, dst_width,\n      dst_height, 128, d2s, jobs);\n}\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch,\n                           float* dst, int dst_width, int dst_height,\n                
           cudaStream_t stream) {\n  int dst_size = dst_width * dst_height * 3;\n  for (size_t i = 0; i < img_batch.size(); i++) {\n    cuda_preprocess(img_batch[i].ptr(), img_batch[i].cols, img_batch[i].rows, &dst[dst_size * i], dst_width, dst_height, stream);\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n  }\n}\n\nvoid cuda_preprocess_init(int max_image_size) {\n  // prepare input data in pinned memory\n  CUDA_CHECK(cudaMallocHost((void**)&img_buffer_host, max_image_size * 3));\n  // prepare input data in device memory\n  CUDA_CHECK(cudaMalloc((void**)&img_buffer_device, max_image_size * 3));\n}\n\nvoid cuda_preprocess_destroy() {\n  CUDA_CHECK(cudaFree(img_buffer_device));\n  CUDA_CHECK(cudaFreeHost(img_buffer_host));\n}\n\n"
  },
  {
    "path": "yolov5/src/preprocess.h",
    "content": "#pragma once\n\n#include <cuda_runtime.h>\n#include <cstdint>\n#include <opencv2/opencv.hpp>\n\nvoid cuda_preprocess_init(int max_image_size);\nvoid cuda_preprocess_destroy();\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height,\n                     float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream);\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch,\n                           float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream);\n\n"
  },
  {
    "path": "yolov5/src/types.h",
    "content": "#pragma once\n\n#include \"config.h\"\n\nstruct YoloKernel {\n  int width;\n  int height;\n  float anchors[kNumAnchor * 2];\n};\n\nstruct alignas(float) Detection {\n  float bbox[4];  // center_x center_y w h\n  float conf;  // bbox_conf * cls_conf\n  float class_id;\n  float mask[32];\n};\n\n"
  },
  {
    "path": "yolov5/src/utils.h",
    "content": "#pragma once\n\n#include <dirent.h>\n#include <fstream>\n#include <unordered_map>\n#include <string>\n#include <sstream>\n#include <vector>\n#include <cstring>\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n// Function to trim leading and trailing whitespace from a string\nstatic inline std::string trim_leading_whitespace(const std::string& str) {\n    size_t first = str.find_first_not_of(' ');\n    if (std::string::npos == first) {\n        return str;\n    }\n    size_t last = str.find_last_not_of(' ');\n    return str.substr(first, (last - first + 1));\n}\n\n// Src: https://stackoverflow.com/questions/16605967\nstatic inline std::string to_string_with_precision(const float a_value, const int n = 2) {\n    std::ostringstream out;\n    out.precision(n);\n    out << std::fixed << a_value;\n    return out.str();\n}\n\nstatic inline int read_labels(const std::string labels_filename, std::unordered_map<int, std::string>& labels_map) {\n\n    std::ifstream file(labels_filename);\n    // Read each line of the file\n    std::string line;\n    int index = 0;\n    while (std::getline(file, line)) {\n        // Strip the line of any leading or trailing whitespace\n        line = trim_leading_whitespace(line);\n\n        // Add the stripped line to the labels_map, using the loop index as the key\n        
labels_map[index] = line;\n        index++;\n    }\n    // Close the file\n    file.close();\n\n    return 0;\n}\n\n"
  },
  {
    "path": "yolov5/yolov5_cls.cpp",
    "content": "#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"utils.h\"\n#include \"model.h\"\n#include \"config.h\"\n#include \"calibrator.h\"\n\n#include <iostream>\n#include <chrono>\n#include <cmath>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\nconst static int kOutputSize = kClsNumClass;\n\nvoid batch_preprocess(std::vector<cv::Mat>& imgs, float* output) {\n  for (size_t b = 0; b < imgs.size(); b++) {\n    cv::Mat img;\n    // cv::resize(imgs[b], img, cv::Size(kClsInputW, kClsInputH));\n    img = preprocess_img(imgs[b], kClsInputW, kClsInputH);\n    int i = 0;\n    for (int row = 0; row < img.rows; ++row) {\n      uchar* uc_pixel = img.data + row * img.step;\n      for (int col = 0; col < img.cols; ++col) {\n        output[b * 3 * img.rows * img.cols  + i] = ((float)uc_pixel[2] / 255.0 - 0.485) / 0.229;  // R - 0.485\n        output[b * 3 * img.rows * img.cols + i + img.rows * img.cols] = ((float)uc_pixel[1] / 255.0 - 0.456) / 0.224;\n        output[b * 3 * img.rows * img.cols + i + 2 * img.rows * img.cols] = ((float)uc_pixel[0] / 255.0 - 0.406) / 0.225;\n        uc_pixel += 3;\n        ++i;\n      }\n    }\n  }\n}\n\nstd::vector<float> softmax(float *prob, int n) {\n  std::vector<float> res;\n  float sum = 0.0f;\n  float t;\n  for (int i = 0; i < n; i++) {\n    t = expf(prob[i]);\n    res.push_back(t);\n    sum += t;\n  }\n  for (int i = 0; i < n; i++) {\n    res[i] /= sum;\n  }\n  return res;\n}\n\nstd::vector<int> topk(const std::vector<float>& vec, int k) {\n  std::vector<int> topk_index;\n  std::vector<size_t> vec_index(vec.size());\n  std::iota(vec_index.begin(), vec_index.end(), 0);\n\n  std::sort(vec_index.begin(), vec_index.end(), [&vec](size_t index_1, size_t index_2) { return vec[index_1] > vec[index_2]; });\n\n  int k_num = std::min<int>(vec.size(), k);\n\n  for (int i = 0; i < k_num; ++i) {\n    topk_index.push_back(vec_index[i]);\n  }\n\n  return 
topk_index;\n}\n\nstd::vector<std::string> read_classes(std::string file_name) {\n  std::vector<std::string> classes;\n  std::ifstream ifs(file_name, std::ios::in);\n  if (!ifs.is_open()) {\n    std::cerr << file_name << \" is not found, pls refer to README and download it.\" << std::endl;\n    assert(0);\n  }\n  std::string s;\n  while (std::getline(ifs, s)) {\n    classes.push_back(s);\n  }\n  ifs.close();\n  return classes;\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, float& gd, float& gw, std::string& img_dir) {\n  if (argc < 4) return false;\n  if (std::string(argv[1]) == \"-s\" && (argc == 5 || argc == 7)) {\n    wts = std::string(argv[2]);\n    engine = std::string(argv[3]);\n    auto net = std::string(argv[4]);\n    if (net[0] == 'n') {\n      gd = 0.33;\n      gw = 0.25;\n    } else if (net[0] == 's') {\n      gd = 0.33;\n      gw = 0.50;\n    } else if (net[0] == 'm') {\n      gd = 0.67;\n      gw = 0.75;\n    } else if (net[0] == 'l') {\n      gd = 1.0;\n      gw = 1.0;\n    } else if (net[0] == 'x') {\n      gd = 1.33;\n      gw = 1.25;\n    } else if (net[0] == 'c' && argc == 7) {\n      gd = atof(argv[5]);\n      gw = atof(argv[6]);\n    } else {\n      return false;\n    }\n  } else if (std::string(argv[1]) == \"-d\" && argc == 4) {\n    engine = std::string(argv[2]);\n    img_dir = std::string(argv[3]);\n  } else {\n    return false;\n  }\n  return true;\n}\n\nvoid prepare_buffers(ICudaEngine* engine, float** gpu_input_buffer, float** gpu_output_buffer, float** cpu_input_buffer, float** cpu_output_buffer) {\n  assert(engine->getNbBindings() == 2);\n  // In order to bind the buffers, we need to know the names of the input and output tensors.\n  // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n  const int inputIndex = engine->getBindingIndex(kInputTensorName);\n  const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n  assert(inputIndex == 0);\n  assert(outputIndex == 
1);\n  // Create GPU buffers on device\n  CUDA_CHECK(cudaMalloc((void**)gpu_input_buffer, kBatchSize * 3 * kClsInputH * kClsInputW * sizeof(float)));\n  CUDA_CHECK(cudaMalloc((void**)gpu_output_buffer, kBatchSize * kOutputSize * sizeof(float)));\n\n  *cpu_input_buffer = new float[kBatchSize * 3 * kClsInputH * kClsInputW];\n  *cpu_output_buffer = new float[kBatchSize * kOutputSize];\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void **buffers, float* input, float* output, int batchSize) {\n  CUDA_CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 * kClsInputH * kClsInputW * sizeof(float), cudaMemcpyHostToDevice, stream));\n  context.enqueue(batchSize, buffers, stream, nullptr);\n  CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost, stream));\n  cudaStreamSynchronize(stream);\n}\n\nvoid serialize_engine(unsigned int max_batchsize, float& gd, float& gw, std::string& wts_name, std::string& engine_name) {\n  // Create builder\n  IBuilder* builder = createInferBuilder(gLogger);\n  IBuilderConfig* config = builder->createBuilderConfig();\n\n  // Create model to populate the network, then set the outputs and create an engine\n  ICudaEngine *engine = nullptr;\n\n  engine = build_cls_engine(max_batchsize, builder, config, DataType::kFLOAT, gd, gw, wts_name);\n\n  assert(engine != nullptr);\n\n  // Serialize the engine\n  IHostMemory* serialized_engine = engine->serialize();\n  assert(serialized_engine != nullptr);\n\n  // Save engine to file\n  std::ofstream p(engine_name, std::ios::binary);\n  if (!p) {\n    std::cerr << \"Could not open plan output file\" << std::endl;\n    assert(false);\n  }\n  p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n  // Close everything down\n  engine->destroy();\n  config->destroy();\n  serialized_engine->destroy();\n  builder->destroy();\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** 
runtime, ICudaEngine** engine, IExecutionContext** context) {\n  std::ifstream file(engine_name, std::ios::binary);\n  if (!file.good()) {\n    std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n    assert(false);\n  }\n  size_t size = 0;\n  file.seekg(0, file.end);\n  size = file.tellg();\n  file.seekg(0, file.beg);\n  char* serialized_engine = new char[size];\n  assert(serialized_engine);\n  file.read(serialized_engine, size);\n  file.close();\n\n  *runtime = createInferRuntime(gLogger);\n  assert(*runtime);\n  *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n  assert(*engine);\n  *context = (*engine)->createExecutionContext();\n  assert(*context);\n  delete[] serialized_engine;\n}\n\nint main(int argc, char** argv) {\n  cudaSetDevice(kGpuId);\n\n  std::string wts_name = \"\";\n  std::string engine_name = \"\";\n  float gd = 0.0f, gw = 0.0f;\n  std::string img_dir;\n\n  if (!parse_args(argc, argv, wts_name, engine_name, gd, gw, img_dir)) {\n    std::cerr << \"arguments not right!\" << std::endl;\n    std::cerr << \"./yolov5_cls -s [.wts] [.engine] [n/s/m/l/x or c gd gw]  // serialize model to plan file\" << std::endl;\n    std::cerr << \"./yolov5_cls -d [.engine] ../images  // deserialize plan file and run inference\" << std::endl;\n    return -1;\n  }\n\n  // Create a model using the API directly and serialize it to a file\n  if (!wts_name.empty()) {\n    serialize_engine(kBatchSize, gd, gw, wts_name, engine_name);\n    return 0;\n  }\n\n  // Deserialize the engine from file\n  IRuntime* runtime = nullptr;\n  ICudaEngine* engine = nullptr;\n  IExecutionContext* context = nullptr;\n  deserialize_engine(engine_name, &runtime, &engine, &context);\n  cudaStream_t stream;\n  CUDA_CHECK(cudaStreamCreate(&stream));\n\n  // Prepare cpu and gpu buffers\n  float* gpu_buffers[2];\n  float* cpu_input_buffer = nullptr;\n  float* cpu_output_buffer = nullptr;\n  prepare_buffers(engine, &gpu_buffers[0], &gpu_buffers[1], &cpu_input_buffer, 
&cpu_output_buffer);\n\n  // Read images from directory\n  std::vector<std::string> file_names;\n  if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n    std::cerr << \"read_files_in_dir failed.\" << std::endl;\n    return -1;\n  }\n\n  // Read imagenet labels\n  auto classes = read_classes(\"imagenet_classes.txt\");\n\n  // batch predict\n  for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n    // Get a batch of images\n    std::vector<cv::Mat> img_batch;\n    std::vector<std::string> img_name_batch;\n    for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n      cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n      img_batch.push_back(img);\n      img_name_batch.push_back(file_names[j]);\n    }\n\n    // Preprocess\n    batch_preprocess(img_batch, cpu_input_buffer);\n\n    // Run inference\n    auto start = std::chrono::system_clock::now();\n    infer(*context, stream, (void**)gpu_buffers, cpu_input_buffer, cpu_output_buffer, kBatchSize);\n    auto end = std::chrono::system_clock::now();\n    std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n    // Postprocess and get top-k result\n    for (size_t b = 0; b < img_name_batch.size(); b++) {\n      float* p = &cpu_output_buffer[b * kOutputSize];\n      auto res = softmax(p, kOutputSize);\n      auto topk_idx = topk(res, 3);\n      std::cout << img_name_batch[b] << std::endl;\n      for (auto idx: topk_idx) {\n        std::cout << \"  \" << classes[idx] << \" \" << res[idx] << std::endl;\n      }\n    }\n  }\n\n  // Release stream and buffers\n  cudaStreamDestroy(stream);\n  CUDA_CHECK(cudaFree(gpu_buffers[0]));\n  CUDA_CHECK(cudaFree(gpu_buffers[1]));\n  delete[] cpu_input_buffer;\n  delete[] cpu_output_buffer;\n  // Destroy the engine\n  context->destroy();\n  engine->destroy();\n  runtime->destroy();\n\n  return 0;\n}\n\n"
  },
  {
    "path": "yolov5/yolov5_cls_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport os\nimport shutil\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport torch\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\nwith open(\"imagenet_classes.txt\") as f:\n    classes = [line.strip() for line in f.readlines()]\n\n\nclass YoLov5TRT(object):\n    \"\"\"\n    description: A YOLOv5 class that warps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n        self.mean = (0.485, 0.456, 0.406)\n        self.std = (0.229, 0.224, 0.225)\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(\n                binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, 
dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        engine = self.engine\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_input_image = np.empty(\n            shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            batch_image_raw.append(image_raw)\n            input_image = self.preprocess_cls_image(image_raw)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        
np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size,\n                              bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # host_outputs[0] holds the flattened outputs for the whole batch\n        output = host_outputs[0]\n        # Do postprocess once for the batch, then label each image with its own top-1 class.\n        # Iterate over the images actually read, since the last batch may be smaller than batch_size.\n        classes_ls, predicted_conf_ls, category_id_ls = self.postprocess_cls(\n            output)\n        for i in range(len(batch_image_raw)):\n            cv2.putText(batch_image_raw[i], str(\n                classes_ls[i]), (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 1, cv2.LINE_AA)\n            print(classes_ls[i], predicted_conf_ls[i])\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_cls_image(self, input_img):\n        im = cv2.cvtColor(input_img, cv2.COLOR_BGR2RGB)\n        # cv2.resize expects dsize as (width, height)\n        im = cv2.resize(im, (self.input_w, self.input_h))\n        im = 
np.float32(im)\n        im /= 255.0\n        im -= self.mean\n        im /= self.std\n        im = im.transpose(2, 0, 1)\n        # prepare batch\n        batch_data = np.expand_dims(im, axis=0)\n        return batch_data\n\n    def postprocess_cls(self, output_data):\n        classes_ls = []\n        predicted_conf_ls = []\n        category_id_ls = []\n        output_data = output_data.reshape(self.batch_size, -1)\n        output_data = torch.Tensor(output_data)\n        p = torch.nn.functional.softmax(output_data, dim=1)\n        score, index = torch.topk(p, 3)\n        for ind in range(index.shape[0]):\n            input_category_id = index[ind][0].item()  # 716\n            category_id_ls.append(input_category_id)\n            predicted_confidence = score[ind][0].item()\n            predicted_conf_ls.append(predicted_confidence)\n            classes_ls.append(classes[input_category_id])\n        return classes_ls, predicted_conf_ls, category_id_ls\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov5_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov5_wrapper = yolov5_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov5_wrapper.infer(\n            self.yolov5_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(\n            self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov5_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov5_wrapper = yolov5_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = 
self.yolov5_wrapper.infer(\n            self.yolov5_wrapper.get_raw_image_zeros())\n        print(\n            'warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    engine_file_path = \"build/yolov5s-cls.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov5TRT instance\n    yolov5_wrapper = YoLov5TRT(engine_file_path)\n    try:\n        print('batch size is', yolov5_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(\n            yolov5_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov5_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov5_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov5_wrapper.destroy()\n"
  },
  {
    "path": "yolov5/yolov5_det.cpp",
    "content": "#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"utils.h\"\n#include \"preprocess.h\"\n#include \"postprocess.h\"\n#include \"model.h\"\n\n#include <iostream>\n#include <chrono>\n#include <cmath>\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\nconst static int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, bool& is_p6, float& gd, float& gw, std::string& img_dir) {\n  if (argc < 4) return false;\n  if (std::string(argv[1]) == \"-s\" && (argc == 5 || argc == 7)) {\n    wts = std::string(argv[2]);\n    engine = std::string(argv[3]);\n    auto net = std::string(argv[4]);\n    if (net[0] == 'n') {\n      gd = 0.33;\n      gw = 0.25;\n    } else if (net[0] == 's') {\n      gd = 0.33;\n      gw = 0.50;\n    } else if (net[0] == 'm') {\n      gd = 0.67;\n      gw = 0.75;\n    } else if (net[0] == 'l') {\n      gd = 1.0;\n      gw = 1.0;\n    } else if (net[0] == 'x') {\n      gd = 1.33;\n      gw = 1.25;\n    } else if (net[0] == 'c' && argc == 7) {\n      gd = atof(argv[5]);\n      gw = atof(argv[6]);\n    } else {\n      return false;\n    }\n    if (net.size() == 2 && net[1] == '6') {\n      is_p6 = true;\n    }\n  } else if (std::string(argv[1]) == \"-d\" && argc == 4) {\n    engine = std::string(argv[2]);\n    img_dir = std::string(argv[3]);\n  } else {\n    return false;\n  }\n  return true;\n}\n\nvoid prepare_buffers(ICudaEngine* engine, float** gpu_input_buffer, float** gpu_output_buffer, float** cpu_output_buffer) {\n  assert(engine->getNbBindings() == 2);\n  // In order to bind the buffers, we need to know the names of the input and output tensors.\n  // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n  const int inputIndex = engine->getBindingIndex(kInputTensorName);\n  const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n  assert(inputIndex == 0);\n  assert(outputIndex == 1);\n 
 // Create GPU buffers on device\n  CUDA_CHECK(cudaMalloc((void**)gpu_input_buffer, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n  CUDA_CHECK(cudaMalloc((void**)gpu_output_buffer, kBatchSize * kOutputSize * sizeof(float)));\n\n  *cpu_output_buffer = new float[kBatchSize * kOutputSize];\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** gpu_buffers, float* output, int batchsize) {\n  context.enqueue(batchsize, gpu_buffers, stream, nullptr);\n  CUDA_CHECK(cudaMemcpyAsync(output, gpu_buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost, stream));\n  cudaStreamSynchronize(stream);\n}\n\nvoid serialize_engine(unsigned int max_batchsize, bool& is_p6, float& gd, float& gw, std::string& wts_name, std::string& engine_name) {\n  // Create builder\n  IBuilder* builder = createInferBuilder(gLogger);\n  IBuilderConfig* config = builder->createBuilderConfig();\n\n  // Create model to populate the network, then set the outputs and create an engine\n  ICudaEngine *engine = nullptr;\n  if (is_p6) {\n    engine = build_det_p6_engine(max_batchsize, builder, config, DataType::kFLOAT, gd, gw, wts_name);\n  } else {\n    engine = build_det_engine(max_batchsize, builder, config, DataType::kFLOAT, gd, gw, wts_name);\n  }\n  assert(engine != nullptr);\n\n  // Serialize the engine\n  IHostMemory* serialized_engine = engine->serialize();\n  assert(serialized_engine != nullptr);\n\n  // Save engine to file\n  std::ofstream p(engine_name, std::ios::binary);\n  if (!p) {\n    std::cerr << \"Could not open plan output file\" << std::endl;\n    assert(false);\n  }\n  p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n  // Close everything down\n  engine->destroy();\n  config->destroy();\n  serialized_engine->destroy();\n  builder->destroy();\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine, IExecutionContext** context) {\n  std::ifstream 
file(engine_name, std::ios::binary);\n  if (!file.good()) {\n    std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n    assert(false);\n  }\n  size_t size = 0;\n  file.seekg(0, file.end);\n  size = file.tellg();\n  file.seekg(0, file.beg);\n  char* serialized_engine = new char[size];\n  assert(serialized_engine);\n  file.read(serialized_engine, size);\n  file.close();\n\n  *runtime = createInferRuntime(gLogger);\n  assert(*runtime);\n  *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n  assert(*engine);\n  *context = (*engine)->createExecutionContext();\n  assert(*context);\n  delete[] serialized_engine;\n}\n\nint main(int argc, char** argv) {\n  cudaSetDevice(kGpuId);\n\n  std::string wts_name = \"\";\n  std::string engine_name = \"\";\n  bool is_p6 = false;\n  float gd = 0.0f, gw = 0.0f;\n  std::string img_dir;\n\n  if (!parse_args(argc, argv, wts_name, engine_name, is_p6, gd, gw, img_dir)) {\n    std::cerr << \"arguments not right!\" << std::endl;\n    std::cerr << \"./yolov5_det -s [.wts] [.engine] [n/s/m/l/x/n6/s6/m6/l6/x6 or c/c6 gd gw]  // serialize model to plan file\" << std::endl;\n    std::cerr << \"./yolov5_det -d [.engine] ../images  // deserialize plan file and run inference\" << std::endl;\n    return -1;\n  }\n\n  // Create a model using the API directly and serialize it to a file\n  if (!wts_name.empty()) {\n    serialize_engine(kBatchSize, is_p6, gd, gw, wts_name, engine_name);\n    return 0;\n  }\n\n  // Deserialize the engine from file\n  IRuntime* runtime = nullptr;\n  ICudaEngine* engine = nullptr;\n  IExecutionContext* context = nullptr;\n  deserialize_engine(engine_name, &runtime, &engine, &context);\n  cudaStream_t stream;\n  CUDA_CHECK(cudaStreamCreate(&stream));\n\n  // Init CUDA preprocessing\n  cuda_preprocess_init(kMaxInputImageSize);\n\n  // Prepare cpu and gpu buffers\n  float* gpu_buffers[2];\n  float* cpu_output_buffer = nullptr;\n  prepare_buffers(engine, &gpu_buffers[0], &gpu_buffers[1], 
&cpu_output_buffer);\n\n  // Read images from directory\n  std::vector<std::string> file_names;\n  if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n    std::cerr << \"read_files_in_dir failed.\" << std::endl;\n    return -1;\n  }\n\n  // batch predict\n  for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n    // Get a batch of images\n    std::vector<cv::Mat> img_batch;\n    std::vector<std::string> img_name_batch;\n    for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n      cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n      img_batch.push_back(img);\n      img_name_batch.push_back(file_names[j]);\n    }\n\n    // Preprocess\n    cuda_batch_preprocess(img_batch, gpu_buffers[0], kInputW, kInputH, stream);\n\n    // Run inference\n    auto start = std::chrono::system_clock::now();\n    infer(*context, stream, (void**)gpu_buffers, cpu_output_buffer, kBatchSize);\n    auto end = std::chrono::system_clock::now();\n    std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n    // NMS\n    std::vector<std::vector<Detection>> res_batch;\n    batch_nms(res_batch, cpu_output_buffer, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n\n    // Draw bounding boxes\n    draw_bbox(img_batch, res_batch);\n\n    // Save images\n    for (size_t j = 0; j < img_batch.size(); j++) {\n      cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n    }\n  }\n\n  // Release stream and buffers\n  cudaStreamDestroy(stream);\n  CUDA_CHECK(cudaFree(gpu_buffers[0]));\n  CUDA_CHECK(cudaFree(gpu_buffers[1]));\n  delete[] cpu_output_buffer;\n  cuda_preprocess_destroy();\n  // Destroy the engine\n  context->destroy();\n  engine->destroy();\n  runtime->destroy();\n\n  // Print histogram of the output distribution\n  // std::cout << \"\\nOutput:\\n\\n\";\n  // for (unsigned int i = 0; i < kOutputSize; i++) {\n  //   std::cout << prob[i] << \", \";\n  //   
if (i % 10 == 0) std::cout << std::endl;\n  // }\n  // std::cout << std::endl;\n\n  return 0;\n}\n\n"
  },
  {
    "path": "yolov5/yolov5_det_cuda_python.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nfrom cuda import cudart\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov5 project.\n    param:\n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n        )\n\n\nclass 
YoLov5TRT(object):\n    \"\"\"\n    description: A YOLOv5 class that wraps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n        # Create a stream on this device\n        _, stream = cudart.cudaStreamCreate()\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = np.empty(size, dtype=dtype)\n            _, cuda_mem = cudart.cudaMallocAsync(host_mem.nbytes, stream)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = 
bindings\n        self.batch_size = engine.max_batch_size\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Restore\n        stream = self.stream\n        context = self.context\n        engine = self.engine\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cudart.cudaMemcpyAsync(cuda_inputs[0], host_inputs[0].ctypes.data, host_inputs[0].nbytes,\n                               cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream)\n        # Transfer predictions back from the GPU.\n        cudart.cudaMemcpyAsync(host_outputs[0].ctypes.data, cuda_outputs[0], host_outputs[0].nbytes,\n                               cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)\n        # Synchronize the stream\n        cudart.cudaStreamSynchronize(stream)\n        end = time.time()\n        # Here we use the first row of output in that batch_size = 1\n        
output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid = self.post_process(\n                output[i * 6001: (i + 1) * 6001], batch_origin_h[i], batch_origin_w[i]\n            )\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any stream and cuda mem\n        cudart.cudaStreamDestroy(self.stream)\n        cudart.cudaFree(self.cuda_inputs[0])\n        cudart.cudaFree(self.cuda_outputs[0])\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            input_image_path: str, image path\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = 
cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate the resized width and height and the paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image with long side while maintaining ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2\n            y[:, 1] = 
x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A numpy array like [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a boxes numpy, each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a numpy, each element is the score corresponding to a box\n            result_classid: final class ids, a numpy, each element is the class id corresponding to a box\n        \"\"\"\n        # Get the num of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, 6))[:num, :]\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, 
x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None) * \\\n                     np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a 
confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Get the boxes whose score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # Clip the coordinates to the image bounds\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, -1] == boxes[:, -1]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov5_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov5_wrapper = yolov5_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov5_wrapper.infer(self.yolov5_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = 
os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov5_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov5_wrapper = yolov5_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov5_wrapper.infer(self.yolov5_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\n    engine_file_path = \"build/yolov5s.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n    cudart.cudaDeviceSynchronize()\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\n                  \"traffic light\",\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\n                  \"frisbee\",\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\n                  \"surfboard\",\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n                  \"potted plant\", \"bed\", \"dining table\", 
\"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\n                  \"cell phone\",\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\n                  \"teddy bear\",\n                  \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov5TRT instance\n    yolov5_wrapper = YoLov5TRT(engine_file_path)\n    try:\n        print('batch size is', yolov5_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolov5_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov5_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov5_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov5_wrapper.destroy()\n"
  },
  {
    "path": "yolov5/yolov5_det_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\nLEN_ALL_RESULT = 38001\nLEN_ONE_RESULT = 38\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov5 project.\n    param: \n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n        line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            thickness=tf,\n 
           lineType=cv2.LINE_AA,\n        )\n\n\nclass YoLov5TRT(object):\n    \"\"\"\n    description: A YOLOv5 class that warps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('bingding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = 
host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        engine = self.engine\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use 
the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid = self.post_process(\n                output[i * LEN_ALL_RESULT: (i + 1) * LEN_ALL_RESULT], batch_origin_h[i], batch_origin_w[i]\n            )\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        \n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n        \n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            input_image_path: str, image path\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        
image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate widht and height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image with long side while maintaining ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2\n            y[:, 
1] = x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A numpy likes [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...] \n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: finally boxes, a boxes numpy, each row is a box [x1, y1, x2, y2]\n            result_scores: finally scores, a numpy, each element is the score correspoing to box\n            result_classid: finally classid, a numpy, each element is the classid correspoing to box\n        \"\"\"\n        # Get the num of boxes detected\n        num = int(output[0])\n        # Reshape to a two dimentional ndarray\n        pred = np.reshape(output[1:], (-1, LEN_ONE_RESULT))[:num, :]\n        pred = pred[:, :6]\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n      
      box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))            \n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None) * \\\n                     np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\n            origin_h: original image height\n            origin_w: 
original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Get the boxes whose score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # Clip the coordinates to the image bounds\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, -1] == boxes[:, -1]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov5_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov5_wrapper = yolov5_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov5_wrapper.infer(self.yolov5_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = 
os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov5_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov5_wrapper = yolov5_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov5_wrapper.infer(self.yolov5_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\n    engine_file_path = \"build/yolov5s.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n            \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n            \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n            \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n            \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n            \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n            \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n  
          \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n            \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov5TRT instance\n    yolov5_wrapper = YoLov5TRT(engine_file_path)\n    try:\n        print('batch size is', yolov5_wrapper.batch_size)\n        \n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolov5_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov5_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov5_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov5_wrapper.destroy()\n"
  },
  {
    "path": "yolov5/yolov5_seg.cpp",
    "content": "#include \"config.h\"\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"utils.h\"\n#include \"preprocess.h\"\n#include \"postprocess.h\"\n#include \"model.h\"\n\n#include <iostream>\n#include <chrono>\n#include <cmath>\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\nconst static int kOutputSize1 = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\nconst static int kOutputSize2 = 32 * (kInputH / 4) * (kInputW / 4);\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, float& gd, float& gw, std::string& img_dir, std::string& labels_filename) {\n    if (argc < 4) return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5 || argc == 7)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        auto net = std::string(argv[4]);\n        if (net[0] == 'n') {\n            gd = 0.33;\n            gw = 0.25;\n        } else if (net[0] == 's') {\n            gd = 0.33;\n            gw = 0.50;\n        } else if (net[0] == 'm') {\n            gd = 0.67;\n            gw = 0.75;\n        } else if (net[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n        } else if (net[0] == 'x') {\n            gd = 1.33;\n            gw = 1.25;\n        } else if (net[0] == 'c' && argc == 7) {\n            gd = atof(argv[5]);\n            gw = atof(argv[6]);\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        labels_filename = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nvoid prepare_buffers(ICudaEngine* engine, float** gpu_input_buffer, float** gpu_output_buffer1, float** gpu_output_buffer2, float** cpu_output_buffer1, float** cpu_output_buffer2) {\n  assert(engine->getNbBindings() == 3);\n  // In order to bind the buffers, we need to know the names of the input and 
output tensors.\n  // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n  const int inputIndex = engine->getBindingIndex(kInputTensorName);\n  const int outputIndex1 = engine->getBindingIndex(kOutputTensorName);\n  const int outputIndex2 = engine->getBindingIndex(\"proto\");\n  assert(inputIndex == 0);\n  assert(outputIndex1 == 1);\n  assert(outputIndex2 == 2);\n\n  // Create GPU buffers on device\n  CUDA_CHECK(cudaMalloc((void**)gpu_input_buffer, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n  CUDA_CHECK(cudaMalloc((void**)gpu_output_buffer1, kBatchSize * kOutputSize1 * sizeof(float)));\n  CUDA_CHECK(cudaMalloc((void**)gpu_output_buffer2, kBatchSize * kOutputSize2 * sizeof(float)));\n\n  // Alloc CPU buffers\n  *cpu_output_buffer1 = new float[kBatchSize * kOutputSize1];\n  *cpu_output_buffer2 = new float[kBatchSize * kOutputSize2];\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void **buffers, float* output1, float* output2, int batchSize) {\n  context.enqueue(batchSize, buffers, stream, nullptr);\n  CUDA_CHECK(cudaMemcpyAsync(output1, buffers[1], batchSize * kOutputSize1 * sizeof(float), cudaMemcpyDeviceToHost, stream));\n  CUDA_CHECK(cudaMemcpyAsync(output2, buffers[2], batchSize * kOutputSize2 * sizeof(float), cudaMemcpyDeviceToHost, stream));\n  cudaStreamSynchronize(stream);\n}\n\nvoid serialize_engine(unsigned int max_batchsize, float& gd, float& gw, std::string& wts_name, std::string& engine_name) {\n  // Create builder\n  IBuilder* builder = createInferBuilder(gLogger);\n  IBuilderConfig* config = builder->createBuilderConfig();\n\n  // Create model to populate the network, then set the outputs and create an engine\n  ICudaEngine *engine = nullptr;\n\n  engine = build_seg_engine(max_batchsize, builder, config, DataType::kFLOAT, gd, gw, wts_name);\n\n  assert(engine != nullptr);\n\n  // Serialize the engine\n  IHostMemory* serialized_engine = engine->serialize();\n  assert(serialized_engine != 
nullptr);\n\n  // Save engine to file\n  std::ofstream p(engine_name, std::ios::binary);\n  if (!p) {\n    std::cerr << \"Could not open plan output file\" << std::endl;\n    assert(false);\n  }\n  p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n  // Close everything down\n  engine->destroy();\n  config->destroy();\n  serialized_engine->destroy();\n  builder->destroy();\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine, IExecutionContext** context) {\n  std::ifstream file(engine_name, std::ios::binary);\n  if (!file.good()) {\n    std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n    assert(false);\n  }\n  size_t size = 0;\n  file.seekg(0, file.end);\n  size = file.tellg();\n  file.seekg(0, file.beg);\n  char* serialized_engine = new char[size];\n  assert(serialized_engine);\n  file.read(serialized_engine, size);\n  file.close();\n\n  *runtime = createInferRuntime(gLogger);\n  assert(*runtime);\n  *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n  assert(*engine);\n  *context = (*engine)->createExecutionContext();\n  assert(*context);\n  delete[] serialized_engine;\n}\n\nint main(int argc, char** argv) {\n  cudaSetDevice(kGpuId);\n\n  std::string wts_name = \"\";\n  std::string engine_name = \"\";\n  std::string labels_filename = \"\";\n  float gd = 0.0f, gw = 0.0f;\n\n  std::string img_dir;\n  if (!parse_args(argc, argv, wts_name, engine_name, gd, gw, img_dir, labels_filename)) {\n    std::cerr << \"arguments not right!\" << std::endl;\n    std::cerr << \"./yolov5_seg -s [.wts] [.engine] [n/s/m/l/x or c gd gw]  // serialize model to plan file\" << std::endl;\n    std::cerr << \"./yolov5_seg -d [.engine] ../images coco.txt  // deserialize plan file, read the labels file and run inference\" << std::endl;\n    return -1;\n  }\n\n  // Create a model using the API directly and serialize it to a file\n  if (!wts_name.empty()) {\n    
serialize_engine(kBatchSize, gd, gw, wts_name, engine_name);\n    return 0;\n  }\n\n  // Deserialize the engine from file\n  IRuntime* runtime = nullptr;\n  ICudaEngine* engine = nullptr;\n  IExecutionContext* context = nullptr;\n  deserialize_engine(engine_name, &runtime, &engine, &context);\n  cudaStream_t stream;\n  CUDA_CHECK(cudaStreamCreate(&stream));\n\n  // Init CUDA preprocessing\n  cuda_preprocess_init(kMaxInputImageSize);\n\n  // Prepare cpu and gpu buffers\n  float* gpu_buffers[3];\n  float* cpu_output_buffer1 = nullptr;\n  float* cpu_output_buffer2 = nullptr;\n  prepare_buffers(engine, &gpu_buffers[0], &gpu_buffers[1], &gpu_buffers[2], &cpu_output_buffer1, &cpu_output_buffer2);\n\n  // Read images from directory\n  std::vector<std::string> file_names;\n  if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n    std::cerr << \"read_files_in_dir failed.\" << std::endl;\n    return -1;\n  }\n\n  // Read the txt file for classnames\n  std::ifstream labels_file(labels_filename, std::ios::binary);\n  if (!labels_file.good()) {\n    std::cerr << \"read \" << labels_filename << \" error!\" << std::endl;\n    return -1;\n  }\n  std::unordered_map<int, std::string> labels_map;\n  read_labels(labels_filename, labels_map);\n  assert(kNumClass == labels_map.size());\n\n  // batch predict\n  for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n    // Get a batch of images\n    std::vector<cv::Mat> img_batch;\n    std::vector<std::string> img_name_batch;\n    for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n      cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n      img_batch.push_back(img);\n      img_name_batch.push_back(file_names[j]);\n    }\n\n    // Preprocess\n    cuda_batch_preprocess(img_batch, gpu_buffers[0], kInputW, kInputH, stream);\n\n    // Run inference\n    auto start = std::chrono::system_clock::now();\n    infer(*context, stream, (void**)gpu_buffers, cpu_output_buffer1, cpu_output_buffer2, 
kBatchSize);\n    auto end = std::chrono::system_clock::now();\n    std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n    // NMS\n    std::vector<std::vector<Detection>> res_batch;\n    batch_nms(res_batch, cpu_output_buffer1, img_batch.size(), kOutputSize1, kConfThresh, kNmsThresh);\n\n    // Draw result and save image\n    for (size_t b = 0; b < img_name_batch.size(); b++) {\n      auto& res = res_batch[b];\n      cv::Mat img = img_batch[b];\n\n      auto masks = process_mask(&cpu_output_buffer2[b * kOutputSize2], kOutputSize2, res);\n      draw_mask_bbox(img, res, masks, labels_map);\n      cv::imwrite(\"_\" + img_name_batch[b], img);\n    }\n  }\n\n  // Release stream and buffers\n  cudaStreamDestroy(stream);\n  CUDA_CHECK(cudaFree(gpu_buffers[0]));\n  CUDA_CHECK(cudaFree(gpu_buffers[1]));\n  CUDA_CHECK(cudaFree(gpu_buffers[2]));\n  delete[] cpu_output_buffer1;\n  delete[] cpu_output_buffer2;\n  cuda_preprocess_destroy();\n  // Destroy the engine\n  context->destroy();\n  engine->destroy();\n  runtime->destroy();\n\n  return 0;\n}\n\n"
  },
  {
    "path": "yolov5/yolov5_seg_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov5 project.\n    param: \n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n        line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n       
 )\n\n\nclass YoLov5TRT(object):\n    \"\"\"\n    description: A YOLOv5 class that warps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('bingding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = 
cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n        # Data length\n        self.det_output_length = host_outputs[0].shape[0]\n        self.mask_output_length = host_outputs[1].shape[0]\n        self.seg_w = int(self.input_w / 4)\n        self.seg_h = int(self.input_h / 4)\n        self.seg_c = int(self.mask_output_length / (self.seg_w * self.seg_h))\n        self.det_row_output_length = self.seg_c + 6\n        \n        # Draw mask\n        self.colors_obj = Colors()\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        engine = self.engine\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, 
bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        cuda.memcpy_dtoh_async(host_outputs[1], cuda_outputs[1], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output_bbox = host_outputs[0]\n        output_proto_mask = host_outputs[1]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid, result_proto_coef = self.post_process(\n                output_bbox[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i], batch_origin_w[i]\n            )\n            if result_proto_coef.shape[0] == 0:\n                continue\n            result_masks = self.process_mask(output_proto_mask, result_proto_coef, result_boxes, batch_origin_h[i], batch_origin_w[i])\n\n            # Draw masks on  the original image\n            self.draw_mask(result_masks, colors_=[self.colors_obj(x, True) for x in result_classid],im_src=batch_image_raw[i])\n\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image 
from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Prepare data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            raw_bgr_image: numpy array, the original image in BGR order (HWC)\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate width, height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image along its long side while maintaining the aspect ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        
image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2\n            y /= r_h\n\n        return y\n\n    def post_process(self, output_boxes, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A numpy likes [num_boxes, cx, cy, w, h, conf, cls_id, mask[32], cx, cy, w, h, conf, cls_id, mask[32] ...] 
\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a boxes numpy, each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a numpy, each element is the score corresponding to a box\n            result_classid: final class ids, a numpy, each element is the class id corresponding to a box\n            result_proto_coef: final mask coefficients, a numpy, each row holds the coefficients corresponding to a box\n        \"\"\"\n        # Get the num of boxes detected\n        num = int(output_boxes[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output_boxes[1:], (-1, self.det_row_output_length))[:num, :]\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH,\n                                         nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        result_proto_coef = boxes[:, 6:] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid, result_proto_coef\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - 
box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None) * \\\n                     np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (cx, cy, w, h, conf, cls_id, mask coefficients[32])\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id, mask coefficients[32])\n        \"\"\"\n        # Get the boxes whose score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # clip the coordinates\n     
   boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, 5] == boxes[:, 5]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n    def sigmoid(self, x):\n        return 1 / (1 + np.exp(-x))\n\n    def scale_mask(self, mask, ih, iw):\n        mask = cv2.resize(mask, (self.input_w, self.input_h))\n        r_w = self.input_w / (iw * 1.0)\n        r_h = self.input_h / (ih * 1.0)\n        if r_h > r_w:\n            w = self.input_w\n            h = int(r_w * ih)\n            x = 0\n            y = int((self.input_h - h) / 2)\n        else:\n            w = int(r_h * iw)\n            h = self.input_h\n            x = int((self.input_w - w) / 2)\n            y = 0\n        crop = mask[y:y+h, x:x+w]\n        crop = cv2.resize(crop, (iw, ih))\n        return crop\n\n\n    def process_mask(self, output_proto_mask, result_proto_coef, result_boxes, ih, iw):\n        \"\"\"\n        description: Mask pred by yolov5 instance segmentation ,\n        param: \n            output_proto_mask: prototype mask e.g. 
(32, 160, 160) for 640x640 input\n            result_proto_coef: prototype mask coefficients (n, 32), n represents n results\n            result_boxes     :  \n            ih: rows of original image\n            iw: cols of original image\n        return:\n            mask_result: (n, ih, iw)\n        \"\"\"\n        result_proto_masks = output_proto_mask.reshape(self.seg_c, self.seg_h, self.seg_w)\n        c, mh, mw = result_proto_masks.shape\n        masks = self.sigmoid((result_proto_coef @ result_proto_masks.astype(np.float32).reshape(c, -1))).reshape(-1, mh, mw)\n        mask_result = []\n        for mask, box in zip(masks, result_boxes):\n            mask_s = np.zeros((ih, iw))\n            crop_mask = self.scale_mask(mask, ih, iw)            \n            x1 = int(box[0])\n            y1 = int(box[1])\n            x2 = int(box[2])\n            y2 = int(box[3])\n            crop = crop_mask[y1:y2, x1:x2]\n            crop = np.where(crop >= 0.5, 1, 0)\n            crop = crop.astype(np.uint8)\n            mask_s[y1:y2, x1:x2] = crop\n            mask_result.append(mask_s)\n        mask_result = np.array(mask_result)\n        return mask_result\n\n    def draw_mask(self, masks, colors_, im_src, alpha=0.5):\n        \"\"\"\n        description: Draw mask on image ,\n        param: \n            masks  : result_mask\n            colors_: color to draw mask\n            im_src : original image\n            alpha  : scale between original  image and mask\n        return:\n            no return\n        \"\"\"\n        if len(masks) == 0:\n            return\n        masks = np.asarray(masks, dtype=np.uint8)\n        masks = np.ascontiguousarray(masks.transpose(1, 2, 0))\n        masks = np.asarray(masks, dtype=np.float32)\n        colors_ = np.asarray(colors_, dtype=np.float32)\n        s = masks.sum(2, keepdims=True).clip(0, 1)\n        masks = (masks @ colors_).clip(0, 255)\n        im_src[:] = masks * alpha + im_src * (1 - s * alpha)\n\nclass 
inferThread(threading.Thread):\n    def __init__(self, yolov5_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov5_wrapper = yolov5_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov5_wrapper.infer(self.yolov5_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov5_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov5_wrapper = yolov5_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov5_wrapper.infer(self.yolov5_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nclass Colors:\n    def __init__(self):\n        hexs = ('FF3838', 'FF9D97', 'FF701F', 'FFB21D', 'CFD231', '48F90A',\n                '92CC17', '3DDB86', '1A9334', '00D4BB', '2C99A8', '00C2FF',\n                '344593', '6473FF', '0018EC', '8438FF', '520085', 'CB38FF',\n                'FF95C8', 'FF37C7')\n        self.palette = [self.hex2rgb(f'#{c}') for c in hexs]\n        self.n = len(self.palette)\n\n    def __call__(self, i, bgr=False):\n        c = self.palette[int(i) % self.n]\n        return (c[2], c[1], c[0]) if bgr else c\n\n    @staticmethod\n    def hex2rgb(h):  # rgb order (PIL)\n        return tuple(int(h[1 + i:1 + i + 2], 16) for i in (0, 2, 4))\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\n    engine_file_path = \"build/yolov5s-seg.engine\"\n\n    if 
len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n            \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n            \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n            \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n            \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n            \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n            \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n            \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n            \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov5TRT instance\n    yolov5_wrapper = YoLov5TRT(engine_file_path)\n    try:\n        print('batch size is', yolov5_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolov5_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov5_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            
thread1 = inferThread(yolov5_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov5_wrapper.destroy()\n"
  },
  {
    "path": "yolov5-lite/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(yolov5-lite)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\noption(CUDA_USE_STATIC_CUDA_RUNTIME OFF)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nfind_package(CUDA REQUIRED)\n\nif(WIN32)\nenable_language(CUDA)\nendif(WIN32)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\n# cuda\ninclude_directories(/usr/local/cuda/include)\nlink_directories(/usr/local/cuda/lib64)\n# tensorrt\n# include_directories(/usr/include/x86_64-linux-gnu/)\n# link_directories(/usr/lib/x86_64-linux-gnu/)\ninclude_directories(/opt/TensorRT-8.6.1.6/include)\nlink_directories(/opt/TensorRT-8.6.1.6/lib)\n\nset(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED\")\n\ncuda_add_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/yololayer.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\n#add_executable(yolov5 ${PROJECT_SOURCE_DIR}/calibrator.cpp ${PROJECT_SOURCE_DIR}/main.cpp)\nadd_executable(v5lite ${PROJECT_SOURCE_DIR}/calibrator.cpp ${PROJECT_SOURCE_DIR}/v5lite.cpp)\ntarget_link_libraries(v5lite nvinfer)\ntarget_link_libraries(v5lite cudart)\ntarget_link_libraries(v5lite myplugins)\ntarget_link_libraries(v5lite ${OpenCV_LIBS})\n\nif(UNIX)\nadd_definitions(-O2 -pthread)\nendif(UNIX)\n\n"
  },
  {
    "path": "yolov5-lite/README.md",
    "content": "# YOLOv5-Lite TensorRT Deployment\n\nDetection training code: [YOLOv5-Lite](https://github.com/ppogg/YOLOv5-Lite.git)\n\n## Environment\n\n- TensorRT: 8.6.1.6\n- CUDA: 12.6\n- cuDNN: 8.9.0\n- OpenCV: 4.10.0\n\n## Configuration parameters\n\nBefore starting, modify the parameters in `include/yololayer.h` to match your training configuration:\n\n```cpp\nstatic constexpr int MAX_OUTPUT_BBOX_COUNT = 1000;\nstatic constexpr int CLASS_NUM = 80;  // number of classes\nstatic constexpr int INPUT_H = 640;   // input height for yolov5-lite (must be divisible by 32)\nstatic constexpr int INPUT_W = 640;   // input width for yolov5-lite (must be divisible by 32)\nstatic constexpr int DEVICE = 0;\nstatic constexpr float NMS_THRESH = 0.4;\nstatic constexpr float CONF_THRESH = 0.45;\nstatic constexpr int BATCH_SIZE = 1;\nconst char* INPUT_BLOB_NAME = \"data\";\nconst char* OUTPUT_BLOB_NAME = \"prob\";\n```\n\n## 1. Generate .wts from .pt\n\nThis step must be performed inside the `yolov5-lite` folder:\n\n```bash\ncd yolov5-lite\ngit clone https://gitcode.com/open-source-toolkit/ac70a.git\n# unzip your zip file\n\npython gen_wts.py -w v5lite-s.pt -o v5lite-s.wts\npython gen_wts.py -w v5lite-e.pt -o v5lite-e.wts\npython gen_wts.py -w v5lite-g.pt -o v5lite-g.wts\n```\n\n## 2. Build the engine and run inference\n\n### Build steps\n\na. First, set `CLASS_NUM` in `include/yololayer.h` to match the number of classes in your dataset. This is important; otherwise you will get errors.\n\nb. 
Run the following commands:\n\n```bash\nmkdir build\ncd build\ncmake ..\nmake\n```\n\n### Generate engine files\n\n```bash\n./v5lite -s ../v5lite-s.wts v5lite-s.engine s\n./v5lite -s ../v5lite-g.wts v5lite-g.engine g\n./v5lite -s ../v5lite-e.wts v5lite-e.engine e\n./v5lite -s ../v5lite-c.wts v5lite-c.engine c\n```\n\n### Using the engine for inference\n\nHere `samples` is the folder containing your images:\n\n```bash\n./v5lite -d v5lite-s.engine ../samples\n```\n\nYou can also use `yolov5-lite-trt.py` (in the repository root) for inference.\n\n## 3. INT8 Quantization\n\n### Preparation\n\n1. Collect calibration images (about 1000 images recommended)\n2. Put the images in a calibration folder (for example: `tensorrtx-int8calib-data/coco_calib`)\n3. Modify the macro in [v5lite.cpp](v5lite.cpp):\n\n   Change:\n   ```cpp\n   // #define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32\n   // #define USE_INT8  // set USE_INT8 or USE_FP16 or USE_FP32\n   ```\n\n   To:\n   ```cpp\n   // #define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32\n   #define USE_INT8  // set USE_INT8 or USE_FP16 or USE_FP32\n   ```\n\n4. Update the data path in the code to point to your calibration images\n\n5. Rebuild and generate the engine, then run inference (repeat step 2)\n\n## Notes\n\n- In practice, calling the engine from Python may produce better inference behavior in some cases.\n"
  },
  {
    "path": "yolov5-lite/calibrator.cpp",
    "content": "#include <iostream>\n#include <iterator>\n#include <fstream>\n#include <opencv2/dnn/dnn.hpp>\n#include \"calibrator.h\"\n#include \"cuda_utils.h\"\n#include \"utils.h\"\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache)\n    : batchsize_(batchsize)\n    , input_w_(input_w)\n    , input_h_(input_h)\n    , img_idx_(0)\n    , img_dir_(img_dir)\n    , calib_table_name_(calib_table_name)\n    , input_blob_name_(input_blob_name)\n    , read_cache_(read_cache)\n{\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2()\n{\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT\n{\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT\n{\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + img_files_[i]);\n        if (temp.empty()){\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(pr_img);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0), true, false);\n\n    CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], 
input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT\n{\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good())\n    {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT\n{\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n\n"
  },
  {
    "path": "yolov5-lite/common.hpp",
    "content": "#ifndef YOLOV5_COMMON_H_\n#define YOLOV5_COMMON_H_\n\n#include <fstream>\n#include <map>\n#include <sstream>\n#include <vector>\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"yololayer.h\"\n\nusing namespace nvinfer1;\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    int l, r, t, b;\n    float r_w = Yolo::INPUT_W / (img.cols * 1.0);\n    float r_h = Yolo::INPUT_H / (img.rows * 1.0);\n    if (r_h > r_w) \n    {\n        l = bbox[0] - bbox[2] / 2.f;\n        r = bbox[0] + bbox[2] / 2.f;\n        t = bbox[1] - bbox[3] / 2.f - (Yolo::INPUT_H - r_w * img.rows) / 2;\n        b = bbox[1] + bbox[3] / 2.f - (Yolo::INPUT_H - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } \n    else\n    {\n        l = bbox[0] - bbox[2] / 2.f - (Yolo::INPUT_W - r_h * img.cols) / 2;\n        r = bbox[0] + bbox[2] / 2.f - (Yolo::INPUT_W - r_h * img.cols) / 2;\n        t = bbox[1] - bbox[3] / 2.f;\n        b = bbox[1] + bbox[3] / 2.f;\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    return cv::Rect(l, t, r - l, b - t);\n}\n\nfloat iou(float lbox[4], float rbox[4]) \n{\n    float interBox[] = {\n        (std::max)(lbox[0] - lbox[2] / 2.f , rbox[0] - rbox[2] / 2.f), //left\n        (std::min)(lbox[0] + lbox[2] / 2.f , rbox[0] + rbox[2] / 2.f), //right\n        (std::max)(lbox[1] - lbox[3] / 2.f , rbox[1] - rbox[3] / 2.f), //top\n        (std::min)(lbox[1] + lbox[3] / 2.f , rbox[1] + rbox[3] / 2.f), //bottom\n    };\n\n    if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n    {\n        std::cout << \"The data is questionable!\" << std::endl;\n        return 0.0f;\n    }\n\n    float interBoxS = (interBox[1] - interBox[0])*(interBox[3] - interBox[2]);\n    return interBoxS / (lbox[2] * lbox[3] + rbox[2] * rbox[3] - interBoxS);\n}\n\nbool cmp(const Yolo::Detection& a, const Yolo::Detection& b) \n{\n    return a.conf > 
b.conf;\n}\n\nvoid nms(std::vector<Yolo::Detection>& res, float *output, float conf_thresh, float nms_thresh = 0.5) \n{\n    int det_size = sizeof(Yolo::Detection) / sizeof(float);\n    std::map<float, std::vector<Yolo::Detection>> m;\n    for (int i = 0; i < output[0] && i < Yolo::MAX_OUTPUT_BBOX_COUNT; i++) {\n        if (output[1 + det_size * i + 4] <= conf_thresh) continue;\n        Yolo::Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        if (m.count(det.class_id) == 0) m.emplace(det.class_id, std::vector<Yolo::Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        //std::cout << it->second[0].class_id << \" --- \" << std::endl;\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) \n{\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. 
please check if the .wts file path is right!!!!!!\");\n\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        /*\n        class Weights\n        {\n        public:\n            DataType type;      //!< The type of the weights.\n            void const* values; //!< The weight values, in a contiguous array.\n            int64_t count;      //!< The number of weights in the array.\n        };\n        */\n        Weights wt{ DataType::kFLOAT, nullptr, 0 };\n        uint32_t size;\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n        // Allocate one 32-bit word per weight (sizeof the element, not of the pointer)\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n    //  for (auto it = weightMap.begin(); it != weightMap.end(); it++) {\n    //     std::cout << \"========= keys: \" << it -> first << \" =================\" <<  std::endl;\n    // }\n\n    return weightMap;\n}\n\nnvinfer1::IScaleLayer* addBatchNorm2d(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps)\n {\n    float *gamma = (float*)weightMap[lname + \".weight\"].values;\n    float *beta = (float*)weightMap[lname + \".bias\"].values;\n    float *mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float *var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float *scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    // gamma / sqrt(running_var + eps)\n    for (int i = 0; i < len; i++)\n    {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{ DataType::kFLOAT, scval, len };\n\n    float *shval = 
reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) \n    {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{ DataType::kFLOAT, shval, len };\n\n    float *pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) \n    {\n        pval[i] = 1.0;\n    }\n    Weights power{ DataType::kFLOAT, pval, len };\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nnvinfer1::IPoolingLayer *conv_bn_relu_maxpool(nvinfer1::INetworkDefinition *network, std::map<std::string, nvinfer1::Weights> & weightMap, nvinfer1::ITensor &input, int outch, std::string lname){\n  nvinfer1::Weights emptywts{nvinfer1::DataType::kFLOAT, nullptr, 0};\n  nvinfer1::IConvolutionLayer *conv0 = network->addConvolutionNd(input, outch, nvinfer1::DimsHW{3, 3}, weightMap[lname + \"conv.0.weight\"], emptywts);\n  conv0->setStrideNd(nvinfer1::DimsHW{2, 2});\n  conv0->setPaddingNd(nvinfer1::DimsHW{1, 1});\n\n  nvinfer1::IScaleLayer * bn1 = addBatchNorm2d(network, weightMap, *conv0->getOutput(0), lname + \"conv.1\", 1e-3);\n  \n  auto Relu = network->addActivation(*bn1->getOutput(0), nvinfer1::ActivationType::kRELU);\n  assert(Relu);\n  IPoolingLayer *pool = network->addPoolingNd(*Relu->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{3, 3});\n  pool->setStrideNd(nvinfer1::DimsHW{2, 2});\n  pool->setPaddingNd(nvinfer1::DimsHW{1, 1});\n  assert(pool);\n  return pool;\n}\n\n\n\n\nnvinfer1::IElementWiseLayer *HardSwish(nvinfer1::INetworkDefinition *network, nvinfer1::ITensor &input){\n    auto hsig = network->addActivation(input, ActivationType::kHARD_SIGMOID);\n    hsig->setAlpha(1.0 / 6.0);\n    hsig->setBeta(0.5);\n    auto ew = 
network->addElementWise(input, *hsig->getOutput(0), ElementWiseOperation::kPROD);\n    return ew;\n    \n}\n\n\n\nnvinfer1::IElementWiseLayer *CBH(nvinfer1::INetworkDefinition *network, std::map<std::string, nvinfer1::Weights> &weightMap, nvinfer1::ITensor &input, \n        int num_filters, int filter_size, int stride, std::string lname, int num_groups=1){\n    \n    int pad = (filter_size - 1) / 2;\n    nvinfer1::Weights emptywts {nvinfer1::DataType::kFLOAT, nullptr, 0};\n\n    nvinfer1::IConvolutionLayer *conv = network->addConvolutionNd(input, num_filters, nvinfer1::DimsHW{filter_size, filter_size}, \n                 weightMap[lname + \".conv.weight\"], emptywts);\n    conv->setStrideNd(nvinfer1::DimsHW{stride, stride});\n    conv->setPaddingNd(nvinfer1::DimsHW{pad, pad});\n    conv->setNbGroups(num_groups);\n\n    nvinfer1::IScaleLayer *bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n    nvinfer1::IElementWiseLayer *hash = HardSwish(network, *bn->getOutput(0));\n    \n    nvinfer1::Dims dims = hash->getOutput(0)->getDimensions();\n   \n    return hash;\n}\n\n\n\n\nnvinfer1::IElementWiseLayer *SiLU(nvinfer1::INetworkDefinition* network, nvinfer1::ITensor& input)\n{\n    // Create Sigmoid activation layer\n    nvinfer1::IActivationLayer *sig = network->addActivation(input, ActivationType::kSIGMOID);\n\n    nvinfer1::IElementWiseLayer *mul = network->addElementWise(input, *sig->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n\n    return mul;\n}\n\n\nnvinfer1::IElementWiseLayer *LC_SEModule(nvinfer1::INetworkDefinition *network, std::map<std::string, nvinfer1::Weights> &weightMap, nvinfer1::ITensor &input,\n       int in_channels, std::string lname, int reduction=4){\n\n    nvinfer1::IIdentityLayer *identity = network->addIdentity(input);\n    nvinfer1::IReduceLayer *avg_pool = network->addReduce(input, nvinfer1::ReduceOperation::kAVG, (1 << 1) | (1 << 2), true);\n    nvinfer1::IConvolutionLayer *conv1 = 
network->addConvolutionNd(*avg_pool->getOutput(0), in_channels / reduction, nvinfer1::DimsHW{1, 1},\n             weightMap[lname + \".conv1.weight\"], weightMap[lname + \".conv1.bias\"]);\n    nvinfer1::IActivationLayer *relu = network->addActivation(*conv1->getOutput(0), nvinfer1::ActivationType::kRELU);\n    nvinfer1::IConvolutionLayer *conv2 = network->addConvolutionNd(*relu->getOutput(0), in_channels, nvinfer1::DimsHW{1, 1},\n             weightMap[lname + \".conv2.weight\"], weightMap[lname + \".conv2.bias\"]);\n    nvinfer1::IElementWiseLayer *silu = SiLU(network, *conv2->getOutput(0));\n\n    nvinfer1::IElementWiseLayer *out = network->addElementWise(*silu->getOutput(0), *identity->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n\n    return out;\n}\n\nnvinfer1::IElementWiseLayer *LC_Block(nvinfer1::INetworkDefinition *network, std::map<std::string, nvinfer1::Weights> &weightMap, nvinfer1::ITensor &input,\n     int num_channels, int num_filters, int stride, int dw_size, std::string lname, bool use_se=false){\n    // num_channels : in_channel\n    // num_filters : out_channel\n    // stride:dw_conv's stride\n    // dw_size: dw_conv's filter-size\n    nvinfer1::IElementWiseLayer *dw_conv = CBH(network, weightMap, input, num_channels, dw_size, stride, lname + \".dw_conv\", num_channels);\n    if(use_se){\n        nvinfer1::IElementWiseLayer *se = LC_SEModule(network, weightMap, *dw_conv->getOutput(0), num_channels, lname + \".se\");\n        nvinfer1::IElementWiseLayer *pw_conv = CBH(network, weightMap, *se->getOutput(0), num_filters, 1, 1, lname + \".pw_conv\");\n\n        return pw_conv;\n    }\n    nvinfer1::IElementWiseLayer *pw_conv = CBH(network, weightMap, *dw_conv->getOutput(0), num_filters, 1, 1, lname + \".pw_conv\");\n    \n    return pw_conv;\n}\n\n\nnvinfer1::IElementWiseLayer *Dense(nvinfer1::INetworkDefinition *network, std::map<std::string, nvinfer1::Weights> &weightMap, nvinfer1::ITensor &input, \n      int num_filters, int filter_size, 
std::string lname){\n    nvinfer1::Weights emptywts{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer *dense_conv = network->addConvolutionNd(input, num_filters, nvinfer1::DimsHW{filter_size, filter_size}\n     , weightMap[lname + \".dense_conv.weight\"], emptywts);\n    \n    nvinfer1::IElementWiseLayer *hash = HardSwish(network, *dense_conv->getOutput(0));\n    nvinfer1::Dims dims_o = hash->getOutput(0)->getDimensions();\n    return hash;\n}\n\n\nnvinfer1::IElementWiseLayer* convBlock(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int g, std::string lname) {\n  Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n  int p = ksize / 3;\n  IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ ksize, ksize }, weightMap[lname + \".conv.weight\"], emptywts);\n  assert(conv1);\n  conv1->setStrideNd(DimsHW{ s, s });\n  conv1->setPaddingNd(DimsHW{ p, p });\n  conv1->setNbGroups(g);\n  IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn\", 1e-3);\n\n  // silu = x * sigmoid\n  auto sig = network->addActivation(*bn1->getOutput(0), ActivationType::kSIGMOID);\n  assert(sig);\n  auto ew = network->addElementWise(*bn1->getOutput(0), *sig->getOutput(0), ElementWiseOperation::kPROD);\n  assert(ew);\n\n  return ew;\n}\n\nnvinfer1::IShuffleLayer* shuffle_block(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, int inch, int outch, int s) {\n    Weights emptywts{DataType::kFLOAT, nullptr, 0};\n    int branch_features = outch / 2;\n    ITensor *x1, *x2i, *x2o;\n    if (s > 1) {\n        IConvolutionLayer* conv1 = network->addConvolutionNd(input, inch, DimsHW{3, 3}, weightMap[lname + \"branch1.0.weight\"], emptywts);\n        assert(conv1);\n        conv1->setStrideNd(DimsHW{s, s});\n        conv1->setPaddingNd(DimsHW{1, 1});\n        conv1->setNbGroups(inch);\n        IScaleLayer 
*bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \"branch1.1\", 1e-5);\n        IConvolutionLayer* conv2 = network->addConvolutionNd(*bn1->getOutput(0), branch_features, DimsHW{1, 1}, weightMap[lname + \"branch1.2.weight\"], emptywts);\n        assert(conv2);\n        IScaleLayer *bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \"branch1.3\", 1e-5);\n        IActivationLayer* relu1 = network->addActivation(*bn2->getOutput(0), ActivationType::kRELU);\n        assert(relu1);\n        x1 = relu1->getOutput(0);\n        x2i = &input;\n    } else {\n        Dims d = input.getDimensions();\n        ISliceLayer *s1 = network->addSlice(input, Dims3{ 0, 0, 0 }, Dims3{ d.d[0] / 2, d.d[1], d.d[2] }, Dims3{ 1, 1, 1 });\n        ISliceLayer *s2 = network->addSlice(input, Dims3{ d.d[0] / 2, 0, 0 }, Dims3{ d.d[0] / 2, d.d[1], d.d[2] }, Dims3{ 1, 1, 1 });\n        x1 = s1->getOutput(0);\n        x2i = s2->getOutput(0);\n    }\n\n    IConvolutionLayer* conv3 = network->addConvolutionNd(*x2i, branch_features, DimsHW{1, 1}, weightMap[lname + \"branch2.0.weight\"], emptywts);\n    assert(conv3);\n    IScaleLayer *bn3 = addBatchNorm2d(network, weightMap, *conv3->getOutput(0), lname + \"branch2.1\", 1e-5);\n    IActivationLayer* relu2 = network->addActivation(*bn3->getOutput(0), ActivationType::kRELU);\n    assert(relu2);\n    IConvolutionLayer* conv4 = network->addConvolutionNd(*relu2->getOutput(0), branch_features, DimsHW{3, 3}, weightMap[lname + \"branch2.3.weight\"], emptywts);\n    assert(conv4);\n    conv4->setStrideNd(DimsHW{s, s});\n    conv4->setPaddingNd(DimsHW{1, 1});\n    conv4->setNbGroups(branch_features);\n    IScaleLayer *bn4 = addBatchNorm2d(network, weightMap, *conv4->getOutput(0), lname + \"branch2.4\", 1e-5);\n    IConvolutionLayer* conv5 = network->addConvolutionNd(*bn4->getOutput(0), branch_features, DimsHW{1, 1}, weightMap[lname + \"branch2.5.weight\"], emptywts);\n    assert(conv5);\n    IScaleLayer *bn5 = 
addBatchNorm2d(network, weightMap, *conv5->getOutput(0), lname + \"branch2.6\", 1e-5);\n    IActivationLayer* relu3 = network->addActivation(*bn5->getOutput(0), ActivationType::kRELU);\n    assert(relu3);\n\n    ITensor* inputTensors1[] = {x1, relu3->getOutput(0)};\n    IConcatenationLayer* cat1 = network->addConcatenation(inputTensors1, 2);\n    assert(cat1);\n\n    Dims dims = cat1->getOutput(0)->getDimensions();\n    std::cout << cat1->getOutput(0)->getName() << \" dims: \";\n    for (int i = 0; i < dims.nbDims; i++) {\n        std::cout << dims.d[i] << \", \";\n    }\n    std::cout << std::endl;\n\n    IShuffleLayer *sf1 = network->addShuffle(*cat1->getOutput(0));\n    assert(sf1);\n    sf1->setReshapeDimensions(Dims4(2, dims.d[0] / 2, dims.d[1], dims.d[2]));\n    sf1->setSecondTranspose(Permutation{1, 0, 2, 3});\n\n    Dims dims1 = sf1->getOutput(0)->getDimensions();\n    std::cout << sf1->getOutput(0)->getName() << \" dims: \";\n    for (int i = 0; i < dims1.nbDims; i++) {\n        std::cout << dims1.d[i] << \", \";\n    }\n    std::cout << std::endl;\n\n    IShuffleLayer *sf2 = network->addShuffle(*sf1->getOutput(0));\n    assert(sf2);\n    sf2->setReshapeDimensions(Dims3(dims.d[0], dims.d[1], dims.d[2]));\n\n    Dims dims2 = sf2->getOutput(0)->getDimensions();\n    std::cout << sf2->getOutput(0)->getName() << \" dims: \";\n    for (int i = 0; i < dims2.nbDims; i++) {\n        std::cout << dims2.d[i] << \", \";\n    }\n    std::cout << std::endl;\n\n    return sf2;\n}\n\nnvinfer1::IElementWiseLayer* SPP(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, \n  int c1, int c2, int k1, int k2, int k3, std::string lname) {\n  int c_ = c1 / 2;\n  auto cv1 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv1\");\n\n  auto pool1 = network->addPoolingNd(*cv1->getOutput(0), PoolingType::kMAX, DimsHW{ k1, k1 });\n  pool1->setPaddingNd(DimsHW{ k1 / 2, k1 / 2 });\n  pool1->setStrideNd(DimsHW{ 1, 1 });\n  auto pool2 = 
network->addPoolingNd(*cv1->getOutput(0), PoolingType::kMAX, DimsHW{ k2, k2 });\n  pool2->setPaddingNd(DimsHW{ k2 / 2, k2 / 2 });\n  pool2->setStrideNd(DimsHW{ 1, 1 });\n  auto pool3 = network->addPoolingNd(*cv1->getOutput(0), PoolingType::kMAX, DimsHW{ k3, k3 });\n  pool3->setPaddingNd(DimsHW{ k3 / 2, k3 / 2 });\n  pool3->setStrideNd(DimsHW{ 1, 1 });\n\n  ITensor* inputTensors[] = { cv1->getOutput(0), pool1->getOutput(0), pool2->getOutput(0), pool3->getOutput(0) };\n  auto cat = network->addConcatenation(inputTensors, 4);\n\n  auto cv2 = convBlock(network, weightMap, *cat->getOutput(0), c2, 1, 1, 1, lname + \".cv2\");\n  return cv2;\n}\n\nnvinfer1::IElementWiseLayer* bottleneck(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, bool shortcut, int g, float e, std::string lname) {\n  auto cv1 = convBlock(network, weightMap, input, (int)((float)c2 * e), 1, 1, 1, lname + \".cv1\");\n  auto cv2 = convBlock(network, weightMap, *cv1->getOutput(0), c2, 3, 1, g, lname + \".cv2\");\n  if (shortcut && c1 == c2) {\n    auto ew = network->addElementWise(input, *cv2->getOutput(0), ElementWiseOperation::kSUM);\n    return ew;\n  }\n  return cv2;\n}\n\nnvinfer1::IElementWiseLayer* bottleneckCSP(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, int n, bool shortcut, int g, float e, std::string lname) {\n  Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n  int c_ = (int)((float)c2 * e);\n  auto cv1 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv1\");\n  auto cv2 = network->addConvolutionNd(input, c_, DimsHW{ 1, 1 }, weightMap[lname + \".cv2.weight\"], emptywts);\n  ITensor* y1 = cv1->getOutput(0);\n  for (int i = 0; i < n; i++) {\n    auto b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, g, 1.0, lname + \".m.\" + std::to_string(i));\n    y1 = b->getOutput(0);\n  }\n  auto cv3 = network->addConvolutionNd(*y1, c_, DimsHW{ 1, 1 }, weightMap[lname + 
\".cv3.weight\"], emptywts);\n\n  ITensor* inputTensors[] = { cv3->getOutput(0), cv2->getOutput(0) };\n  auto cat = network->addConcatenation(inputTensors, 2);\n\n  IScaleLayer* bn = addBatchNorm2d(network, weightMap, *cat->getOutput(0), lname + \".bn\", 1e-4);\n  auto lr = network->addActivation(*bn->getOutput(0), ActivationType::kLEAKY_RELU);\n  lr->setAlpha(0.1);\n\n  auto cv4 = convBlock(network, weightMap, *lr->getOutput(0), c2, 1, 1, 1, lname + \".cv4\");\n  return cv4;\n}\n\nnvinfer1::IElementWiseLayer* C3(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input,\n           int c1, int c2, int n, bool shortcut, int g, float e, std::string lname) {\n  int c_ = (int)((float)c2 * e);\n  auto cv1 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv1\");\n  auto cv2 = convBlock(network, weightMap, input, c_, 1, 1, 1, lname + \".cv2\");\n  ITensor *y1 = cv1->getOutput(0);\n  for (int i = 0; i < n; i++) {\n    auto b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, g, 1.0, lname + \".m.\" + std::to_string(i));\n    y1 = b->getOutput(0);\n  }\n\n  ITensor* inputTensors[] = { y1, cv2->getOutput(0) };\n  auto cat = network->addConcatenation(inputTensors, 2);\n\n  auto cv3 = convBlock(network, weightMap, *cat->getOutput(0), c2, 1, 1, 1, lname + \".cv3\");\n  return cv3;\n}\n\n\nnvinfer1::IScaleLayer *conv_bn(nvinfer1::INetworkDefinition *network, std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input,\n           std::string lname, int out_channels, int kernel_size, int stride, int padding, int groups=1){\n    nvinfer1::Weights emptywts{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer *conv = network->addConvolutionNd(input, out_channels, nvinfer1::DimsHW{kernel_size, kernel_size},\n         weightMap[lname + \".conv.weight\"], emptywts);\n    conv->setStrideNd(nvinfer1::DimsHW{stride, stride});\n    conv->setPaddingNd(nvinfer1::DimsHW{padding, padding});\n    
conv->setNbGroups(groups);\n\n    nvinfer1::IScaleLayer *bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-5);\n    return bn;\n   }\n\nnvinfer1::IActivationLayer *RepVGGBlock(nvinfer1::INetworkDefinition *network, std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input,\n        std::string lname, int out_channels, int kernel_size = 3, int stride = 1, int padding = 1, int groups=1){\n\n    nvinfer1::IScaleLayer *rbr_dense = conv_bn(network, weightMap, input, lname + \".rbr_dense\", out_channels, kernel_size, stride, padding, groups);\n    int padding_11 = padding - kernel_size / 2;\n    nvinfer1::IScaleLayer *rbr_1x1 = conv_bn(network, weightMap, input, lname + \".rbr_1x1\", out_channels, 1, stride, padding_11, groups);\n    nvinfer1::IElementWiseLayer *add = network->addElementWise(*rbr_dense->getOutput(0), *rbr_1x1->getOutput(0),  nvinfer1::ElementWiseOperation::kSUM);\n\n    nvinfer1::IActivationLayer *silu = network->addActivation(*add->getOutput(0), nvinfer1::ActivationType::kRELU);\n    return silu;\n}\n\nnvinfer1::IActivationLayer *DWConvblock(nvinfer1::INetworkDefinition *network, std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input,\n     std::string lname, int in_channels, int out_channels, int kernel_size, int stride){\n    nvinfer1::Weights emptywts {nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer *conv1 = network->addConvolutionNd(input, in_channels, nvinfer1::DimsHW{kernel_size, kernel_size},\n      weightMap[lname + \".conv1.weight\"], emptywts);\n    conv1->setStrideNd(nvinfer1::DimsHW{stride, stride});\n    std::cout << (kernel_size / 2) << std::endl;\n    conv1->setPaddingNd(nvinfer1::DimsHW{kernel_size / 2, kernel_size / 2});\n    conv1->setNbGroups(in_channels);\n    nvinfer1::IScaleLayer *bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn1\", 1e-5);\n    nvinfer1::IActivationLayer *relu1 = 
network->addActivation(*bn1->getOutput(0), nvinfer1::ActivationType::kRELU);\n    nvinfer1::IConvolutionLayer *conv2 = network->addConvolutionNd(*relu1->getOutput(0), out_channels, nvinfer1::DimsHW{1, 1}, weightMap[lname + \".conv2.weight\"], emptywts);\n    conv2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IScaleLayer *bn2 = addBatchNorm2d(network, weightMap, *conv2->getOutput(0), lname + \".bn2\", 1e-5);\n    nvinfer1::IActivationLayer *relu2 = network->addActivation(*bn2->getOutput(0), nvinfer1::ActivationType::kRELU);\n\n    return relu2;    \n    }\n\nstd::vector<std::vector<float>> getAnchors(std::map<std::string, Weights>& weightMap, std::string lname) \n{\n    std::vector<std::vector<float>> anchors;\n    Weights wts = weightMap[lname + \".anchor_grid\"];\n    int anchor_len = Yolo::CHECK_COUNT * 2; // 6\n    for (int i = 0; i < wts.count / anchor_len; i++) \n    {\n        auto *p = (const float*)wts.values + i * anchor_len;\n        std::vector<float> anchor(p, p + anchor_len);\n        anchors.push_back(anchor);\n    }\n    return anchors;\n}\n\nnvinfer1::IElementWiseLayer* focus(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch,\n           int outch, int ksize, std::string lname) {\n  ISliceLayer* s1 = network->addSlice(input, Dims3{ 0, 0, 0 }, Dims3{ inch, Yolo::INPUT_H / 2, Yolo::INPUT_W / 2 }, Dims3{ 1, 2, 2 });\n  ISliceLayer* s2 = network->addSlice(input, Dims3{ 0, 1, 0 }, Dims3{ inch, Yolo::INPUT_H / 2, Yolo::INPUT_W / 2 }, Dims3{ 1, 2, 2 });\n  ISliceLayer* s3 = network->addSlice(input, Dims3{ 0, 0, 1 }, Dims3{ inch, Yolo::INPUT_H / 2, Yolo::INPUT_W / 2 }, Dims3{ 1, 2, 2 });\n  ISliceLayer* s4 = network->addSlice(input, Dims3{ 0, 1, 1 }, Dims3{ inch, Yolo::INPUT_H / 2, Yolo::INPUT_W / 2 }, Dims3{ 1, 2, 2 });\n  ITensor* inputTensors[] = { s1->getOutput(0), s2->getOutput(0), s3->getOutput(0), s4->getOutput(0) };\n  auto cat = 
network->addConcatenation(inputTensors, 4);\n  auto conv = convBlock(network, weightMap, *cat->getOutput(0), outch, ksize, 1, 1, lname + \".conv\");\n  return conv;\n}\n\nnvinfer1::IElementWiseLayer *ADD(nvinfer1::INetworkDefinition *network,nvinfer1::ITensor& x1,nvinfer1::ITensor& x2, float alpha) {\n    // TensorRT holds weights by reference until the engine is built, so the scale value must outlive this call.\n    // A pointer to the by-value parameter `alpha` would dangle once ADD returns; keep a heap copy instead.\n    float *alpha_val = reinterpret_cast<float*>(malloc(sizeof(float)));\n    *alpha_val = alpha;\n    nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, nullptr, 0}; \n    nvinfer1::Weights scale{nvinfer1::DataType::kFLOAT, alpha_val, 1};  \n    nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, nullptr, 0}; \n\n    nvinfer1::IScaleLayer* scaleLayer = network->addScale(x2, nvinfer1::ScaleMode::kUNIFORM, shift, scale, power);\n\n    nvinfer1::IElementWiseLayer* addLayer = network->addElementWise(x1, *scaleLayer->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n\n    return addLayer; \n}\n\nIPluginV2Layer* addYoLoLayer(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, std::string lname, std::vector<IConvolutionLayer*> dets) {\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    auto anchors = getAnchors(weightMap, lname);\n    PluginField plugin_fields[2];\n    int netinfo[4] = {Yolo::CLASS_NUM, Yolo::INPUT_W, Yolo::INPUT_H, Yolo::MAX_OUTPUT_BBOX_COUNT};\n    plugin_fields[0].data = netinfo;\n    plugin_fields[0].length = 4;\n    plugin_fields[0].name = \"netinfo\";\n    plugin_fields[0].type = PluginFieldType::kFLOAT32;\n    int scale = 8;\n    std::vector<Yolo::YoloKernel> kernels;\n    for (size_t i = 0; i < anchors.size(); i++) {\n        Yolo::YoloKernel kernel;\n        kernel.width = Yolo::INPUT_W / scale;\n        kernel.height = Yolo::INPUT_H / scale;\n        memcpy(kernel.anchors, &anchors[i][0], anchors[i].size() * sizeof(float));\n        kernels.push_back(kernel);\n        scale *= 2;\n    }\n    plugin_fields[1].data = &kernels[0];\n    plugin_fields[1].length = kernels.size();\n    plugin_fields[1].name = \"kernels\";\n    plugin_fields[1].type = PluginFieldType::kFLOAT32;\n    
PluginFieldCollection plugin_data;\n    plugin_data.nbFields = 2;\n    plugin_data.fields = plugin_fields;\n    IPluginV2 *plugin_obj = creator->createPlugin(\"yololayer\", &plugin_data);\n    std::vector<ITensor*> input_tensors;\n    for (auto det: dets) {\n        input_tensors.push_back(det->getOutput(0));\n    }\n    auto yolo = network->addPluginV2(&input_tensors[0], input_tensors.size(), *plugin_obj);\n    return yolo;\n}\n\n\n\n#endif\n\n"
  },
  {
    "path": "yolov5-lite/gen_wts.py",
    "content": "import argparse\nimport os\nimport struct\nimport torch\nfrom utils.torch_utils import select_device\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\n    parser.add_argument('-w', '--weights', required=True, help='Input weights (.pt) file path (required)')\n    parser.add_argument('-o', '--output', help='Output (.wts) file path (optional)')\n    args = parser.parse_args()\n    if not os.path.isfile(args.weights):\n        raise SystemExit('Invalid input file')\n    if not args.output:\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\n    elif os.path.isdir(args.output):\n        args.output = os.path.join(\n            args.output,\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\n    return args.weights, args.output\n\n\npt_file, wts_file = parse_args()\n\n# Initialize\ndevice = select_device('cpu')\n# Load model\nmodel = torch.load(pt_file, map_location=device, weights_only=False)['model'].float()  # load to FP32\nmodel.to(device).eval()\n\nwith open(wts_file, 'w') as f:\n    # Write the number of keys in the parameter dictionary first\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        # Flatten matrix parameters into a 1D array\n        vr = v.reshape(-1).cpu().numpy()\n        # Key, number of elements in the 1D array\n        f.write('{} {} '.format(k, len(vr)))\n        # Values, each separated by a space\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolov5-lite/v5lite.cpp",
    "content": "#include <iostream>\r\n#include <chrono>\r\n#include <cmath>\r\n#include <cstdio>\r\n#include<cassert>\r\n\r\n\r\n#include \"cuda_utils.h\"\r\n#include \"logging.h\"\r\n#include \"common.hpp\"\r\n#include \"utils.h\"\r\n#include \"calibrator.h\"\r\n\r\n// #define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32\r\n#define USE_INT8  // set USE_INT8 or USE_FP16 or USE_FP32\r\n\r\n\r\n\r\nstatic const int OUTPUT_SIZE = Yolo::MAX_OUTPUT_BBOX_COUNT * sizeof(Yolo::Detection) / sizeof(float) + 1;  // we assume the yololayer outputs no more than MAX_OUTPUT_BBOX_COUNT boxes that conf >= 0.1\r\nstatic Logger gLogger;\r\n\r\nstatic int get_depth(int x, float gd) {\r\n    if (x == 1) return 1;\r\n    int r = round(x * gd);\r\n    if (x * gd - int(x * gd) == 0.5 && (int(x * gd) % 2) == 0) {\r\n        --r;\r\n    }\r\n    return std::max<int>(r, 1);\r\n}\r\n\r\ninline int Get_channel(int x, int gw = 1, float divisor = 8.0){\r\n  // std::cout << \"=======\" << (x*gw) / divisor << \"===============\" << std::endl;\r\n  auto ch_out = int(ceil((x * gw) / divisor)) * divisor;\r\n  return ch_out;\r\n}\r\n\r\nnvinfer1::ICudaEngine *build_det_v5_lite_c(unsigned int maxBatchSize, nvinfer1::IBuilder *builder, nvinfer1::IBuilderConfig *config, \r\n       nvinfer1::DataType dt, std::string wts_name)\r\n{\r\n\r\n  nvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\r\n  nvinfer1::ITensor *data = network->addInput(Yolo::INPUT_BLOB_NAME, dt, nvinfer1::Dims3{3, Yolo::INPUT_W, Yolo::INPUT_H});\r\n  std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_name);\r\n\r\n\r\n  // backbone\r\n  nvinfer1::IElementWiseLayer *conv0 = CBH(network, weightMap, *data, Get_channel(32), 3, 2, \"model.0\");\r\n  nvinfer1::IElementWiseLayer *conv1 = LC_Block(network, weightMap, *conv0->getOutput(0), Get_channel(32), Get_channel(64), 2, 3, \"model.1\", false);\r\n  nvinfer1::IElementWiseLayer *conv2 = LC_Block(network, weightMap, *conv1->getOutput(0), 
Get_channel(64), Get_channel(64), 1, 3, \"model.2\", false);\r\n  nvinfer1::IElementWiseLayer *conv3 = LC_Block(network, weightMap, *conv2->getOutput(0), Get_channel(64), Get_channel(128), 2, 3, \"model.3\", false);\r\n  nvinfer1::IElementWiseLayer *conv4 = LC_Block(network, weightMap, *conv3->getOutput(0), Get_channel(128), Get_channel(128), 1, 3, \"model.4\", false);\r\n  nvinfer1::IElementWiseLayer *conv5 = LC_Block(network, weightMap, *conv4->getOutput(0), Get_channel(128), Get_channel(128), 1, 3, \"model.5\", false);\r\n  nvinfer1::IElementWiseLayer *conv6 = LC_Block(network, weightMap, *conv5->getOutput(0), Get_channel(128), Get_channel(128), 1, 3, \"model.6\", false);\r\n  nvinfer1::IElementWiseLayer *conv7 = LC_Block(network, weightMap, *conv6->getOutput(0), Get_channel(128), Get_channel(256), 2, 3, \"model.7\", false);\r\n  nvinfer1::IElementWiseLayer *conv8 = LC_Block(network, weightMap, *conv7->getOutput(0), Get_channel(256), Get_channel(256), 1, 5, \"model.8\", false);\r\n  nvinfer1::IElementWiseLayer *conv9 = LC_Block(network, weightMap, *conv8->getOutput(0), Get_channel(256), Get_channel(256), 1, 5, \"model.9\", false);\r\n  nvinfer1::IElementWiseLayer *conv10 = LC_Block(network, weightMap, *conv9->getOutput(0), Get_channel(256), Get_channel(256), 1, 5, \"model.10\", false);\r\n  nvinfer1::IElementWiseLayer *conv11 = LC_Block(network, weightMap, *conv10->getOutput(0), Get_channel(256), Get_channel(256), 1, 5, \"model.11\", false);\r\n  nvinfer1::IElementWiseLayer *conv12 = LC_Block(network, weightMap, *conv11->getOutput(0), Get_channel(256), Get_channel(256), 1, 5, \"model.12\", false);\r\n  nvinfer1::IElementWiseLayer *conv13 = LC_Block(network, weightMap, *conv12->getOutput(0), Get_channel(256), Get_channel(512), 2, 5, \"model.13\", true);\r\n  nvinfer1::IElementWiseLayer *conv14 = LC_Block(network, weightMap, *conv13->getOutput(0), Get_channel(512), Get_channel(512), 1, 5, \"model.14\", true);\r\n  nvinfer1::IElementWiseLayer *conv15 = 
LC_Block(network, weightMap, *conv14->getOutput(0), Get_channel(512), Get_channel(512), 1, 5, \"model.15\", true);\r\n  nvinfer1::IElementWiseLayer *conv16 = LC_Block(network, weightMap, *conv15->getOutput(0), Get_channel(512), Get_channel(512), 1, 5, \"model.16\", true);\r\n  nvinfer1::IElementWiseLayer *conv17 = Dense(network, weightMap, *conv16->getOutput(0), Get_channel(512), 1, \"model.17\");\r\n\r\n  // neck\r\n  float scale[] = {1.0, 2.0, 2.0};\r\n  nvinfer1::IElementWiseLayer *conv18 = convBlock(network, weightMap, *conv17->getOutput(0), Get_channel(256), 1, 1, 1, \"model.18\");\r\n  nvinfer1::IResizeLayer *upsample19 = network->addResize(*conv18->getOutput(0));\r\n  upsample19->setScales(scale, 3);\r\n  nvinfer1::ITensor *inputTensors20[] = {upsample19->getOutput(0), conv12->getOutput(0)}; // 256 + 256 = 512\r\n  nvinfer1::IConcatenationLayer *cat20 = network->addConcatenation(inputTensors20, 2);\r\n  nvinfer1::IElementWiseLayer *conv21 = C3(network, weightMap, *cat20->getOutput(0), 512, Get_channel(256), get_depth(1, 1), false, 1, 0.5, \"model.21\");\r\n\r\n  nvinfer1::IElementWiseLayer *conv22 = convBlock(network, weightMap, *conv21->getOutput(0), Get_channel(128), 1, 1, 1, \"model.22\");\r\n  nvinfer1::IResizeLayer *upsample23 = network->addResize(*conv22->getOutput(0));\r\n  upsample23->setScales(scale, 3);\r\n  nvinfer1::ITensor *inputTensors24[] = {upsample23->getOutput(0), conv6->getOutput(0)}; // 128 + 128 = 256\r\n  nvinfer1::IConcatenationLayer *cat24 = network->addConcatenation(inputTensors24, 2);\r\n  nvinfer1::IElementWiseLayer *conv25 = C3(network, weightMap, *cat24->getOutput(0), 256, Get_channel(128), get_depth(1, 1), false, 1, 0.5, \"model.25\");\r\n\r\n  nvinfer1::IElementWiseLayer *conv26 = LC_Block(network, weightMap, *conv25->getOutput(0), Get_channel(128), Get_channel(128), 2, 5, \"model.26\", true);\r\n  nvinfer1::ITensor *inputTensor27[] = {conv26->getOutput(0), conv22->getOutput(0)}; // 128 + 128 = 256\r\n  
nvinfer1::IConcatenationLayer *cat27 = network->addConcatenation(inputTensor27, 2);\r\n  nvinfer1::IElementWiseLayer *conv28 = C3(network, weightMap, *cat27->getOutput(0), 256, Get_channel(256), get_depth(1, 1), false, 1, 0.5, \"model.28\");\r\n\r\n  nvinfer1::IElementWiseLayer *conv29 = LC_Block(network, weightMap, *conv28->getOutput(0), Get_channel(256), Get_channel(256), 2, 5, \"model.29\", true);\r\n  nvinfer1::ITensor *inputTensor30[] = {conv29->getOutput(0), conv18->getOutput(0)}; // 256 + 256 = 512\r\n  nvinfer1::IConcatenationLayer *cat30 = network->addConcatenation(inputTensor30, 2);\r\n  nvinfer1::IElementWiseLayer *conv31 = C3(network, weightMap, *cat30->getOutput(0), 512, Get_channel(512), get_depth(1, 1), false, 1, 0.5, \"model.31\");\r\n\r\n    // detect\r\n  nvinfer1::IConvolutionLayer *det0 = network->addConvolutionNd(*conv25->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n      nvinfer1::DimsHW{1, 1}, weightMap[\"model.32.m.0.weight\"], weightMap[\"model.32.m.0.bias\"]);\r\n    \r\n  nvinfer1::IConvolutionLayer *det1 = network->addConvolutionNd(*conv28->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n      nvinfer1::DimsHW{1, 1}, weightMap[\"model.32.m.1.weight\"], weightMap[\"model.32.m.1.bias\"]);\r\n    \r\n  nvinfer1::IConvolutionLayer *det2 = network->addConvolutionNd(*conv31->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n        nvinfer1::DimsHW{1, 1}, weightMap[\"model.32.m.2.weight\"], weightMap[\"model.32.m.2.bias\"]);\r\n    \r\n  auto yolo = addYoLoLayer(network, weightMap, \"model.32\", std::vector<nvinfer1::IConvolutionLayer*>{det0, det1, det2});\r\n  yolo->getOutput(0)->setName(Yolo::OUTPUT_BLOB_NAME);\r\n  network->markOutput(*yolo->getOutput(0));\r\n\r\n      // Engine config\r\n  builder->setMaxBatchSize(maxBatchSize);\r\n  config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\r\n#if defined(USE_FP16)\r\n  config->setFlag(BuilderFlag::kFP16);\r\n#elif defined(USE_INT8)\r\n  std::cout << \"Your platform support int8: \" << 
(builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\r\n  assert(builder->platformHasFastInt8());\r\n  config->setFlag(BuilderFlag::kINT8);\r\n  std::string data_path = \"tensorrtx-int8calib-data/coco_calib/\";\r\n  //Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\r\n  Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, Yolo::INPUT_W, Yolo::INPUT_H, data_path.c_str(), \"int8calib.table\", Yolo::INPUT_BLOB_NAME);\r\n  config->setInt8Calibrator(calibrator);\r\n#endif\r\n\r\n  std::cout << \"Building engine, please wait for a while...\" << std::endl;\r\n  ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\r\n  // buildEngineWithConfig returns nullptr on failure, so check before reporting success\r\n  assert(engine != nullptr && \"Failed to build engine.\");\r\n  std::cout << \"Engine built successfully!\" << std::endl;\r\n\r\n  // Don't need the network any more\r\n  network->destroy();\r\n\r\n  // Release host memory\r\n  for (auto& mem : weightMap) {\r\n    free((void*)(mem.second.values));\r\n  }\r\n\r\n  return engine;\r\n}\r\n\r\n\r\nnvinfer1::ICudaEngine *build_det_v5_lite_e(unsigned int maxBatchSize, nvinfer1::IBuilder *builder, nvinfer1::IBuilderConfig *config,\r\n    nvinfer1::DataType dt, std::string wts_name){\r\n  nvinfer1::INetworkDefinition *network = builder->createNetworkV2(0U);\r\n  nvinfer1::ITensor *data = network->addInput(Yolo::INPUT_BLOB_NAME, dt, nvinfer1::Dims3{3, Yolo::INPUT_W, Yolo::INPUT_H});\r\n  std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_name);\r\n\r\n  // backbone\r\n  nvinfer1::IPoolingLayer *conv0 = conv_bn_relu_maxpool(network, weightMap, *data, 32, \"model.0.\"); //32\r\n  // std::cout << \"Get_channel: \" << Get_channel(116) << std::endl;\r\n  nvinfer1::IShuffleLayer *conv1 = shuffle_block(network, weightMap, *conv0->getOutput(0), \"model.1.\", 32, Get_channel(116), 2); //120\r\n  nvinfer1::IShuffleLayer *conv2_0 = shuffle_block(network, weightMap, *conv1->getOutput(0), \"model.2.0.\", Get_channel(116), 
Get_channel(116), 1); //120\r\n  nvinfer1::IShuffleLayer *conv2_1 = shuffle_block(network, weightMap, *conv2_0->getOutput(0), \"model.2.1.\", Get_channel(116), Get_channel(116), 1); // 120\r\n  nvinfer1::IShuffleLayer *conv2_2 = shuffle_block(network, weightMap, *conv2_1->getOutput(0), \"model.2.2.\", Get_channel(116), Get_channel(116), 1); // 120\r\n  nvinfer1::IShuffleLayer *conv3 = shuffle_block(network, weightMap, *conv2_2->getOutput(0), \"model.3.\", Get_channel(116), Get_channel(232), 2); // 232\r\n  nvinfer1::IShuffleLayer *conv4_0 = shuffle_block(network, weightMap, *conv3->getOutput(0), \"model.4.0.\", Get_channel(232), Get_channel(232), 1); // 232 \r\n  nvinfer1::IShuffleLayer *conv4_1 = shuffle_block(network, weightMap, *conv4_0->getOutput(0), \"model.4.1.\", Get_channel(232), Get_channel(232), 1); // 232\r\n  nvinfer1::IShuffleLayer *conv4_2 = shuffle_block(network, weightMap, *conv4_1->getOutput(0), \"model.4.2.\", Get_channel(232), Get_channel(232), 1); // 232\r\n  nvinfer1::IShuffleLayer *conv4_3 = shuffle_block(network, weightMap, *conv4_2->getOutput(0), \"model.4.3.\", Get_channel(232), Get_channel(232), 1); // 232\r\n  nvinfer1::IShuffleLayer *conv4_4 = shuffle_block(network, weightMap, *conv4_3->getOutput(0), \"model.4.4.\", Get_channel(232), Get_channel(232), 1); //232\r\n  nvinfer1::IShuffleLayer *conv4_5 = shuffle_block(network, weightMap, *conv4_4->getOutput(0), \"model.4.5.\", Get_channel(232), Get_channel(232), 1);\r\n  nvinfer1::IShuffleLayer *conv4_6 = shuffle_block(network, weightMap, *conv4_5->getOutput(0), \"model.4.6.\", Get_channel(232), Get_channel(232), 1); // 232\r\n  nvinfer1::IShuffleLayer *conv5 = shuffle_block(network, weightMap, *conv4_6->getOutput(0), \"model.5.\", Get_channel(232), Get_channel(464), 2); //464 \r\n  nvinfer1::IShuffleLayer *conv6 = shuffle_block(network, weightMap, *conv5->getOutput(0), \"model.6.\", Get_channel(464), Get_channel(464), 1); // 464\r\n\r\n  // neck\r\n  float scale[] = {1.0, 2.0, 2.0};\r\n  
nvinfer1::IElementWiseLayer *conv7 = convBlock(network, weightMap, *conv6->getOutput(0), Get_channel(96), 1, 1, 1, \"model.7\"); // 96\r\n  nvinfer1::IResizeLayer *upsample8 = network->addResize(*conv7->getOutput(0));\r\n  upsample8->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\r\n  upsample8->setScales(scale, 3);\r\n  nvinfer1::ITensor *inputTensors9[] = {upsample8->getOutput(0), conv4_6->getOutput(0)};\r\n  nvinfer1::IConcatenationLayer *cat9 = network->addConcatenation(inputTensors9, 2); //  96 + 232 = 328\r\n  nvinfer1::IActivationLayer *conv10 = DWConvblock(network, weightMap, *cat9->getOutput(0), \"model.10\", 328, Get_channel(96), 3, 1);\r\n\r\n  nvinfer1::IElementWiseLayer *conv11 = convBlock(network, weightMap, *conv10->getOutput(0), Get_channel(96), 1, 1, 1, \"model.11\"); // 96\r\n  nvinfer1::IResizeLayer *upsample12 = network->addResize(*conv11->getOutput(0));\r\n  upsample12->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\r\n  upsample12->setScales(scale, 3);\r\n  nvinfer1::ITensor *inputTensors13[] = {upsample12->getOutput(0), conv2_2->getOutput(0)}; // 96 + 120 \r\n  nvinfer1::IConcatenationLayer *cat13 = network->addConcatenation(inputTensors13, 2);\r\n  nvinfer1::IActivationLayer *conv14 = DWConvblock(network, weightMap, *cat13->getOutput(0), \"model.14\", 216, Get_channel(96), 3, 1);\r\n\r\n  nvinfer1::IActivationLayer *conv15 = DWConvblock(network, weightMap, *conv14->getOutput(0), \"model.15\", Get_channel(96), Get_channel(96), 3, 2);\r\n  nvinfer1::IElementWiseLayer *add16 = ADD(network, *conv15->getOutput(0), *conv11->getOutput(0), 1.0);\r\n  nvinfer1::IActivationLayer *conv17 = DWConvblock(network, weightMap, *add16->getOutput(0), \"model.17\", Get_channel(96), Get_channel(96), 3, 1);\r\n\r\n  nvinfer1::IActivationLayer *conv18 = DWConvblock(network, weightMap, *conv17->getOutput(0), \"model.18\", Get_channel(96), Get_channel(96), 3, 2);\r\n  nvinfer1::IElementWiseLayer *add19 = ADD(network, *conv18->getOutput(0), *conv7->getOutput(0), 
1.0);\r\n  nvinfer1::IActivationLayer *conv20 = DWConvblock(network, weightMap, *add19->getOutput(0), \"model.20\", Get_channel(96), Get_channel(96), 3, 1);\r\n\r\n\r\n\r\n  // detect\r\n  nvinfer1::IConvolutionLayer *det0 = network->addConvolutionNd(*conv14->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n      nvinfer1::DimsHW{1, 1}, weightMap[\"model.21.m.0.weight\"], weightMap[\"model.21.m.0.bias\"]);\r\n    \r\n  nvinfer1::IConvolutionLayer *det1 = network->addConvolutionNd(*conv17->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n      nvinfer1::DimsHW{1, 1}, weightMap[\"model.21.m.1.weight\"], weightMap[\"model.21.m.1.bias\"]);\r\n    \r\n  nvinfer1::IConvolutionLayer *det2 = network->addConvolutionNd(*conv20->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n        nvinfer1::DimsHW{1, 1}, weightMap[\"model.21.m.2.weight\"], weightMap[\"model.21.m.2.bias\"]);\r\n    \r\n  auto yolo = addYoLoLayer(network, weightMap, \"model.21\", std::vector<nvinfer1::IConvolutionLayer*>{det0, det1, det2});\r\n  yolo->getOutput(0)->setName(Yolo::OUTPUT_BLOB_NAME);\r\n  network->markOutput(*yolo->getOutput(0));\r\n\r\n      // Engine config\r\n  builder->setMaxBatchSize(maxBatchSize);\r\n  config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\r\n#if defined(USE_FP16)\r\n  config->setFlag(BuilderFlag::kFP16);\r\n#elif defined(USE_INT8)\r\n  std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\r\n  assert(builder->platformHasFastInt8());\r\n  config->setFlag(BuilderFlag::kINT8);\r\n  std::string data_path = \"tensorrtx-int8calib-data/coco_calib/\";\r\n  //Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\r\n  Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, Yolo::INPUT_W, Yolo::INPUT_H, data_path.c_str(), \"int8calib.table\", Yolo::INPUT_BLOB_NAME);\r\n  config->setInt8Calibrator(calibrator);\r\n#endif\r\n\r\n  std::cout << \"Building engine, please wait for a while...\" << std::endl;\r\n  ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\r\n  std::cout << \"Build engine successfully!\" << std::endl;\r\n\r\n  // Don't need the network any more\r\n  network->destroy();\r\n\r\n  // Release host memory\r\n  for (auto& mem : weightMap) {\r\n    free((void*)(mem.second.values));\r\n  }\r\n\r\n  return engine;\r\n}\r\n\r\n\r\nnvinfer1::ICudaEngine *build_det_v5_lite_g(unsigned int maxBatchSize, nvinfer1::IBuilder *builder, nvinfer1::IBuilderConfig *config, \r\n                  nvinfer1::DataType dt,  std::string wts_name){\r\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\r\n\r\n    // backbone\r\n    nvinfer1::ITensor *data = network->addInput(Yolo::INPUT_BLOB_NAME, dt, nvinfer1::Dims3{3, Yolo::INPUT_H, Yolo::INPUT_W});\r\n    assert(data);\r\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_name);\r\n    nvinfer1::IElementWiseLayer *conv0 = focus(network, weightMap, *data, 3, Get_channel(32), 3, \"model.0\"); // 32\r\n    nvinfer1::IActivationLayer *conv1 = RepVGGBlock(network, weightMap, *conv0->getOutput(0), \"model.1\", Get_channel(64), 3, 2, 1); //64\r\n    nvinfer1::IElementWiseLayer *conv2 = C3(network, weightMap, *conv1->getOutput(0), Get_channel(64), Get_channel(64), get_depth(1, 1), true, 1, 0.5, \"model.2\"); // 64\r\n    
nvinfer1::IActivationLayer *conv3 = RepVGGBlock(network, weightMap, *conv2->getOutput(0), \"model.3\", Get_channel(128), 3, 2, 1); // 128\r\n    nvinfer1::IElementWiseLayer *conv4 = C3(network, weightMap, *conv3->getOutput(0), Get_channel(128), Get_channel(128), get_depth(3, 1), true, 1, 0.5, \"model.4\"); // 128\r\n    nvinfer1::IActivationLayer *conv5 = RepVGGBlock(network, weightMap, *conv4->getOutput(0), \"model.5\", Get_channel(256), 3, 2, 1); // 256\r\n    nvinfer1::IElementWiseLayer *conv6 = C3(network, weightMap, *conv5->getOutput(0), Get_channel(256), Get_channel(256), get_depth(3, 1), true, 1, 0.5, \"model.6\"); // 256\r\n    nvinfer1::IActivationLayer *conv7 = RepVGGBlock(network, weightMap, *conv6->getOutput(0), \"model.7\", Get_channel(512), 3, 2, 1); // 512\r\n    nvinfer1::IElementWiseLayer *conv8 = SPP(network, weightMap, *conv7->getOutput(0), Get_channel(512), Get_channel(512), 5, 9, 13, \"model.8\"); // 512\r\n    nvinfer1::IElementWiseLayer *conv9 = C3(network, weightMap, *conv8->getOutput(0), Get_channel(512), Get_channel(512), get_depth(1, 1), false, 1, 0.5, \"model.9\"); // 512\r\n    \r\n\r\n    float scale[] = {1.0, 2.0, 2.0};\r\n    nvinfer1::IElementWiseLayer *conv10 = convBlock(network, weightMap, *conv9->getOutput(0), Get_channel(128), 1, 1, 1, \"model.10\"); // 128\r\n    nvinfer1::IResizeLayer *upsample11 = network->addResize(*conv10->getOutput(0));\r\n    upsample11->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\r\n    upsample11->setScales(scale, 3);\r\n    nvinfer1::ITensor *inputTensors12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\r\n    nvinfer1::IConcatenationLayer *cat12 = network->addConcatenation(inputTensors12, 2); // 384\r\n    nvinfer1::IElementWiseLayer *conv13 = C3(network, weightMap, *cat12->getOutput(0), 384, Get_channel(128), get_depth(3, 1), false, 1, 0.5, \"model.13\");\r\n\r\n    nvinfer1::IElementWiseLayer *conv14 = convBlock(network, weightMap, *conv13->getOutput(0), Get_channel(128), 1, 1, 1, 
\"model.14\"); // 128\r\n    nvinfer1::IResizeLayer *upsample15 = network->addResize(*conv14->getOutput(0));\r\n    upsample15->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\r\n    upsample15->setScales(scale, 3);\r\n    nvinfer1::ITensor *inputTensors16[] = {upsample15->getOutput(0), conv4->getOutput(0)}; //  128+128\r\n    nvinfer1::IConcatenationLayer *cat16 = network->addConcatenation(inputTensors16, 2);\r\n    nvinfer1::IElementWiseLayer *conv17 = C3(network, weightMap, *cat16->getOutput(0), 256, Get_channel(128), get_depth(3, 1), false, 1, 0.5, \"model.17\");\r\n\r\n    nvinfer1::IElementWiseLayer *conv18 = convBlock(network, weightMap, *conv17->getOutput(0), Get_channel(128), 3, 2, 1, \"model.18\"); // 128\r\n    nvinfer1::ITensor *inputTensors19[] = {conv18->getOutput(0), conv14->getOutput(0)};\r\n    nvinfer1::IConcatenationLayer *cat19 = network->addConcatenation(inputTensors19, 2); // 128 + 128\r\n    nvinfer1::IElementWiseLayer *conv20 = C3(network, weightMap, *cat19->getOutput(0), 256, Get_channel(128), get_depth(3, 1), false, 1, 0.5, \"model.20\");\r\n\r\n    nvinfer1::IElementWiseLayer *conv21 = convBlock(network, weightMap, *conv20->getOutput(0), Get_channel(128), 3, 2, 1, \"model.21\"); // 128\r\n    nvinfer1::ITensor *inputTensors22[] = {conv21->getOutput(0), conv10->getOutput(0)}; \r\n    nvinfer1::IConcatenationLayer *cat22 = network->addConcatenation(inputTensors22, 2); // 128 + 128\r\n    nvinfer1::IElementWiseLayer *conv23 = C3(network, weightMap, *cat22->getOutput(0), 256, Get_channel(128), get_depth(3, 1), false, 1, 0.5, \"model.23\");\r\n\r\n      // detect\r\n    nvinfer1::IConvolutionLayer *det0 = network->addConvolutionNd(*conv17->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n      nvinfer1::DimsHW{1, 1}, weightMap[\"model.24.m.0.weight\"], weightMap[\"model.24.m.0.bias\"]);\r\n    \r\n    nvinfer1::IConvolutionLayer *det1 = network->addConvolutionNd(*conv20->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n      nvinfer1::DimsHW{1, 1}, 
weightMap[\"model.24.m.1.weight\"], weightMap[\"model.24.m.1.bias\"]);\r\n    \r\n    nvinfer1::IConvolutionLayer *det2 = network->addConvolutionNd(*conv23->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n        nvinfer1::DimsHW{1, 1}, weightMap[\"model.24.m.2.weight\"], weightMap[\"model.24.m.2.bias\"]);\r\n    \r\n    auto yolo = addYoLoLayer(network, weightMap, \"model.24\", std::vector<nvinfer1::IConvolutionLayer*>{det0, det1, det2});\r\n    yolo->getOutput(0)->setName(Yolo::OUTPUT_BLOB_NAME);\r\n    network->markOutput(*yolo->getOutput(0));\r\n\r\n      // Engine config\r\n  builder->setMaxBatchSize(maxBatchSize);\r\n  config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\r\n#if defined(USE_FP16)\r\n  config->setFlag(BuilderFlag::kFP16);\r\n#elif defined(USE_INT8)\r\n  std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\r\n  assert(builder->platformHasFastInt8());\r\n  config->setFlag(BuilderFlag::kINT8);\r\n  std::string data_path = \"tensorrtx-int8calib-data/coco_calib/\";\r\n  //Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\r\n  Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, Yolo::INPUT_W, Yolo::INPUT_H, data_path.c_str(), \"int8calib.table\", Yolo::INPUT_BLOB_NAME);\r\n  config->setInt8Calibrator(calibrator);\r\n#endif\r\n\r\n  std::cout << \"Building engine, please wait for a while...\" << std::endl;\r\n  ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\r\n  std::cout << \"Build engine successfully!\" << std::endl;\r\n\r\n  // Don't need the network any more\r\n  network->destroy();\r\n\r\n  // Release host memory\r\n  for (auto& mem : weightMap) {\r\n    free((void*)(mem.second.values));\r\n  }\r\n\r\n  return engine;\r\n}\r\n\r\n\r\n\r\n\r\nnvinfer1::ICudaEngine *build_det_v5_lite_s(unsigned int maxBatchSize, nvinfer1::IBuilder *builder, 
nvinfer1::IBuilderConfig *config, nvinfer1::DataType dt,std::string & wts_name){\r\n  // backbone\r\n  nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\r\n  nvinfer1::ITensor *data = network->addInput(Yolo::INPUT_BLOB_NAME, dt, nvinfer1::Dims3{3, Yolo::INPUT_H, Yolo::INPUT_W});\r\n  assert(data);\r\n  std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_name);\r\n  nvinfer1::IPoolingLayer *conv0 = conv_bn_relu_maxpool(network, weightMap, *data, 32, \"model.0.\");\r\n  std::cout << \"Get_channel: \" << Get_channel(116) << std::endl;\r\n  nvinfer1::IShuffleLayer *conv1 = shuffle_block(network, weightMap, *conv0->getOutput(0), \"model.1.\", 32, Get_channel(116), 2);\r\n  nvinfer1::IShuffleLayer *conv2_0 = shuffle_block(network, weightMap, *conv1->getOutput(0), \"model.2.0.\", Get_channel(116), Get_channel(116), 1);\r\n  nvinfer1::IShuffleLayer *conv2_1 = shuffle_block(network, weightMap, *conv2_0->getOutput(0), \"model.2.1.\", Get_channel(116), Get_channel(116), 1);\r\n  nvinfer1::IShuffleLayer *conv2_2 = shuffle_block(network, weightMap, *conv2_1->getOutput(0), \"model.2.2.\", Get_channel(116), Get_channel(116), 1);\r\n  nvinfer1::IShuffleLayer *conv3 = shuffle_block(network, weightMap, *conv2_2->getOutput(0), \"model.3.\", Get_channel(116), Get_channel(232), 2);\r\n  nvinfer1::IShuffleLayer *conv4_0 = shuffle_block(network, weightMap, *conv3->getOutput(0), \"model.4.0.\", Get_channel(232), Get_channel(232), 1);\r\n  nvinfer1::IShuffleLayer *conv4_1 = shuffle_block(network, weightMap, *conv4_0->getOutput(0), \"model.4.1.\", Get_channel(232), Get_channel(232), 1);\r\n  nvinfer1::IShuffleLayer *conv4_2 = shuffle_block(network, weightMap, *conv4_1->getOutput(0), \"model.4.2.\", Get_channel(232), Get_channel(232), 1);\r\n  nvinfer1::IShuffleLayer *conv4_3 = shuffle_block(network, weightMap, *conv4_2->getOutput(0), \"model.4.3.\", Get_channel(232), Get_channel(232), 1);\r\n  nvinfer1::IShuffleLayer *conv4_4 = shuffle_block(network, 
weightMap, *conv4_3->getOutput(0), \"model.4.4.\", Get_channel(232), Get_channel(232), 1);\r\n  nvinfer1::IShuffleLayer *conv4_5 = shuffle_block(network, weightMap, *conv4_4->getOutput(0), \"model.4.5.\", Get_channel(232), Get_channel(232), 1);\r\n  nvinfer1::IShuffleLayer *conv4_6 = shuffle_block(network, weightMap, *conv4_5->getOutput(0), \"model.4.6.\", Get_channel(232), Get_channel(232), 1);\r\n  nvinfer1::IShuffleLayer *conv5 = shuffle_block(network, weightMap, *conv4_6->getOutput(0), \"model.5.\", Get_channel(232), Get_channel(464), 2);\r\n  nvinfer1::IShuffleLayer *conv6_0 = shuffle_block(network, weightMap, *conv5->getOutput(0), \"model.6.0.\", Get_channel(464), Get_channel(464), 1);\r\n  nvinfer1::IShuffleLayer *conv6_1 = shuffle_block(network, weightMap, *conv6_0->getOutput(0), \"model.6.1.\", Get_channel(464), Get_channel(464), 1);\r\n  nvinfer1::IShuffleLayer *conv6_2 = shuffle_block(network, weightMap, *conv6_1->getOutput(0), \"model.6.2.\", Get_channel(464), Get_channel(464), 1);\r\n\r\n  // head\r\n  float scale[] = {1.0, 2.0, 2.0};\r\n  nvinfer1::IElementWiseLayer *conv7 = convBlock(network, weightMap, *conv6_2->getOutput(0), Get_channel(128), 1, 1, 1, \"model.7\");\r\n  nvinfer1::IResizeLayer *upsample8 = network->addResize(*conv7->getOutput(0));\r\n  upsample8->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\r\n  upsample8->setScales(scale, 3);\r\n  assert(upsample8);\r\n  nvinfer1::ITensor *inputTensors9[] = {upsample8->getOutput(0), conv4_6->getOutput(0)}; // channels = 128 + 232 = 360\r\n  nvinfer1::IConcatenationLayer *cat9 = network->addConcatenation(inputTensors9, 2);\r\n  // std::cout << \"The c3 's n is \" << get_depth(3, 1) << std::endl;\r\n  nvinfer1::IElementWiseLayer *conv10 = C3(network, weightMap, *cat9->getOutput(0), 360, Get_channel(128), get_depth(1, 1), false, 1, 0.5, \"model.10\");\r\n\r\n  nvinfer1::IElementWiseLayer *conv11 = convBlock(network, weightMap, *conv10->getOutput(0), Get_channel(64), 1, 1, 1, \"model.11\");\r\n  
nvinfer1::IResizeLayer *upsample12 = network->addResize(*conv11->getOutput(0));\r\n  upsample12->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\r\n  upsample12->setScales(scale, 3);\r\n  assert(upsample12);\r\n  nvinfer1::ITensor *inputTensors13[] = {upsample12->getOutput(0), conv2_2->getOutput(0)}; // 64 + 120 = 184\r\n  nvinfer1::IConcatenationLayer *cat13 = network->addConcatenation(inputTensors13, 2);\r\n  nvinfer1::IElementWiseLayer *conv14 = C3(network, weightMap, *cat13->getOutput(0), 184, Get_channel(64), get_depth(1, 1), false, 1, 0.5, \"model.14\");\r\n\r\n  nvinfer1::IElementWiseLayer *conv15 = convBlock(network, weightMap, *conv14->getOutput(0), Get_channel(64), 3, 2, 1, \"model.15\");\r\n  nvinfer1::ITensor *inputTensors16[] = {conv15->getOutput(0), conv11->getOutput(0)}; // 64 + 64 = 128\r\n  nvinfer1::IConcatenationLayer *cat16 = network->addConcatenation(inputTensors16, 2); \r\n  nvinfer1::IElementWiseLayer *conv17 = C3(network, weightMap, *cat16->getOutput(0), 128, Get_channel(128), get_depth(1, 1), false, 1, 0.5, \"model.17\");\r\n\r\n  nvinfer1::IElementWiseLayer *conv18 = convBlock(network, weightMap, *conv17->getOutput(0), Get_channel(128), 3, 2, 1, \"model.18\");\r\n  nvinfer1::ITensor *inputTensors19[] = {conv18->getOutput(0), conv7->getOutput(0)}; // 128 + 128 = 256\r\n  nvinfer1::IConcatenationLayer *cat19 = network->addConcatenation(inputTensors19, 2); \r\n  nvinfer1::IElementWiseLayer *conv20 = C3(network, weightMap, *cat19->getOutput(0), 256, Get_channel(256), get_depth(1, 1), false, 1, 0.5, \"model.20\");\r\n\r\n  // detect\r\n  nvinfer1::IConvolutionLayer *det0 = network->addConvolutionNd(*conv14->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n     nvinfer1::DimsHW{1, 1}, weightMap[\"model.21.m.0.weight\"], weightMap[\"model.21.m.0.bias\"]);\r\n  \r\n  nvinfer1::IConvolutionLayer *det1 = network->addConvolutionNd(*conv17->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n    nvinfer1::DimsHW{1, 1}, weightMap[\"model.21.m.1.weight\"], 
weightMap[\"model.21.m.1.bias\"]);\r\n  \r\n  nvinfer1::IConvolutionLayer *det2 = network->addConvolutionNd(*conv20->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), \r\n      nvinfer1::DimsHW{1, 1}, weightMap[\"model.21.m.2.weight\"], weightMap[\"model.21.m.2.bias\"]);\r\n  \r\n  auto yolo = addYoLoLayer(network, weightMap, \"model.21\", std::vector<nvinfer1::IConvolutionLayer*>{det0, det1, det2});\r\n  yolo->getOutput(0)->setName(Yolo::OUTPUT_BLOB_NAME);\r\n  network->markOutput(*yolo->getOutput(0));\r\n\r\n    // Engine config\r\n  builder->setMaxBatchSize(maxBatchSize);\r\n  config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\r\n#if defined(USE_FP16)\r\n  config->setFlag(BuilderFlag::kFP16);\r\n#elif defined(USE_INT8)\r\n  std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\r\n  assert(builder->platformHasFastInt8());\r\n  config->setFlag(BuilderFlag::kINT8);\r\n  std::string data_path = \"tensorrtx-int8calib-data/coco_calib/\";\r\n  //Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\r\n  Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, Yolo::INPUT_W, Yolo::INPUT_H, data_path.c_str(), \"int8calib.table\", Yolo::INPUT_BLOB_NAME);\r\n  config->setInt8Calibrator(calibrator);\r\n#endif\r\n\r\n  std::cout << \"Building engine, please wait for a while...\" << std::endl;\r\n  ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);\r\n  std::cout << \"Build engine successfully!\" << std::endl;\r\n\r\n  // Don't need the network any more\r\n  network->destroy();\r\n\r\n  // Release host memory\r\n  for (auto& mem : weightMap) {\r\n    free((void*)(mem.second.values));\r\n  }\r\n\r\n  return engine;\r\n}\r\n\r\n\r\n\r\n\r\nvoid serialize_engine(unsigned int max_batchsize, std::string& wts_name, std::string& engine_name, std::string & used_model){\r\n  \r\n  IBuilder* builder = 
createInferBuilder(gLogger);\r\n  IBuilderConfig* config = builder->createBuilderConfig();\r\n\r\n  ICudaEngine *engine = nullptr;\r\n  if(used_model == \"g\"){\r\n    engine = build_det_v5_lite_g(max_batchsize, builder, config, nvinfer1::DataType::kFLOAT, wts_name);\r\n  }else if(used_model == \"s\"){\r\n    engine = build_det_v5_lite_s(max_batchsize, builder, config, nvinfer1::DataType::kFLOAT, wts_name);\r\n  }else if(used_model == \"c\"){\r\n    engine = build_det_v5_lite_c(max_batchsize, builder, config, nvinfer1::DataType::kFLOAT, wts_name);\r\n  }\r\n  else{\r\n    engine = build_det_v5_lite_e(max_batchsize, builder, config, nvinfer1::DataType::kFLOAT, wts_name);\r\n  }\r\n  assert(engine != nullptr);\r\n\r\n  // Serialize the engine\r\n  IHostMemory* serialized_engine = engine->serialize();\r\n  assert(serialized_engine != nullptr);\r\n\r\n  // Save engine to file; skip the write if the file could not be opened\r\n  std::ofstream p(engine_name, std::ios::binary);\r\n  if (!p) {\r\n    std::cerr << \"Could not open plan output file\" << std::endl;\r\n  } else {\r\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\r\n  }\r\n\r\n  // Close everything down\r\n  engine->destroy();\r\n  config->destroy();\r\n  serialized_engine->destroy();\r\n  builder->destroy();\r\n}\r\n\r\n\r\n\r\n\r\nvoid doInference(IExecutionContext& context, cudaStream_t& stream, void **buffers, float* input, float* output, int batchSize) {\r\n    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host\r\n    CUDA_CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 * Yolo::INPUT_H * Yolo::INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));\r\n    context.enqueue(batchSize, buffers, stream, nullptr);\r\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));\r\n    cudaStreamSynchronize(stream);\r\n}\r\n\r\nbool parse_args(int argc, char **argv, std::string & wts_name, std::string & 
engine_name,\r\n                                   std::string & used_model, std::string & img_dir){\r\n  if(argc < 4 || argc > 6) return false;\r\n  if(std::string(argv[1]) == \"-s\" && (argc == 5)){\r\n    wts_name = argv[2];\r\n    engine_name = argv[3];\r\n    used_model = argv[4];\r\n  }else if(std::string(argv[1]) == \"-d\" && argc == 4){\r\n    engine_name = std::string(argv[2]);\r\n    img_dir = std::string(argv[3]);\r\n  }else{\r\n    return false;\r\n  }\r\n\r\n  return true;\r\n}\r\n\r\nint main(int argc, char** argv) {\r\n    cudaSetDevice(Yolo::DEVICE);\r\n\r\n    std::string wts_name = \"\";\r\n    std::string engine_name = \"\";\r\n    std::string img_dir, used_model;\r\n    \r\n\r\n    if(!parse_args(argc, argv, wts_name, engine_name, used_model, img_dir)){\r\n      std::cerr << \"Invalid arguments!\" << std::endl;\r\n      std::cerr << \"./v5lite -s [.wts] [.engine] [s/e/g/c] // serialize model to a plan file\" << std::endl;\r\n      std::cerr << \"./v5lite -d [.engine] ../images  // deserialize plan file and run inference\" << std::endl;\r\n      return -1;  \r\n    }\r\n\r\n    if (!wts_name.empty()) {\r\n        serialize_engine(Yolo::BATCH_SIZE,  wts_name, engine_name, used_model);\r\n        return 0;\r\n    }\r\n\r\n    // deserialize the .engine and run inference\r\n    std::ifstream file(engine_name, std::ios::binary);\r\n    if (!file.good()) {\r\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\r\n        return -1;\r\n    }\r\n    char *trtModelStream = nullptr;\r\n    size_t size = 0;\r\n    file.seekg(0, file.end);\r\n    size = file.tellg();\r\n    file.seekg(0, file.beg);\r\n    trtModelStream = new char[size];\r\n    assert(trtModelStream);\r\n    file.read(trtModelStream, size);\r\n    file.close();\r\n\r\n    std::vector<std::string> file_names;\r\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\r\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\r\n        return -1;\r\n    
}\r\n\r\n    // prepare input data ---------------------------\r\n    static float data[Yolo::BATCH_SIZE * 3 * Yolo::INPUT_H * Yolo::INPUT_W];\r\n    //for (int i = 0; i < 3 * INPUT_H * INPUT_W; i++)\r\n    //    data[i] = 1.0;\r\n    static float prob[Yolo::BATCH_SIZE * OUTPUT_SIZE];\r\n    IRuntime* runtime = createInferRuntime(gLogger);\r\n    assert(runtime != nullptr);\r\n    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);\r\n    assert(engine != nullptr);\r\n    IExecutionContext* context = engine->createExecutionContext();\r\n    assert(context != nullptr);\r\n    delete[] trtModelStream;\r\n    assert(engine->getNbBindings() == 2);\r\n    void* buffers[2];\r\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\r\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\r\n    const int inputIndex = engine->getBindingIndex(Yolo::INPUT_BLOB_NAME);\r\n    const int outputIndex = engine->getBindingIndex(Yolo::OUTPUT_BLOB_NAME);\r\n    assert(inputIndex == 0);\r\n    assert(outputIndex == 1);\r\n    // Create GPU buffers on device\r\n    CUDA_CHECK(cudaMalloc(&buffers[inputIndex], Yolo::BATCH_SIZE * 3 * Yolo::INPUT_H * Yolo::INPUT_W * sizeof(float)));\r\n    CUDA_CHECK(cudaMalloc(&buffers[outputIndex], Yolo::BATCH_SIZE * OUTPUT_SIZE * sizeof(float)));\r\n    // Create stream\r\n    cudaStream_t stream;\r\n    CUDA_CHECK(cudaStreamCreate(&stream));\r\n\r\n    int fcount = 0;\r\n    for (int f = 0; f < (int)file_names.size(); f++) {\r\n        fcount++;\r\n        if (fcount < Yolo::BATCH_SIZE && f + 1 != (int)file_names.size()) continue;\r\n        for (int b = 0; b < fcount; b++) {\r\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[f - fcount + 1 + b]);\r\n            if (img.empty()) continue;\r\n            cv::Mat pr_img = preprocess_img(img, Yolo::INPUT_W, Yolo::INPUT_H); // letterbox BGR to RGB\r\n            int i = 0;\r\n            for (int 
row = 0; row < Yolo::INPUT_H; ++row) {\r\n                uchar* uc_pixel = pr_img.data + row * pr_img.step;\r\n                for (int col = 0; col < Yolo::INPUT_W; ++col) {\r\n                    data[b * 3 * Yolo::INPUT_H * Yolo::INPUT_W + i] = (float)uc_pixel[2] / 255.0;\r\n                    data[b * 3 * Yolo::INPUT_H * Yolo::INPUT_W + i + Yolo::INPUT_H * Yolo::INPUT_W] = (float)uc_pixel[1] / 255.0;\r\n                    data[b * 3 * Yolo::INPUT_H * Yolo::INPUT_W + i + 2 * Yolo::INPUT_H * Yolo::INPUT_W] = (float)uc_pixel[0] / 255.0;\r\n                    uc_pixel += 3;\r\n                    ++i;\r\n                }\r\n            }\r\n        }\r\n\r\n        // Run inference\r\n        auto start = std::chrono::system_clock::now();\r\n        doInference(*context, stream, buffers, data, prob, Yolo::BATCH_SIZE);\r\n        auto end = std::chrono::system_clock::now();\r\n        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\r\n        std::vector<std::vector<Yolo::Detection>> batch_res(fcount);\r\n        for (int b = 0; b < fcount; b++) {\r\n            auto& res = batch_res[b];\r\n            nms(res, &prob[b * OUTPUT_SIZE], Yolo::CONF_THRESH, Yolo::NMS_THRESH);\r\n        }\r\n        for (int b = 0; b < fcount; b++) {\r\n            auto& res = batch_res[b];\r\n            //std::cout << res.size() << std::endl;\r\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[f - fcount + 1 + b]);\r\n            for (size_t j = 0; j < res.size(); j++) {\r\n                cv::Rect r = get_rect(img, res[j].bbox);\r\n                cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\r\n                cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 2);\r\n            }\r\n            cv::imwrite(file_names[f - fcount + 1 + b], img);\r\n        }\r\n        fcount = 0;\r\n    }\r\n\r\n    // 
Release stream and buffers\r\n    cudaStreamDestroy(stream);\r\n    CUDA_CHECK(cudaFree(buffers[inputIndex]));\r\n    CUDA_CHECK(cudaFree(buffers[outputIndex]));\r\n    // Destroy the engine\r\n    context->destroy();\r\n    engine->destroy();\r\n    runtime->destroy();\r\n\r\n    // Print histogram of the output distribution\r\n    // std::cout << \"\\nOutput:\\n\\n\";\r\n    // for (unsigned int i = 0; i < OUTPUT_SIZE; i++)\r\n    // {\r\n    //    std::cout << prob[i] << \", \";\r\n    //    if (i % 10 == 0) std::cout << std::endl;\r\n    // }\r\n    // std::cout << std::endl;\r\n\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "yolov5-lite/yololayer.cu",
    "content": "#include <assert.h>\n#include <vector>\n#include <iostream>\n#include \"yololayer.h\"\n#include \"cuda_utils.h\"\n\nnamespace Tn\n{\n    template<typename T> \n    void write(char*& buffer, const T& val)\n    {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T> \n    void read(const char*& buffer, T& val)\n    {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n}\n\nusing namespace Yolo;\n\nnamespace nvinfer1\n{\n    YoloLayerPlugin::YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const std::vector<Yolo::YoloKernel>& vYoloKernel)\n    {\n        mClassCount = classCount;\n        mYoloV5NetWidth = netWidth;\n        mYoloV5NetHeight = netHeight;\n        mMaxOutObject = maxOut;\n        mYoloKernel = vYoloKernel;\n        mKernelCount = vYoloKernel.size();\n\n        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n        size_t AnchorLen = sizeof(float)* CHECK_COUNT * 2;\n        for (int ii = 0; ii < mKernelCount; ii++)\n        {\n            CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));\n            const auto& yolo = mYoloKernel[ii];\n            CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n        }\n    }\n    YoloLayerPlugin::~YoloLayerPlugin()\n    {\n        for (int ii = 0; ii < mKernelCount; ii++)\n        {\n            CUDA_CHECK(cudaFree(mAnchor[ii]));\n        }\n        CUDA_CHECK(cudaFreeHost(mAnchor));\n    }\n\n    // create the plugin at runtime from a byte stream\n    YoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length)\n    {\n        using namespace Tn;\n        const char *d = reinterpret_cast<const char *>(data), *a = d;\n        read(d, mClassCount);\n        read(d, mThreadCount);\n        read(d, mKernelCount);\n        read(d, mYoloV5NetWidth);\n        read(d, mYoloV5NetHeight);\n        read(d, mMaxOutObject);\n       
 mYoloKernel.resize(mKernelCount);\n        auto kernelSize = mKernelCount * sizeof(YoloKernel);\n        memcpy(mYoloKernel.data(), d, kernelSize);\n        d += kernelSize;\n        CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n        size_t AnchorLen = sizeof(float)* CHECK_COUNT * 2;\n        for (int ii = 0; ii < mKernelCount; ii++)\n        {\n            CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));\n            const auto& yolo = mYoloKernel[ii];\n            CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n        }\n        assert(d == a + length);\n    }\n\n    void YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT\n    {\n        using namespace Tn;\n        char* d = static_cast<char*>(buffer), *a = d;\n        write(d, mClassCount);\n        write(d, mThreadCount);\n        write(d, mKernelCount);\n        write(d, mYoloV5NetWidth);\n        write(d, mYoloV5NetHeight);\n        write(d, mMaxOutObject);\n        auto kernelSize = mKernelCount * sizeof(YoloKernel);\n        memcpy(d, mYoloKernel.data(), kernelSize);\n        d += kernelSize;\n\n        assert(d == a + getSerializationSize());\n    }\n\n    size_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT\n    {\n        return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount) + sizeof(Yolo::YoloKernel) * mYoloKernel.size() + sizeof(mYoloV5NetWidth) + sizeof(mYoloV5NetHeight) + sizeof(mMaxOutObject);\n    }\n\n    int YoloLayerPlugin::initialize() TRT_NOEXCEPT\n    {\n        return 0;\n    }\n\n    Dims YoloLayerPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT\n    {\n        //output the result to channel\n        int totalsize = mMaxOutObject * sizeof(Detection) / sizeof(float);\n\n        return Dims3(totalsize + 1, 1, 1);\n    }\n\n    // Set plugin namespace\n    void YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT\n    {\n  
      mPluginNamespace = pluginNamespace;\n    }\n\n    const char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT\n    {\n        return mPluginNamespace;\n    }\n\n    // Return the DataType of the plugin output at the requested index\n    DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT\n    {\n        return DataType::kFLOAT;\n    }\n\n    // Return true if output tensor is broadcast across a batch.\n    bool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    // Return true if plugin can use input that is broadcast across batch without replication.\n    bool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT\n    {\n        return false;\n    }\n\n    void YoloLayerPlugin::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) TRT_NOEXCEPT\n    {\n    }\n\n    // Attach the plugin object to an execution context and grant the plugin the access to some context resource.\n    void YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT\n    {\n    }\n\n    // Detach the plugin object from its execution context.\n    void YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\n    const char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    void YoloLayerPlugin::destroy() TRT_NOEXCEPT\n    {\n        delete this;\n    }\n\n    // Clone the plugin\n    IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT\n    {\n        YoloLayerPlugin* p = new YoloLayerPlugin(mClassCount, mYoloV5NetWidth, mYoloV5NetHeight, mMaxOutObject, mYoloKernel);\n   
     p->setPluginNamespace(mPluginNamespace);\n        return p;\n    }\n\n    __device__ float Logist(float data) { return 1.0f / (1.0f + expf(-data)); };\n\n    __global__ void CalDetection(const float *input, float *output, int noElements,\n        const int netwidth, const int netheight, int maxoutobject, int yoloWidth, int yoloHeight, const float anchors[CHECK_COUNT * 2], int classes, int outputElem)\n    {\n\n        int idx = threadIdx.x + blockDim.x * blockIdx.x;\n        if (idx >= noElements) return;\n\n        int total_grid = yoloWidth * yoloHeight;\n        int bnIdx = idx / total_grid;\n        idx = idx - total_grid * bnIdx;\n        int info_len_i = 5 + classes;\n        const float* curInput = input + bnIdx * (info_len_i * total_grid * CHECK_COUNT);\n\n        for (int k = 0; k < CHECK_COUNT; ++k) {\n            float box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);\n            if (box_prob < IGNORE_THRESH) continue;\n            int class_id = 0;\n            float max_cls_prob = 0.0;\n            for (int i = 5; i < info_len_i; ++i) {\n                float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);\n                if (p > max_cls_prob) {\n                    max_cls_prob = p;\n                    class_id = i - 5;\n                }\n            }\n            float *res_count = output + bnIdx * outputElem;\n            int count = (int)atomicAdd(res_count, 1);\n            if (count >= maxoutobject) return;\n            char *data = (char*)res_count + sizeof(float) + count * sizeof(Detection);\n            Detection *det = (Detection*)(data);\n\n            int row = idx / yoloWidth;\n            int col = idx % yoloWidth;\n\n            //Location\n            // pytorch:\n            //  y = x[i].sigmoid()\n            //  y[..., 0:2] = (y[..., 0:2] * 2. 
- 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy\n            //  y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n            //  X: (sigmoid(tx) + cx)/FeaturemapW *  netwidth\n            det->bbox[0] = (col - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 0 * total_grid])) * netwidth / yoloWidth;\n            det->bbox[1] = (row - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid])) * netheight / yoloHeight;\n\n            // W: (Pw * e^tw) / FeaturemapW * netwidth\n            // v5: https://github.com/ultralytics/yolov5/issues/471\n            det->bbox[2] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 2 * total_grid]);\n            det->bbox[2] = det->bbox[2] * det->bbox[2] * anchors[2 * k];\n            det->bbox[3] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 3 * total_grid]);\n            det->bbox[3] = det->bbox[3] * det->bbox[3] * anchors[2 * k + 1];\n            det->conf = box_prob * max_cls_prob;\n            det->class_id = class_id;\n        }\n    }\n\n    void YoloLayerPlugin::forwardGpu(const float* const* inputs, float *output, cudaStream_t stream, int batchSize)\n    {\n        int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n        for (int idx = 0; idx < batchSize; ++idx) {\n            CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n        }\n        int numElem = 0;\n        for (unsigned int i = 0; i < mYoloKernel.size(); ++i) {\n            const auto& yolo = mYoloKernel[i];\n            numElem = yolo.width * yolo.height * batchSize;\n            if (numElem < mThreadCount) mThreadCount = numElem;\n\n            //printf(\"Net: %d  %d \\n\", mYoloV5NetWidth, mYoloV5NetHeight);\n            CalDetection << < (numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream >> >\n                (inputs[i], output, numElem, mYoloV5NetWidth, mYoloV5NetHeight, 
mMaxOutObject, yolo.width, yolo.height, (float*)mAnchor[i], mClassCount, outputElem);\n        }\n    }\n\n\n    int YoloLayerPlugin::enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT\n    {\n        forwardGpu((const float* const*)inputs, (float*)outputs[0], stream, batchSize);\n        return 0;\n    }\n\n    PluginFieldCollection YoloPluginCreator::mFC{};\n    std::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\n    YoloPluginCreator::YoloPluginCreator()\n    {\n        mPluginAttributes.clear();\n\n        mFC.nbFields = mPluginAttributes.size();\n        mFC.fields = mPluginAttributes.data();\n    }\n\n    const char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT\n    {\n        return \"YoloLayer_TRT\";\n    }\n\n    const char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT\n    {\n        return \"1\";\n    }\n\n    const PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT\n    {\n        return &mFC;\n    }\n\n    IPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT\n    {\n        assert(fc->nbFields == 2);\n        assert(strcmp(fc->fields[0].name, \"netinfo\") == 0);\n        assert(strcmp(fc->fields[1].name, \"kernels\") == 0);\n        int *p_netinfo = (int*)(fc->fields[0].data);\n        int class_count = p_netinfo[0];\n        int input_w = p_netinfo[1];\n        int input_h = p_netinfo[2];\n        int max_output_object_count = p_netinfo[3];\n        std::vector<Yolo::YoloKernel> kernels(fc->fields[1].length);\n        memcpy(&kernels[0], fc->fields[1].data, kernels.size() * sizeof(Yolo::YoloKernel));\n        YoloLayerPlugin* obj = new YoloLayerPlugin(class_count, input_w, input_h, max_output_object_count, kernels);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n\n    IPluginV2IOExt* 
YoloPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT\n    {\n        // This object will be deleted when the network is destroyed, which will\n        // call YoloLayerPlugin::destroy()\n        YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n        obj->setPluginNamespace(mNamespace.c_str());\n        return obj;\n    }\n}\n\n"
  },
  {
    "path": "yolov5-lite/yolov5-lite-trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\n\n# categories = ['faster']\ncategories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\n                  \"traffic light\",\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\n                  \"frisbee\",\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\n                  \"surfboard\",\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n                  \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\n                  \"cell phone\",\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\n                  \"teddy bear\",\n                  \"hair drier\", \"toothbrush\"]\n    \n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 
0:\n        ret.append(batch)\n    return ret\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov5 project.\n    param: \n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n        line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n        )\n\n\nclass YoLov5TRT(object):\n    \"\"\"\n    description: A YOLOv5 class that warps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n    
    host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('bingding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        engine = self.engine\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n  
      batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid = self.post_process(\n                output[i * 6001: (i + 1) * 6001], batch_origin_h[i], batch_origin_w[i]\n            )\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                # print(\"class:\", categories[int(result_classid[j])])\n                # print(\"probability:\", result_scores[j])\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    
label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        \n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read each image in the batch from its path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n        \n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Generate all-zero images for warm-up\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            raw_bgr_image: a BGR image (numpy array), e.g. as returned by cv2.imread\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate the resized width and height and the paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image with long side while maintaining 
ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n        
    output:     A numpy array like [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a numpy array, each element is the score corresponding to a box\n            result_classid: final class ids, a numpy array, each element is the class id corresponding to a box\n        \"\"\"\n        # Get the number of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, 6))[:num, :]\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        print(\"The length of result_boxes is \", len(result_boxes))\n        return result_boxes, result_scores, result_classid\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: True if the boxes are (x1, y1, x2, y2), False if they are (x, y, w, h)\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width/height to corner coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 
2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None) * \\\n                     np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)\n        # Areas of the two boxes, used to compute the union\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, each row is (center_x, center_y, w, h, conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms, each row is (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Keep the boxes whose score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # Clip the coordinates to the image bounds\n        boxes[:, 0] = np.clip(boxes[:, 
0], 0, origin_w -1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w -1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h -1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h -1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, -1] == boxes[:, -1]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov5_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov5_wrapper = yolov5_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov5_wrapper.infer(self.yolov5_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output',  \"e_\" + filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov5_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov5_wrapper = yolov5_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov5_wrapper.infer(self.yolov5_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, 
time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\n    engine_file_path = \"build/v5lite-g-int8.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n\n    # categories = ['faster']\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\n                  \"traffic light\",\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\n                  \"frisbee\",\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\n                  \"surfboard\",\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n                  \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\n                  \"cell phone\",\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\n                  \"teddy bear\",\n                  \"hair drier\", \"toothbrush\"]\n    \n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov5TRT instance\n    yolov5_wrapper = YoLov5TRT(engine_file_path)\n    try:\n        print('batch size is', yolov5_wrapper.batch_size)\n        \n   
     image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolov5_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov5_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov5_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov5_wrapper.destroy()\n"
  },
  {
    "path": "yolov7/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(yolov7)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nset(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)\nenable_language(CUDA)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\ninclude_directories(${PROJECT_SOURCE_DIR}/plugin)\n\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\nif (CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n  message(\"embed_platform on\")\n  include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n  link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n  message(\"embed_platform off\")\n  # cuda\n  include_directories(/usr/local/cuda/include)\n  link_directories(/usr/local/cuda/lib64)\n\n  # tensorrt\n  include_directories(/home/nvidia/TensorRT-8.2.5.1/include)\n  link_directories(/home/nvidia/TensorRT-8.2.5.1/lib)\nendif()\n\nadd_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/plugin/yololayer.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\nfile(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/src/*.cpp ${PROJECT_SOURCE_DIR}/src/*.cu)\nadd_executable(yolov7 main.cpp ${SRCS})\n\ntarget_link_libraries(yolov7 nvinfer)\ntarget_link_libraries(yolov7 cudart)\ntarget_link_libraries(yolov7 myplugins)\ntarget_link_libraries(yolov7 ${OpenCV_LIBS})\n\n"
  },
  {
    "path": "yolov7/README.md",
    "content": "# YOLOv7\n\nThe Pytorch implementation is [WongKinYiu/yolov7](https://github.com/WongKinYiu/yolov7).\n\nThe tensorrt code is derived from [QIANXUNZDL123/tensorrtx-yolov7](https://github.com/QIANXUNZDL123/tensorrtx-yolov7)\n\n## Contributors\n\n<a href=\"https://github.com/QIANXUNZDL123\"><img src=\"https://avatars.githubusercontent.com/u/46549527?v=4?s=48\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/lindsayshuo\"><img src=\"https://avatars.githubusercontent.com/u/45239466?v=4?s=48\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/wang-xinyu\"><img src=\"https://avatars.githubusercontent.com/u/15235574?s=48&v=4\" width=\"40px;\" alt=\"\"/></a> \n<a href=\"https://github.com/AMIYAMAITY\"><img src=\"https://avatars.githubusercontent.com/u/25117739?s=48&v=4\" width=\"40px;\" alt=\"\"/></a> \n\n## Requirements\n\n- TensorRT 8.0+\n- OpenCV 3.4.0+\n\n## Different versions of yolov7\n\nCurrently, we support yolov7 v0.1\n\n- For yolov7 v0.1, download .pt from [yolov7 release v0.1](https://github.com/WongKinYiu/yolov7/releases/tag/v0.1), then follow how-to-run in current page.\n\n## Config\n\n- Choose the model tiny/v7/x/d6/w6/e6/e6e from command line arguments.\n- Check more configs in [include/config.h](./include/config.h)\n\n## How to Run, yolov7-tiny as example\n\n1. generate .wts from pytorch with .pt, or download .wts from model zoo\n\n```\n// download https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-tiny.pt\ncp {tensorrtx}/yolov7/gen_wts.py {WongKinYiu}/yolov7\ncd {WongKinYiu}/yolov7\npython gen_wts.py\n// a file 'yolov7.wts' will be generated.\n```\n\n2. 
build tensorrtx/yolov7 and run\n\n```\ncd {tensorrtx}/yolov7/\n// update kNumClass in config.h if your model is trained on a custom dataset\nmkdir build\ncd build\ncp {WongKinYiu}/yolov7/yolov7.wts {tensorrtx}/yolov7/build\ncmake ..\nmake\nsudo ./yolov7 -s [.wts] [.engine] [t/v7/x/w6/e6/d6/e6e]  // serialize model to plan file\nsudo ./yolov7 -d [.engine] [image folder]  // deserialize and run inference, the images in [image folder] will be processed.\n// For example, yolov7\nsudo ./yolov7 -s yolov7.wts yolov7.engine v7\nsudo ./yolov7 -d yolov7.engine ../images\n```\n\n3. check the generated images, e.g. _zidane.jpg and _bus.jpg\n\n4. optional: load and run the TensorRT model in Python\n\n```\n// install python-tensorrt, pycuda, etc.\n// ensure the yolov7.engine and libmyplugins.so have been built\npython yolov7_trt.py\n```\n\n## INT8 Quantization\n\n1. Prepare calibration images: you can randomly select about 1000 images from your training set. For COCO, you can also download my calibration images `coco_calib` from [GoogleDrive](https://drive.google.com/drive/folders/1s7jE9DtOngZMzJC1uL307J2MiaGwdRSI?usp=sharing) or [BaiduPan](https://pan.baidu.com/s/1GOm_-JobpyLMAqZWCDUhKg) pwd: a9wh\n\n2. unzip it in yolov7/build\n\n3. define the macro `USE_INT8` in config.h (in place of `USE_FP16`) and run make again\n\n4. serialize the model and test\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/78247927-4d9fac00-751e-11ea-8b1b-704a0aeb3fcf.jpg\" height=\"360px;\">\n</p>\n\n## More Information\n\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx).\n\n"
  },
  {
    "path": "yolov7/gen_wts.py",
    "content": "import sys  # noqa: F401\nimport argparse\nimport os\nimport struct\nimport torch\nfrom utils.torch_utils import select_device\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\n    parser.add_argument('-w', '--weights', required=True, help='Input weights (.pt) file path (required)')\n    parser.add_argument('-o', '--output', help='Output (.wts) file path (optional)')\n    args = parser.parse_args()\n    if not os.path.isfile(args.weights):\n        raise SystemExit('Invalid input file')\n    if not args.output:\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\n    elif os.path.isdir(args.output):\n        args.output = os.path.join(\n            args.output,\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\n    return args.weights, args.output\n\n\npt_file, wts_file = parse_args()\n\n# Initialize\ndevice = select_device('cpu')\n# Load model\nmodel = torch.load(pt_file, map_location=device, weights_only=False)  # Load FP32 weights\nmodel = model['ema' if model.get('ema') else 'model'].float()\n\n# update anchor_grid info\nanchor_grid = model.model[-1].anchors * model.model[-1].stride[..., None, None]\n# model.model[-1].anchor_grid = anchor_grid\ndelattr(model.model[-1], 'anchor_grid')  # model.model[-1] is detect layer\n# The parameters are saved in the OrderDict through the \"register_buffer\" method, and then saved to the weight.\nmodel.model[-1].register_buffer(\"anchor_grid\", anchor_grid)\n\nmodel.to(device).eval()\n\nwith open(wts_file, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolov7/include/block.h",
    "content": "#pragma once\n\n#include \"NvInfer.h\"\n#include <string>\n#include <vector>\n#include <map>\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file);\n\nnvinfer1::IElementWiseLayer* convBnSilu(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c2, int k, int s, int p, std::string lname);\n\nnvinfer1::ILayer* ReOrg(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int inch);\n\nnvinfer1::ILayer* DownC(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1, int c2, const std::string& lname);\n\nnvinfer1::IElementWiseLayer* SPPCSPC(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c2, const std::string& lname);\n\nnvinfer1::IElementWiseLayer* RepConv(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c2, int k, int s, const std::string& lname);\n\nnvinfer1::IActivationLayer* convBlockLeakRelu(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int outch, int ksize, int s, int p, std::string lname);\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition *network, std::map<std::string, nvinfer1::Weights>& weightMap, std::string lname, std::vector<nvinfer1::IConvolutionLayer*> dets);\n\n"
  },
  {
    "path": "yolov7/include/calibrator.h",
    "content": "#ifndef ENTROPY_CALIBRATOR_H\n#define ENTROPY_CALIBRATOR_H\n\n#include <NvInfer.h>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2\n{\npublic:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache = true);\n\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\nprivate:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\n#endif // ENTROPY_CALIBRATOR_H\n"
  },
  {
    "path": "yolov7/include/config.h",
    "content": "#pragma once\n\n/* --------------------------------------------------------\n * These configs are related to tensorrt model, if these are changed,\n * please re-compile and re-serialize the tensorrt model.\n * --------------------------------------------------------*/\n\n// For INT8, you need prepare the calibration dataset, please refer to\n// https://github.com/wang-xinyu/tensorrtx/tree/master/yolov7#int8-quantization\n#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32\n\n// These are used to define input/output tensor names,\n// you can set them to whatever you want.\nconst static char* kInputTensorName = \"data\";\nconst static char* kOutputTensorName = \"prob\";\n\nconst static int kNumClass = 80;\nconst static int kBatchSize = 1;\n\n// Yolo's input width and height must by divisible by 32\nconst static int kInputH = 640;\nconst static int kInputW = 640;\n\n// Maximum number of output bounding boxes from yololayer plugin.\n// That is maximum number of output bounding boxes before NMS.\nconst static int kMaxNumOutputBbox = 1000;\n\nconst static int kNumAnchor = 3;\n\n// The bboxes whose confidence is lower than kIgnoreThresh will be ignored in yololayer plugin.\nconst static float kIgnoreThresh = 0.1f;\n\n/* --------------------------------------------------------\n * These configs are not related to tensorrt model, if these are changed,\n * please re-compile, but no need to re-serialize the tensorrt model.\n * --------------------------------------------------------*/\n\n// NMS overlapping thresh and final detection confidence thresh\nconst static float kNmsThresh = 0.45f;\nconst static float kConfThresh = 0.5f;\n\nconst static int kGpuId = 0;\n\n// If your image size is larger than 4096 * 3112, please increase this value\nconst static int kMaxInputImageSize = 4096 * 3112;\n\n"
  },
  {
    "path": "yolov7/include/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)\\\n    {\\\n        cudaError_t error_code = callstr;\\\n        if (error_code != cudaSuccess) {\\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__;\\\n            assert(0);\\\n        }\\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n\n"
  },
  {
    "path": "yolov7/include/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the 
stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override \n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov7/include/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include \"NvInfer.h\"\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "yolov7/include/model.h",
    "content": "#pragma once\n\n#include \"NvInfer.h\"\n#include <string>\n\nnvinfer1::IHostMemory* build_engine_yolov7e6e(unsigned int maxBatchSize, nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, const std::string& wts_path);\nnvinfer1::IHostMemory* build_engine_yolov7d6(unsigned int maxBatchSize, nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, const std::string& wts_path);\nnvinfer1::IHostMemory* build_engine_yolov7e6(unsigned int maxBatchSize, nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, const std::string& wts_path);\nnvinfer1::IHostMemory* build_engine_yolov7w6(unsigned int maxBatchSize, nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, const std::string& wts_path);\nnvinfer1::IHostMemory* build_engine_yolov7x(unsigned int maxBatchSize, nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, const std::string& wts_path);\nnvinfer1::IHostMemory* build_engine_yolov7(unsigned int maxBatchSize, nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, const std::string& wts_path);\nnvinfer1::IHostMemory* build_engine_yolov7_tiny(unsigned int maxBatchSize, nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, std::string& wts_name);\n"
  },
  {
    "path": "yolov7/include/postprocess.h",
    "content": "#pragma once\n\n#include \"types.h\"\n#include <opencv2/opencv.hpp>\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]);\n\nvoid nms(std::vector<Detection>& res, float *output, float conf_thresh, float nms_thresh = 0.5);\n\nvoid batch_nms(std::vector<std::vector<Detection>>& batch_res, float *output, int batch_size, int output_size, float conf_thresh, float nms_thresh = 0.5);\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\n"
  },
  {
    "path": "yolov7/include/preprocess.h",
    "content": "#pragma once\n\n#include <cuda_runtime.h>\n#include <cstdint>\n#include <opencv2/opencv.hpp>\n#include <iostream>\n\nvoid cuda_preprocess_init(int max_image_size);\nvoid cuda_preprocess_destroy();\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height,\n                     float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream);\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch,\n                           float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream);\n\n"
  },
  {
    "path": "yolov7/include/types.h",
    "content": "#pragma once\n\n#include \"config.h\"\n\nstruct YoloKernel {\n  int width;\n  int height;\n  float anchors[kNumAnchor * 2];\n};\n\nstruct alignas(float) Detection {\n  //center_x center_y w h\n  float bbox[4];\n  float conf;  // bbox_conf * cls_conf\n  float class_id;\n};\n\n"
  },
  {
    "path": "yolov7/include/utils.h",
    "content": "#ifndef TRTX_YOLOV7_UTILS_H_\n#define TRTX_YOLOV7_UTILS_H_\n\n#include <dirent.h>\n#include <opencv2/opencv.hpp>\n\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols*1.0);\n    float r_h = input_h / (img.rows*1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nstatic inline int read_files_in_dir(const char *p_dir_name, std::vector<std::string> &file_names) {\n    DIR *p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 &&\n            strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n#endif  // TRTX_YOLOV7_UTILS_H_\n\n"
  },
  {
    "path": "yolov7/main.cpp",
    "content": "#include \"config.h\"\n#include \"model.h\"\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"utils.h\"\n#include \"preprocess.h\"\n#include \"postprocess.h\"\n#include <chrono>\n#include <fstream>\n\nusing namespace nvinfer1;\n\nconst static int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\nstatic Logger gLogger;\n\nvoid serialize_engine(unsigned int maxBatchSize, std::string& wts_name, std::string& sub_type, std::string& engine_name) {\n  // Create builder\n  IBuilder* builder = createInferBuilder(gLogger);\n  IBuilderConfig* config = builder->createBuilderConfig();\n\n  // Create model to populate the network, then set the outputs and create an engine\n  IHostMemory* serialized_engine = nullptr;\n  if (sub_type == \"t\") {\n    serialized_engine = build_engine_yolov7_tiny(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);\n  } else if (sub_type == \"v7\") {\n    serialized_engine = build_engine_yolov7(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);\n  } else if (sub_type == \"x\") {\n    serialized_engine = build_engine_yolov7x(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);\n  } else if (sub_type == \"w6\") {\n    serialized_engine = build_engine_yolov7w6(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);\n  } else if (sub_type == \"e6\") {\n    serialized_engine = build_engine_yolov7e6(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);\n  } else if (sub_type == \"d6\") {\n    serialized_engine = build_engine_yolov7d6(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);\n  } else if (sub_type == \"e6e\") {\n    serialized_engine = build_engine_yolov7e6e(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);\n  }\n  assert(serialized_engine != nullptr);\n\n  std::ofstream p(engine_name, std::ios::binary);\n  if (!p) {\n    std::cerr << \"could not open plan output file\" << std::endl;\n    assert(false);\n  }\n  p.write(reinterpret_cast<const 
char*>(serialized_engine->data()), serialized_engine->size());\n\n  delete config;\n  delete serialized_engine;\n  delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine, IExecutionContext** context) {\n  std::ifstream file(engine_name, std::ios::binary);\n  if (!file.good()) {\n    std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n    assert(false);\n  }\n  size_t size = 0;\n  file.seekg(0, file.end);\n  size = file.tellg();\n  file.seekg(0, file.beg);\n  char* serialized_engine = new char[size];\n  assert(serialized_engine);\n  file.read(serialized_engine, size);\n  file.close();\n\n  *runtime = createInferRuntime(gLogger);\n  assert(*runtime);\n  *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n  assert(*engine);\n  *context = (*engine)->createExecutionContext();\n  assert(*context);\n  delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device, float** output_buffer_host) {\n  assert(engine->getNbBindings() == 2);\n  // In order to bind the buffers, we need to know the names of the input and output tensors.\n  // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n  const int inputIndex = engine->getBindingIndex(kInputTensorName);\n  const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n  assert(inputIndex == 0);\n  assert(outputIndex == 1);\n  // Create GPU buffers on device\n  CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n  CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n\n  *output_buffer_host = new float[kBatchSize * kOutputSize];\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchSize) {\n  // infer on the batch asynchronously, and DMA output back to host\n  context.enqueue(batchSize, 
buffers, stream, nullptr);\n  CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost, stream));\n  CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir, std::string& sub_type) {\n  if (argc < 4) return false;\n  if (std::string(argv[1]) == \"-s\" && argc == 5) {\n    wts = std::string(argv[2]);\n    engine = std::string(argv[3]);\n    sub_type = std::string(argv[4]);\n  } else if (std::string(argv[1]) == \"-d\" && argc == 4) {\n    engine = std::string(argv[2]);\n    img_dir = std::string(argv[3]);\n  } else {\n    return false;\n  }\n  return true;\n}\n\nint main(int argc, char** argv) {\n  cudaSetDevice(kGpuId);\n\n  std::string wts_name = \"\";\n  std::string engine_name = \"\";\n  std::string img_dir;\n  std::string sub_type = \"\";\n\n  if (!parse_args(argc, argv, wts_name, engine_name, img_dir, sub_type)) {\n    std::cerr << \"Arguments not right!\" << std::endl;\n    std::cerr << \"./yolov7 -s [.wts] [.engine] [t/v7/x/w6/e6/d6/e6e]  // serialize model to plan file\" << std::endl;\n    std::cerr << \"./yolov7 -d [.engine] ../samples  // deserialize plan file and run inference\" << std::endl;\n    return -1;\n  }\n\n  // Create a model using the API directly and serialize it to a file\n  if (!wts_name.empty()) {\n    serialize_engine(kBatchSize, wts_name, sub_type, engine_name);\n    return 0;\n  }\n\n  // Deserialize the engine from file\n  IRuntime* runtime = nullptr;\n  ICudaEngine* engine = nullptr;\n  IExecutionContext* context = nullptr;\n  deserialize_engine(engine_name, &runtime, &engine, &context);\n  cudaStream_t stream;\n  CUDA_CHECK(cudaStreamCreate(&stream));\n\n  cuda_preprocess_init(kMaxInputImageSize);\n\n  // Prepare cpu and gpu buffers\n  float* device_buffers[2];\n  float* output_buffer_host = nullptr;\n  prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host);\n\n  
// Read images from directory\n  std::vector<std::string> file_names;\n  if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n    std::cerr << \"read_files_in_dir failed.\" << std::endl;\n    return -1;\n  }\n\n  // batch predict\n  for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n    // Get a batch of images\n    std::vector<cv::Mat> img_batch;\n    std::vector<std::string> img_name_batch;\n    for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n      cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n      img_batch.push_back(img);\n      img_name_batch.push_back(file_names[j]);\n    }\n\n    // Preprocess\n    cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n\n    // Run inference\n    auto start = std::chrono::system_clock::now();\n    infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize);\n    auto end = std::chrono::system_clock::now();\n    std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n\n    // NMS\n    std::vector<std::vector<Detection>> res_batch;\n    batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n\n    // Draw bounding boxes\n    draw_bbox(img_batch, res_batch);\n\n    // Save images\n    for (size_t j = 0; j < img_batch.size(); j++) {\n      cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n    }\n  }\n\n  // Release stream and buffers\n  cudaStreamDestroy(stream);\n  CUDA_CHECK(cudaFree(device_buffers[0]));\n  CUDA_CHECK(cudaFree(device_buffers[1]));\n  delete[] output_buffer_host;\n  cuda_preprocess_destroy();\n  // Destroy the engine\n  delete context;\n  delete engine;\n  delete runtime;\n\n  // Print histogram of the output distribution\n  //std::cout << \"\\nOutput:\\n\\n\";\n  //for (unsigned int i = 0; i < kOutputSize; i++)\n  //{\n  //    std::cout << prob[i] << \", \";\n  //    if (i % 10 == 0) 
std::cout << std::endl;\n  //}\n  //std::cout << std::endl;\n\n  return 0;\n}\n\n"
  },
  {
    "path": "yolov7/plugin/yololayer.cu",
    "content": "#include \"yololayer.h\"\n#include \"cuda_utils.h\"\n#include <assert.h>\n#include <vector>\n#include <iostream>\n\nnamespace Tn {\ntemplate<typename T>\nvoid write(char*& buffer, const T& val) {\n  *reinterpret_cast<T*>(buffer) = val;\n  buffer += sizeof(T);\n}\n\ntemplate<typename T>\nvoid read(const char*& buffer, T& val) {\n  val = *reinterpret_cast<const T*>(buffer);\n  buffer += sizeof(T);\n}\n}  // namespace Tn\n\nnamespace nvinfer1 {\nYoloLayerPlugin::YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const std::vector<YoloKernel>& vYoloKernel) {\n  mClassCount = classCount;\n  mYoloV7NetWidth = netWidth;\n  mYoloV7NetHeight = netHeight;\n  mMaxOutObject = maxOut;\n  mYoloKernel = vYoloKernel;\n  mKernelCount = vYoloKernel.size();\n\n  CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n  size_t AnchorLen = sizeof(float) * kNumAnchor * 2;\n  for (int ii = 0; ii < mKernelCount; ii++) {\n    CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));\n    const auto& yolo = mYoloKernel[ii];\n    CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n  }\n}\nYoloLayerPlugin::~YoloLayerPlugin() {\n  for (int ii = 0; ii < mKernelCount; ii++) {\n    CUDA_CHECK(cudaFree(mAnchor[ii]));\n  }\n  CUDA_CHECK(cudaFreeHost(mAnchor));\n}\n\n// create the plugin at runtime from a byte stream\nYoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length) {\n  using namespace Tn;\n  const char *d = reinterpret_cast<const char *>(data), *a = d;\n  read(d, mClassCount);\n  read(d, mThreadCount);\n  read(d, mKernelCount);\n  read(d, mYoloV7NetWidth);\n  read(d, mYoloV7NetHeight);\n  read(d, mMaxOutObject);\n  mYoloKernel.resize(mKernelCount);\n  auto kernelSize = mKernelCount * sizeof(YoloKernel);\n  memcpy(mYoloKernel.data(), d, kernelSize);\n  d += kernelSize;\n  CUDA_CHECK(cudaMallocHost(&mAnchor, mKernelCount * sizeof(void*)));\n  size_t AnchorLen = sizeof(float) * kNumAnchor * 2;\n  for (int ii 
= 0; ii < mKernelCount; ii++) {\n    CUDA_CHECK(cudaMalloc(&mAnchor[ii], AnchorLen));\n    const auto& yolo = mYoloKernel[ii];\n    CUDA_CHECK(cudaMemcpy(mAnchor[ii], yolo.anchors, AnchorLen, cudaMemcpyHostToDevice));\n  }\n  assert(d == a + length);\n}\n\nvoid YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT {\n  using namespace Tn;\n  char* d = static_cast<char*>(buffer), *a = d;\n  write(d, mClassCount);\n  write(d, mThreadCount);\n  write(d, mKernelCount);\n  write(d, mYoloV7NetWidth);\n  write(d, mYoloV7NetHeight);\n  write(d, mMaxOutObject);\n  auto kernelSize = mKernelCount * sizeof(YoloKernel);\n  memcpy(d, mYoloKernel.data(), kernelSize);\n  d += kernelSize;\n\n  assert(d == a + getSerializationSize());\n}\n\nsize_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT {\n  return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mKernelCount) + sizeof(YoloKernel) * mYoloKernel.size() + sizeof(mYoloV7NetWidth) + sizeof(mYoloV7NetHeight) + sizeof(mMaxOutObject);\n}\n\nint YoloLayerPlugin::initialize() TRT_NOEXCEPT {\n  return 0;\n}\n\nDims YoloLayerPlugin::getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT {\n  //output the result to channel\n  int totalsize = mMaxOutObject * sizeof(Detection) / sizeof(float);\n  return Dims3(totalsize + 1, 1, 1);\n}\n\n// Set plugin namespace\nvoid YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT {\n  mPluginNamespace = pluginNamespace;\n}\n\nconst char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT {\n  return mPluginNamespace;\n}\n\n// Return the DataType of the plugin output at the requested index\nDataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT {\n  return DataType::kFLOAT;\n}\n\n// Return true if output tensor is broadcast across a batch.\nbool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) 
const TRT_NOEXCEPT {\n  return false;\n}\n\n// Return true if plugin can use input that is broadcast across batch without replication.\nbool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT {\n  return false;\n}\n\nvoid YoloLayerPlugin::configurePlugin(PluginTensorDesc const* in, int nbInput, PluginTensorDesc const* out, int nbOutput) TRT_NOEXCEPT {}\n\n// Attach the plugin object to an execution context and grant the plugin the access to some context resource.\nvoid YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT {}\n\n// Detach the plugin object from its execution context.\nvoid YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\nconst char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT {\n  return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT {\n  return \"1\";\n}\n\nvoid YoloLayerPlugin::destroy() TRT_NOEXCEPT {\n  delete this;\n}\n\n// Clone the plugin\nIPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT {\n  YoloLayerPlugin* p = new YoloLayerPlugin(mClassCount, mYoloV7NetWidth, mYoloV7NetHeight, mMaxOutObject, mYoloKernel);\n  p->setPluginNamespace(mPluginNamespace);\n  return p;\n}\n\n__device__ float Logist(float data) { return 1.0f / (1.0f + expf(-data)); };\n\n__global__ void CalDetection(const float *input, float *output, int noElements,\n    const int netwidth, const int netheight, int maxoutobject, int yoloWidth, int yoloHeight, const float anchors[kNumAnchor * 2], int classes, int outputElem) {\n  int idx = threadIdx.x + blockDim.x * blockIdx.x;\n  if (idx >= noElements) return;\n\n  int total_grid = yoloWidth * yoloHeight;  // 80*80 40*40 20*20\n  int bnIdx = idx / total_grid;\n  idx = idx - total_grid * bnIdx;\n  int info_len_i = 5 + classes;\n  const float* curInput = input + bnIdx * (info_len_i * total_grid * kNumAnchor);\n\n  for (int k = 0; k < 3; k++) {\n    float 
box_prob = Logist(curInput[idx + k * info_len_i * total_grid + 4 * total_grid]);\n    if (box_prob < kIgnoreThresh) continue;\n    int class_id = 0;\n    float max_cls_prob = 0.0;\n    for (int i = 5; i < info_len_i; ++i) {\n      float p = Logist(curInput[idx + k * info_len_i * total_grid + i * total_grid]);\n      if (p > max_cls_prob) {\n        max_cls_prob = p;\n        class_id = i - 5;\n      }\n    }\n    float *res_count = output + bnIdx * outputElem;\n    int count = (int)atomicAdd(res_count, 1);\n    if (count >= maxoutobject) return;\n    char *data = (char*)res_count + sizeof(float) + count * sizeof(Detection);\n    Detection *det = (Detection*)(data);\n\n    int row = idx / yoloWidth;\n    int col = idx % yoloWidth;\n\n    // Location\n    // pytorch:\n    //  y = x[i].sigmoid()\n    //  y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy\n    //  y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n    //  X: (sigmoid(tx) + cx)/FeaturemapW *  netwidth\n    det->bbox[0] = (col - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 0 * total_grid])) * netwidth / yoloWidth;\n    det->bbox[1] = (row - 0.5f + 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 1 * total_grid])) * netheight / yoloHeight;\n\n    // W: (Pw * e^tw) / FeaturemapW * netwidth\n    // v5: https://github.com/ultralytics/yolov7/issues/471\n    //float box_w = ((row[2] * 2)*(row[2] * 2)) * float(anchors[a][c][0]) * scale;\n    //float box_h = ((row[3] * 2) * (row[3] * 2)) * float(anchors[a][c][1]) * scale;\n    det->bbox[2] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 2 * total_grid]);\n    det->bbox[2] = det->bbox[2] * det->bbox[2] * anchors[2 * k];\n    det->bbox[3] = 2.0f * Logist(curInput[idx + k * info_len_i * total_grid + 3 * total_grid]);\n    det->bbox[3] = det->bbox[3] * det->bbox[3] * anchors[2 * k + 1];\n    det->conf = box_prob * max_cls_prob;\n    det->class_id = class_id;\n  
}\n}\n\nvoid YoloLayerPlugin::forwardGpu(const float* const* inputs, float *output, cudaStream_t stream, int batchSize) {\n  int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n  for (int idx = 0; idx < batchSize; ++idx) {\n    CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n  }\n  int numElem = 0;\n\n  for (unsigned int i = 0; i < mYoloKernel.size(); ++i) {\n    const auto& yolo = mYoloKernel[i];\n    numElem = yolo.width * yolo.height * batchSize;\n    if (numElem < mThreadCount) mThreadCount = numElem;\n\n    CalDetection<<<(numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>\n        (inputs[i], output, numElem, mYoloV7NetWidth, mYoloV7NetHeight, mMaxOutObject, yolo.width, yolo.height, (float*)mAnchor[i], mClassCount, outputElem);\n  }\n}\n\nint YoloLayerPlugin::enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT {\n  forwardGpu((const float* const*)inputs, (float*)outputs[0], stream, batchSize);\n  return 0;\n}\n\nPluginFieldCollection YoloPluginCreator::mFC{};\nstd::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\nYoloPluginCreator::YoloPluginCreator() {\n  mPluginAttributes.clear();\n  mFC.nbFields = mPluginAttributes.size();\n  mFC.fields = mPluginAttributes.data();\n}\n\nconst char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT {\n  return \"YoloLayer_TRT\";\n}\n\nconst char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT {\n  return \"1\";\n}\n\nconst PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT {\n  return &mFC;\n}\n\nIPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT {\n  assert(fc->nbFields == 2);\n  assert(strcmp(fc->fields[0].name, \"netinfo\") == 0);\n  assert(strcmp(fc->fields[1].name, \"kernels\") == 0);\n  int *p_netinfo = (int*)(fc->fields[0].data);\n  int class_count = 
p_netinfo[0];\n  int input_w = p_netinfo[1];\n  int input_h = p_netinfo[2];\n  int max_output_object_count = p_netinfo[3];\n  std::vector<YoloKernel> kernels(fc->fields[1].length);\n  memcpy(&kernels[0], fc->fields[1].data, kernels.size() * sizeof(YoloKernel));\n  YoloLayerPlugin* obj = new YoloLayerPlugin(class_count, input_w, input_h, max_output_object_count, kernels);\n  obj->setPluginNamespace(mNamespace.c_str());\n  return obj;\n}\n\nIPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT {\n  // This object will be deleted when the network is destroyed, which will\n  // call YoloLayerPlugin::destroy()\n  YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n  obj->setPluginNamespace(mNamespace.c_str());\n  return obj;\n}\n}  // namespace nvinfer1\n\n"
  },
  {
    "path": "yolov7/plugin/yololayer.h",
    "content": "#pragma once\n\n#include \"macros.h\"\n#include \"types.h\"\n#include <vector>\n#include <string>\n\nnamespace nvinfer1 {\nclass API YoloLayerPlugin : public IPluginV2IOExt {\n public:\n  YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, const std::vector<YoloKernel>& vYoloKernel);\n  YoloLayerPlugin(const void* data, size_t length);\n  ~YoloLayerPlugin();\n\n  int getNbOutputs() const TRT_NOEXCEPT override {\n    return 1;\n  }\n\n  Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n  int initialize() TRT_NOEXCEPT override;\n\n  virtual void terminate() TRT_NOEXCEPT override {}\n\n  virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n  virtual int enqueue(int batchSize, const void* const* inputs, void*TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;\n\n  virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n  virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n  bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {\n    return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n  }\n\n  const char* getPluginType() const TRT_NOEXCEPT override;\n\n  const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n  void destroy() TRT_NOEXCEPT override;\n\n  IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n  void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n  const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n  DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT override;\n\n  bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;\n\n  bool canBroadcastInputAcrossBatch(int inputIndex) 
const TRT_NOEXCEPT override;\n\n  void attachToContext(\n      cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n  void configurePlugin(PluginTensorDesc const* in, int nbInput, PluginTensorDesc const* out, int nbOutput) TRT_NOEXCEPT override;\n\n  void detachFromContext() TRT_NOEXCEPT override;\n\n private:\n  void forwardGpu(const float* const* inputs, float *output, cudaStream_t stream, int batchSize = 1);\n  int mThreadCount = 256;\n  const char* mPluginNamespace;\n  int mKernelCount;\n  int mClassCount;\n  int mYoloV7NetWidth;\n  int mYoloV7NetHeight;\n  int mMaxOutObject;\n  std::vector<YoloKernel> mYoloKernel;\n  void** mAnchor;\n};\n\nclass API YoloPluginCreator : public IPluginCreator {\n public:\n  YoloPluginCreator();\n\n  ~YoloPluginCreator() override = default;\n\n  const char* getPluginName() const TRT_NOEXCEPT override;\n\n  const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n  const PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n  IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n  IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT override;\n\n  void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override {\n    mNamespace = libNamespace;\n  }\n\n  const char* getPluginNamespace() const TRT_NOEXCEPT override {\n    return mNamespace.c_str();\n  }\n\n private:\n  std::string mNamespace;\n  static PluginFieldCollection mFC;\n  static std::vector<PluginField> mPluginAttributes;\n};\nREGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n}  // namespace nvinfer1\n\n"
  },
  {
    "path": "yolov7/src/block.cpp",
    "content": "﻿#include \"block.h\"\n#include \"yololayer.h\"\n#include \"NvInfer.h\"\n#include <iostream>\n#include <fstream>\n#include <assert.h>\n#include <cmath>\n#include <cstring>\n\nusing namespace nvinfer1;\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. please check if the .wts file path is right!!!!!!\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--)\n    {\n        Weights wt{ DataType::kFLOAT, nullptr, 0 };\n        uint32_t size;\n\n        // Read name and type of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(val) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x)\n        {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nstatic IScaleLayer* addBatchNorm2d(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, std::string lname, float eps) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < 
len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights scale{ DataType::kFLOAT, scval, len };\n\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    Weights shift{ DataType::kFLOAT, shval, len };\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    Weights power{ DataType::kFLOAT, pval, len };\n\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    IScaleLayer* scale_1 = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);\n    assert(scale_1);\n    return scale_1;\n}\n\nIElementWiseLayer* convBnSilu(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c2, int k, int s, int p, std::string lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, c2, DimsHW{ k, k }, weightMap[lname + \".conv.weight\"], emptywts);\n    assert(conv1);\n    conv1->setName((lname + \".conv\").c_str());\n    conv1->setStrideNd(DimsHW{ s, s });\n    conv1->setPaddingNd(DimsHW{ p, p });\n\n\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn\", 1e-3);\n\n\n    // silu = x * sigmoid(x)\n    IActivationLayer* sig1 = network->addActivation(*bn1->getOutput(0), ActivationType::kSIGMOID);\n    assert(sig1);\n    IElementWiseLayer* ew1 = network->addElementWise(*bn1->getOutput(0), *sig1->getOutput(0), ElementWiseOperation::kPROD);\n    assert(ew1);\n    return ew1;\n}\n\nILayer* ReOrg(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int inch) {\n    ISliceLayer* s1 = network->addSlice(input, Dims3{ 0, 0, 0 }, Dims3{ inch, kInputH / 2, kInputW / 2 }, 
Dims3{ 1, 2, 2 });\n    ISliceLayer* s2 = network->addSlice(input, Dims3{ 0, 1, 0 }, Dims3{ inch, kInputH / 2, kInputW / 2 }, Dims3{ 1, 2, 2 });\n    ISliceLayer* s3 = network->addSlice(input, Dims3{ 0, 0, 1 }, Dims3{ inch, kInputH / 2, kInputW / 2 }, Dims3{ 1, 2, 2 });\n    ISliceLayer* s4 = network->addSlice(input, Dims3{ 0, 1, 1 }, Dims3{ inch, kInputH / 2, kInputW / 2 }, Dims3{ 1, 2, 2 });\n    ITensor* inputTensors[] = { s1->getOutput(0), s2->getOutput(0), s3->getOutput(0), s4->getOutput(0) };\n    auto cat = network->addConcatenation(inputTensors, 4);\n    return cat;\n}\n\nILayer* DownC(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2, const std::string& lname) {\n    int c_ = int(c2 * 0.5);\n    IElementWiseLayer* cv1 = convBnSilu(network, weightMap, input, c1, 1, 1, 0, lname + \".cv1\");\n    IElementWiseLayer* cv2 = convBnSilu(network, weightMap, *cv1->getOutput(0), c_, 3, 2, 1, lname + \".cv2\");\n\n    IPoolingLayer* m1 = network->addPoolingNd(input, PoolingType::kMAX, DimsHW{ 2, 2 });\n    m1->setStrideNd(DimsHW{ 2, 2 });\n    IElementWiseLayer* cv3 = convBnSilu(network, weightMap, *m1->getOutput(0), c_, 1, 1, 0, lname + \".cv3\");\n\n    ITensor* input_tensors[] = { cv2->getOutput(0),  cv3->getOutput(0) };\n    IConcatenationLayer* concat = network->addConcatenation(input_tensors, 2);\n\n    return concat;\n\n}\n\nIElementWiseLayer* SPPCSPC(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c2, const std::string& lname) {\n    int c_ = int(2 * c2 * 0.5);\n    IElementWiseLayer* cv1 = convBnSilu(network, weightMap, input, c_, 1, 1, 0, lname + \".cv1\");\n    IElementWiseLayer* cv2 = convBnSilu(network, weightMap, input, c_, 1, 1, 0, lname + \".cv2\");\n\n    IElementWiseLayer* cv3 = convBnSilu(network, weightMap, *cv1->getOutput(0), c_, 3, 1, 1, lname + \".cv3\");\n    IElementWiseLayer* cv4 = convBnSilu(network, weightMap, *cv3->getOutput(0), c_, 1, 1, 0, 
lname + \".cv4\");\n\n    IPoolingLayer* m1 = network->addPoolingNd(*cv4->getOutput(0), PoolingType::kMAX, DimsHW{ 5, 5 });\n    m1->setStrideNd(DimsHW{ 1, 1 });\n    m1->setPaddingNd(DimsHW{ 2, 2 });\n    IPoolingLayer* m2 = network->addPoolingNd(*cv4->getOutput(0), PoolingType::kMAX, DimsHW{ 9, 9 });\n    m2->setStrideNd(DimsHW{ 1, 1 });\n    m2->setPaddingNd(DimsHW{ 4, 4 });\n    IPoolingLayer* m3 = network->addPoolingNd(*cv4->getOutput(0), PoolingType::kMAX, DimsHW{ 13, 13 });\n    m3->setStrideNd(DimsHW{ 1, 1 });\n    m3->setPaddingNd(DimsHW{ 6, 6 });\n\n    ITensor* input_tensors[] = { cv4->getOutput(0), m1->getOutput(0), m2->getOutput(0), m3->getOutput(0) };\n    IConcatenationLayer* concat = network->addConcatenation(input_tensors, 4);\n    // 0U\n    concat->setAxis(0);\n\n    IElementWiseLayer* cv5 = convBnSilu(network, weightMap, *concat->getOutput(0), c_, 1, 1, 0, lname + \".cv5\");\n    IElementWiseLayer* cv6 = convBnSilu(network, weightMap, *cv5->getOutput(0), c_, 3, 1, 1, lname + \".cv6\");\n\n    ITensor* input_tensors2[] = { cv6->getOutput(0), cv2->getOutput(0) };\n    IConcatenationLayer* concat1 = network->addConcatenation(input_tensors2, 2);\n    // 0U\n    concat1->setAxis(0);\n\n\n    IElementWiseLayer* cv7 = convBnSilu(network, weightMap, *concat1->getOutput(0), c2, 1, 1, 0, lname + \".cv7\");\n    return cv7;\n}\n\nIElementWiseLayer* RepConv(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c2, int k, int s, const std::string& lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n    // 256 * 128 * 3 *3\n    IConvolutionLayer* rbr_dense_conv = network->addConvolutionNd(input, c2, DimsHW{ k, k }, weightMap[lname + \".rbr_dense.0.weight\"], emptywts);\n    assert(rbr_dense_conv);\n    rbr_dense_conv->setPaddingNd(DimsHW{ k / 2, k / 2 });\n    rbr_dense_conv->setStrideNd(DimsHW{ s, s });\n    rbr_dense_conv->setName((lname + \".rbr_dense.0\").c_str());\n    IScaleLayer* rbr_dense_bn = 
addBatchNorm2d(network, weightMap, *rbr_dense_conv->getOutput(0), lname + \".rbr_dense.1\", 1e-3);\n\n    IConvolutionLayer* rbr_1x1_conv = network->addConvolutionNd(input, c2, DimsHW{ 1, 1 }, weightMap[lname + \".rbr_1x1.0.weight\"], emptywts);\n    assert(rbr_1x1_conv);\n    rbr_1x1_conv->setStrideNd(DimsHW{ s, s });\n    rbr_1x1_conv->setName((lname + \".rbr_1x1.0\").c_str());\n    IScaleLayer* rbr_1x1_bn = addBatchNorm2d(network, weightMap, *rbr_1x1_conv->getOutput(0), lname + \".rbr_1x1.1\", 1e-3);\n\n    IElementWiseLayer* ew1 = network->addElementWise(*rbr_dense_bn->getOutput(0), *rbr_1x1_bn->getOutput(0), ElementWiseOperation::kSUM);\n    assert(ew1);\n    // silu\n    IActivationLayer* sigmoid = network->addActivation(*ew1->getOutput(0), ActivationType::kSIGMOID);\n    IElementWiseLayer* ew2 = network->addElementWise(*ew1->getOutput(0), *sigmoid->getOutput(0), ElementWiseOperation::kPROD);\n    return ew2;\n}\n\nIActivationLayer* convBlockLeakRelu(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int outch, int ksize, int s, int p, std::string lname) {\n    Weights emptywts{ DataType::kFLOAT, nullptr, 0 };\n\n    IConvolutionLayer* conv1 = network->addConvolutionNd(input, outch, DimsHW{ ksize, ksize }, weightMap[lname + \".conv.weight\"], emptywts);\n    assert(conv1);\n    conv1->setName((lname + \".conv\").c_str());\n    conv1->setStrideNd(DimsHW{ s, s });\n    conv1->setPaddingNd(DimsHW{ p, p });\n    //conv1->setNbGroups(g);\n    IScaleLayer* bn1 = addBatchNorm2d(network, weightMap, *conv1->getOutput(0), lname + \".bn\", 1e-5);\n\n    auto ew1 = network->addActivation(*bn1->getOutput(0), ActivationType::kLEAKY_RELU);\n    ew1->setAlpha(0.1);\n    return ew1;\n}\n\nstatic std::vector<std::vector<float>> getAnchors(std::map<std::string, Weights>& weightMap, std::string lname) {\n    std::vector<std::vector<float>> anchors;\n    Weights wts = weightMap[lname + \".anchor_grid\"];\n    int anchor_len = kNumAnchor * 2;\n 
   for (int i = 0; i < wts.count / anchor_len; i++) {\n        auto *p = (const float*)wts.values + i * anchor_len;\n        std::vector<float> anchor(p, p + anchor_len);\n        anchors.push_back(anchor);\n    }\n    return anchors;\n}\n\nIPluginV2Layer* addYoLoLayer(INetworkDefinition *network, std::map<std::string, Weights>& weightMap, std::string lname, std::vector<IConvolutionLayer*> dets) {\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    auto anchors = getAnchors(weightMap, lname);\n\n    PluginField plugin_fields[2];\n    int netinfo[4] = {kNumClass, kInputW, kInputH, kMaxNumOutputBbox};\n    plugin_fields[0].data = netinfo;\n    plugin_fields[0].length = 4;\n    plugin_fields[0].name = \"netinfo\";\n    plugin_fields[0].type = PluginFieldType::kFLOAT32;\n    int scale = 8;\n\n    std::vector<YoloKernel> kernels;\n    for (size_t i = 0; i < anchors.size(); i++) {\n        YoloKernel kernel;\n        kernel.width = kInputW / scale;\n        kernel.height = kInputH / scale;\n        memcpy(kernel.anchors, &anchors[i][0], anchors[i].size() * sizeof(float));\n        kernels.push_back(kernel);\n        scale *= 2;\n    }\n    plugin_fields[1].data = &kernels[0];\n    plugin_fields[1].length = kernels.size();\n    plugin_fields[1].name = \"kernels\";\n    plugin_fields[1].type = PluginFieldType::kFLOAT32;\n    PluginFieldCollection plugin_data;\n    plugin_data.nbFields = 2;\n    plugin_data.fields = plugin_fields;\n    IPluginV2 *plugin_obj = creator->createPlugin(\"yololayer\", &plugin_data);\n    std::vector<ITensor*> input_tensors;\n    for (auto det: dets) {\n        input_tensors.push_back(det->getOutput(0));\n    }\n    auto yolo = network->addPluginV2(&input_tensors[0], input_tensors.size(), *plugin_obj);\n    return yolo;\n}\n\n"
  },
  {
    "path": "yolov7/src/calibrator.cpp",
    "content": "#include <iostream>\n#include <iterator>\n#include <fstream>\n#include <opencv2/dnn/dnn.hpp>\n#include \"calibrator.h\"\n#include \"cuda_utils.h\"\n#include \"utils.h\"\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache)\n    : batchsize_(batchsize)\n    , input_w_(input_w)\n    , input_h_(input_h)\n    , img_idx_(0)\n    , img_dir_(img_dir)\n    , calib_table_name_(calib_table_name)\n    , input_blob_name_(input_blob_name)\n    , read_cache_(read_cache)\n{\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2()\n{\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT\n{\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT\n{\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + img_files_[i]);\n        if (temp.empty()){\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(pr_img);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0), true, false);\n\n    CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], 
input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT\n{\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good())\n    {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT\n{\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n\n"
  },
  {
    "path": "yolov7/src/model.cpp",
    "content": "#include \"model.h\"\n#include \"block.h\"\n// #include \"yololayer.h\"\n#include \"config.h\"\n#include \"calibrator.h\"\n#include <iostream>\n#include <cassert>\n\nusing namespace nvinfer1;\n\nIHostMemory* build_engine_yolov7e6e(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, const std::string& wts_path) {\n    std::map<std::string, Weights> weightMap = loadWeights(wts_path);\n\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kInputH, kInputW });\n    assert(data);\n\n    auto* conv0 = ReOrg(network, weightMap, *data, 3);\n\n\n    IElementWiseLayer* conv1 = convBnSilu(network, weightMap, *conv0->getOutput(0), 80, 3, 1, 1, \"model.1\");\n    auto conv2 = DownC(network, weightMap, *conv1->getOutput(0), 80, 160, \"model.2\");\n\n    IElementWiseLayer* conv3 = convBnSilu(network, weightMap, *conv2->getOutput(0), 64, 1, 1, 0, \"model.3\");\n    IElementWiseLayer* conv4 = convBnSilu(network, weightMap, *conv2->getOutput(0), 64, 1, 1, 0, \"model.4\");\n\n    IElementWiseLayer* conv5 = convBnSilu(network, weightMap, *conv4->getOutput(0), 64, 3, 1, 1, \"model.5\");\n    IElementWiseLayer* conv6 = convBnSilu(network, weightMap, *conv5->getOutput(0), 64, 3, 1, 1, \"model.6\");\n    IElementWiseLayer* conv7 = convBnSilu(network, weightMap, *conv6->getOutput(0), 64, 3, 1, 1, \"model.7\");\n    IElementWiseLayer* conv8 = convBnSilu(network, weightMap, *conv7->getOutput(0), 64, 3, 1, 1, \"model.8\");\n    IElementWiseLayer* conv9 = convBnSilu(network, weightMap, *conv8->getOutput(0), 64, 3, 1, 1, \"model.9\");\n    IElementWiseLayer* conv10 = convBnSilu(network, weightMap, *conv9->getOutput(0), 64, 3, 1, 1, \"model.10\");\n\n    ITensor* input_tensor_11[] = { conv10->getOutput(0), conv8->getOutput(0),conv6->getOutput(0), conv4->getOutput(0),\n        conv3->getOutput(0) };\n    IConcatenationLayer* concat11 = 
network->addConcatenation(input_tensor_11, 5);\n\n    IElementWiseLayer* conv12 = convBnSilu(network, weightMap, *concat11->getOutput(0), 160, 1, 1, 0, \"model.12\");\n\n\n    IElementWiseLayer* conv13 = convBnSilu(network, weightMap, *conv2->getOutput(0), 64, 1, 1, 0, \"model.13\");\n    IElementWiseLayer* conv14 = convBnSilu(network, weightMap, *conv2->getOutput(0), 64, 1, 1, 0, \"model.14\");\n\n    IElementWiseLayer* conv15 = convBnSilu(network, weightMap, *conv14->getOutput(0), 64, 3, 1, 1, \"model.15\");\n    IElementWiseLayer* conv16 = convBnSilu(network, weightMap, *conv15->getOutput(0), 64, 3, 1, 1, \"model.16\");\n    IElementWiseLayer* conv17 = convBnSilu(network, weightMap, *conv16->getOutput(0), 64, 3, 1, 1, \"model.17\");\n    IElementWiseLayer* conv18 = convBnSilu(network, weightMap, *conv17->getOutput(0), 64, 3, 1, 1, \"model.18\");\n    IElementWiseLayer* conv19 = convBnSilu(network, weightMap, *conv18->getOutput(0), 64, 3, 1, 1, \"model.19\");\n    IElementWiseLayer* conv20 = convBnSilu(network, weightMap, *conv19->getOutput(0), 64, 3, 1, 1, \"model.20\");\n    ITensor* input_tensor_21[] = { conv20->getOutput(0), conv18->getOutput(0),conv16->getOutput(0), conv14->getOutput(0),\n        conv13->getOutput(0) };\n    IConcatenationLayer* concat21 = network->addConcatenation(input_tensor_21, 5);\n    \n    IElementWiseLayer* conv22 = convBnSilu(network, weightMap, *concat21->getOutput(0), 160, 1, 1, 0, \"model.22\");\n    auto conv23 = network->addElementWise(*conv22->getOutput(0), *conv12->getOutput(0), ElementWiseOperation::kSUM);\n\n    auto conv24 = DownC(network, weightMap, *conv23->getOutput(0), 160, 320, \"model.24\");\n    IElementWiseLayer* conv25 = convBnSilu(network, weightMap, *conv24->getOutput(0), 128, 1, 1, 0, \"model.25\");\n    IElementWiseLayer* conv26 = convBnSilu(network, weightMap, *conv24->getOutput(0), 128, 1, 1, 0, \"model.26\");\n\n    IElementWiseLayer* conv27 = convBnSilu(network, weightMap, *conv26->getOutput(0), 128, 3, 1, 
1, \"model.27\");\n    IElementWiseLayer* conv28 = convBnSilu(network, weightMap, *conv27->getOutput(0), 128, 3, 1, 1, \"model.28\");\n    IElementWiseLayer* conv29 = convBnSilu(network, weightMap, *conv28->getOutput(0), 128, 3, 1, 1, \"model.29\");\n    IElementWiseLayer* conv30 = convBnSilu(network, weightMap, *conv29->getOutput(0), 128, 3, 1, 1, \"model.30\");\n    IElementWiseLayer* conv31 = convBnSilu(network, weightMap, *conv30->getOutput(0), 128, 3, 1, 1, \"model.31\");\n    IElementWiseLayer* conv32 = convBnSilu(network, weightMap, *conv31->getOutput(0), 128, 3, 1, 1, \"model.32\");\n\n    ITensor* input_tensor_33[] = { conv32->getOutput(0), conv30->getOutput(0),conv28->getOutput(0), conv26->getOutput(0),\n        conv25->getOutput(0)};\n    IConcatenationLayer* concat33 = network->addConcatenation(input_tensor_33, 5);\n\n    IElementWiseLayer* conv34 = convBnSilu(network, weightMap, *concat33->getOutput(0), 320, 1, 1, 0, \"model.34\");\n\n    IElementWiseLayer* conv35 = convBnSilu(network, weightMap, *conv24->getOutput(0), 128, 1, 1, 0, \"model.35\");\n    IElementWiseLayer* conv36 = convBnSilu(network, weightMap, *conv24->getOutput(0), 128, 1, 1, 0, \"model.36\");\n\n    IElementWiseLayer* conv37 = convBnSilu(network, weightMap, *conv36->getOutput(0), 128, 3, 1, 1, \"model.37\");\n    IElementWiseLayer* conv38 = convBnSilu(network, weightMap, *conv37->getOutput(0), 128, 3, 1, 1, \"model.38\");\n    IElementWiseLayer* conv39 = convBnSilu(network, weightMap, *conv38->getOutput(0), 128, 3, 1, 1, \"model.39\");\n    IElementWiseLayer* conv40 = convBnSilu(network, weightMap, *conv39->getOutput(0), 128, 3, 1, 1, \"model.40\");\n    IElementWiseLayer* conv41 = convBnSilu(network, weightMap, *conv40->getOutput(0), 128, 3, 1, 1, \"model.41\");\n    IElementWiseLayer* conv42 = convBnSilu(network, weightMap, *conv41->getOutput(0), 128, 3, 1, 1, \"model.42\");\n\n    ITensor* input_tensor_43[] = { conv42->getOutput(0), conv40->getOutput(0),conv38->getOutput(0), 
conv36->getOutput(0),\n        conv35->getOutput(0)};\n    IConcatenationLayer* concat43 = network->addConcatenation(input_tensor_43, 5);\n    IElementWiseLayer* conv44 = convBnSilu(network, weightMap, *concat43->getOutput(0), 320, 1, 1, 0, \"model.44\");\n\n    auto conv45 = network->addElementWise(*conv44->getOutput(0), *conv34->getOutput(0), ElementWiseOperation::kSUM);\n\n    auto conv46 = DownC(network, weightMap, *conv45->getOutput(0), 320, 640, \"model.46\");//=====\n\n\n    IElementWiseLayer* conv47 = convBnSilu(network, weightMap, *conv46->getOutput(0), 256, 1, 1, 0, \"model.47\");\n    IElementWiseLayer* conv48 = convBnSilu(network, weightMap, *conv46->getOutput(0), 256, 1, 1, 0, \"model.48\");\n\n    IElementWiseLayer* conv49 = convBnSilu(network, weightMap, *conv48->getOutput(0), 256, 3, 1, 1, \"model.49\");\n    IElementWiseLayer* conv50 = convBnSilu(network, weightMap, *conv49->getOutput(0), 256, 3, 1, 1, \"model.50\");\n    IElementWiseLayer* conv51 = convBnSilu(network, weightMap, *conv50->getOutput(0), 256, 3, 1, 1, \"model.51\");\n    IElementWiseLayer* conv52 = convBnSilu(network, weightMap, *conv51->getOutput(0), 256, 3, 1, 1, \"model.52\");\n    IElementWiseLayer* conv53 = convBnSilu(network, weightMap, *conv52->getOutput(0), 256, 3, 1, 1, \"model.53\");\n    IElementWiseLayer* conv54 = convBnSilu(network, weightMap, *conv53->getOutput(0), 256, 3, 1, 1, \"model.54\");\n    \n    ITensor* input_tensor_55[] = { conv54->getOutput(0), conv52->getOutput(0),conv50->getOutput(0), conv48->getOutput(0),\n        conv47->getOutput(0) };\n    IConcatenationLayer* concat55 = network->addConcatenation(input_tensor_55, 5);\n    IElementWiseLayer* conv56 = convBnSilu(network, weightMap, *concat55->getOutput(0), 640, 1, 1, 0, \"model.56\");\n\n    IElementWiseLayer* conv57 = convBnSilu(network, weightMap, *conv46->getOutput(0), 256, 1, 1, 0, \"model.57\");\n    IElementWiseLayer* conv58 = convBnSilu(network, weightMap, *conv46->getOutput(0), 256, 1, 1, 0, 
\"model.58\");\n\n    IElementWiseLayer* conv59 = convBnSilu(network, weightMap, *conv58->getOutput(0), 256, 3, 1, 1, \"model.59\");\n    IElementWiseLayer* conv60 = convBnSilu(network, weightMap, *conv59->getOutput(0), 256, 3, 1, 1, \"model.60\");\n    IElementWiseLayer* conv61 = convBnSilu(network, weightMap, *conv60->getOutput(0), 256, 3, 1, 1, \"model.61\");\n    IElementWiseLayer* conv62 = convBnSilu(network, weightMap, *conv61->getOutput(0), 256, 3, 1, 1, \"model.62\");\n    IElementWiseLayer* conv63 = convBnSilu(network, weightMap, *conv62->getOutput(0), 256, 3, 1, 1, \"model.63\");\n    IElementWiseLayer* conv64 = convBnSilu(network, weightMap, *conv63->getOutput(0), 256, 3, 1, 1, \"model.64\");\n    ITensor* input_tensor_65[] = { conv64->getOutput(0), conv62->getOutput(0),conv60->getOutput(0), conv58->getOutput(0),\n        conv57->getOutput(0) };\n    IConcatenationLayer* concat65 = network->addConcatenation(input_tensor_65, 5);\n    IElementWiseLayer* conv66 = convBnSilu(network, weightMap, *concat65->getOutput(0), 640, 1, 1, 0, \"model.66\");\n    auto conv67 = network->addElementWise(*conv66->getOutput(0), *conv56->getOutput(0), ElementWiseOperation::kSUM);\n\n    auto conv68 = DownC(network, weightMap, *conv67->getOutput(0), 640, 960, \"model.68\");//=====\n\n    IElementWiseLayer* conv69 = convBnSilu(network, weightMap, *conv68->getOutput(0), 384, 1, 1, 0, \"model.69\");\n    IElementWiseLayer* conv70 = convBnSilu(network, weightMap, *conv68->getOutput(0), 384, 1, 1, 0, \"model.70\");\n\n    IElementWiseLayer* conv71 = convBnSilu(network, weightMap, *conv70->getOutput(0), 384, 3, 1, 1, \"model.71\");\n    IElementWiseLayer* conv72 = convBnSilu(network, weightMap, *conv71->getOutput(0), 384, 3, 1, 1, \"model.72\");\n    IElementWiseLayer* conv73 = convBnSilu(network, weightMap, *conv72->getOutput(0), 384, 3, 1, 1, \"model.73\");\n    IElementWiseLayer* conv74 = convBnSilu(network, weightMap, *conv73->getOutput(0), 384, 3, 1, 1, \"model.74\");\n    
IElementWiseLayer* conv75 = convBnSilu(network, weightMap, *conv74->getOutput(0), 384, 3, 1, 1, \"model.75\");\n    IElementWiseLayer* conv76 = convBnSilu(network, weightMap, *conv75->getOutput(0), 384, 3, 1, 1, \"model.76\");\n    ITensor* input_tensor_77[] = { conv76->getOutput(0), conv74->getOutput(0),conv72->getOutput(0), conv70->getOutput(0),\n        conv69->getOutput(0) };\n    IConcatenationLayer* concat77 = network->addConcatenation(input_tensor_77, 5);\n    IElementWiseLayer* conv78 = convBnSilu(network, weightMap, *concat77->getOutput(0), 960, 1, 1, 0, \"model.78\");\n\n    IElementWiseLayer* conv79 = convBnSilu(network, weightMap, *conv68->getOutput(0), 384, 1, 1, 0, \"model.79\");\n    IElementWiseLayer* conv80 = convBnSilu(network, weightMap, *conv68->getOutput(0), 384, 1, 1, 0, \"model.80\");\n\n    IElementWiseLayer* conv81 = convBnSilu(network, weightMap, *conv80->getOutput(0), 384, 3, 1, 1, \"model.81\");\n    IElementWiseLayer* conv82 = convBnSilu(network, weightMap, *conv81->getOutput(0), 384, 3, 1, 1, \"model.82\");\n    IElementWiseLayer* conv83 = convBnSilu(network, weightMap, *conv82->getOutput(0), 384, 3, 1, 1, \"model.83\");\n    IElementWiseLayer* conv84 = convBnSilu(network, weightMap, *conv83->getOutput(0), 384, 3, 1, 1, \"model.84\");\n    IElementWiseLayer* conv85 = convBnSilu(network, weightMap, *conv84->getOutput(0), 384, 3, 1, 1, \"model.85\");\n    IElementWiseLayer* conv86 = convBnSilu(network, weightMap, *conv85->getOutput(0), 384, 3, 1, 1, \"model.86\");\n    ITensor* input_tensor_87[] = { conv86->getOutput(0), conv84->getOutput(0),conv82->getOutput(0), conv80->getOutput(0),\n        conv79->getOutput(0) };\n    IConcatenationLayer* concat87 = network->addConcatenation(input_tensor_87, 5);\n    IElementWiseLayer* conv88 = convBnSilu(network, weightMap, *concat87->getOutput(0), 960, 1, 1, 0, \"model.88\");\n    auto conv89 = network->addElementWise(*conv88->getOutput(0), *conv78->getOutput(0), ElementWiseOperation::kSUM);\n\n\n  
  auto conv90 = DownC(network, weightMap, *conv89->getOutput(0), 960, 1280, \"model.90\");\n\n    IElementWiseLayer* conv91 = convBnSilu(network, weightMap, *conv90->getOutput(0), 512, 1, 1, 0, \"model.91\");\n    IElementWiseLayer* conv92 = convBnSilu(network, weightMap, *conv90->getOutput(0), 512, 1, 1, 0, \"model.92\");\n\n    IElementWiseLayer* conv93 = convBnSilu(network, weightMap, *conv92->getOutput(0), 512, 3, 1, 1, \"model.93\");\n    IElementWiseLayer* conv94 = convBnSilu(network, weightMap, *conv93->getOutput(0), 512, 3, 1, 1, \"model.94\");\n    IElementWiseLayer* conv95 = convBnSilu(network, weightMap, *conv94->getOutput(0), 512, 3, 1, 1, \"model.95\");\n    IElementWiseLayer* conv96 = convBnSilu(network, weightMap, *conv95->getOutput(0), 512, 3, 1, 1, \"model.96\");\n    IElementWiseLayer* conv97 = convBnSilu(network, weightMap, *conv96->getOutput(0), 512, 3, 1, 1, \"model.97\");\n    IElementWiseLayer* conv98 = convBnSilu(network, weightMap, *conv97->getOutput(0), 512, 3, 1, 1, \"model.98\");\n    ITensor* input_tensor_99[] = { conv98->getOutput(0), conv96->getOutput(0),conv94->getOutput(0), conv92->getOutput(0),\n      conv91->getOutput(0) };\n    IConcatenationLayer* concat99 = network->addConcatenation(input_tensor_99, 5);\n    IElementWiseLayer* conv100 = convBnSilu(network, weightMap, *concat99->getOutput(0), 1280, 1, 1, 0, \"model.100\");\n    \n    IElementWiseLayer* conv101 = convBnSilu(network, weightMap, *conv90->getOutput(0), 512, 1, 1, 0, \"model.101\");\n    IElementWiseLayer* conv102 = convBnSilu(network, weightMap, *conv90->getOutput(0), 512, 1, 1, 0, \"model.102\");\n\n    IElementWiseLayer* conv103 = convBnSilu(network, weightMap, *conv102->getOutput(0), 512, 3, 1, 1, \"model.103\");\n    IElementWiseLayer* conv104 = convBnSilu(network, weightMap, *conv103->getOutput(0), 512, 3, 1, 1, \"model.104\");\n    IElementWiseLayer* conv105 = convBnSilu(network, weightMap, *conv104->getOutput(0), 512, 3, 1, 1, \"model.105\");\n    
IElementWiseLayer* conv106 = convBnSilu(network, weightMap, *conv105->getOutput(0), 512, 3, 1, 1, \"model.106\");\n    IElementWiseLayer* conv107 = convBnSilu(network, weightMap, *conv106->getOutput(0), 512, 3, 1, 1, \"model.107\");\n    IElementWiseLayer* conv108 = convBnSilu(network, weightMap, *conv107->getOutput(0), 512, 3, 1, 1, \"model.108\");\n    ITensor* input_tensor_109[] = { conv108->getOutput(0), conv106->getOutput(0),conv104->getOutput(0), conv102->getOutput(0),\n      conv101->getOutput(0) };\n    IConcatenationLayer* concat109 = network->addConcatenation(input_tensor_109, 5);\n    IElementWiseLayer* conv110 = convBnSilu(network, weightMap, *concat109->getOutput(0), 1280, 1, 1, 0, \"model.110\");\n    auto conv111 = network->addElementWise(*conv110->getOutput(0), *conv100->getOutput(0), ElementWiseOperation::kSUM);\n    //---------------------------yolov7e6e head---------------------------------\n    auto conv112 = SPPCSPC(network, weightMap, *conv111->getOutput(0), 640, \"model.112\");\n    IElementWiseLayer* conv113 = convBnSilu(network, weightMap, *conv112->getOutput(0), 480, 1, 1, 0, \"model.113\");\n\n\n    float scale[] = { 1.0, 2.0, 2.0 };\n    IResizeLayer* re114 = network->addResize(*conv113->getOutput(0));\n    re114->setResizeMode(ResizeMode::kNEAREST);\n    re114->setScales(scale, 3);\n\n    IElementWiseLayer* conv115 = convBnSilu(network, weightMap, *conv89->getOutput(0), 480, 1, 1, 0, \"model.115\");\n    ITensor* input_tensor_116[] = { conv115->getOutput(0), re114->getOutput(0) };\n    IConcatenationLayer* concat116 = network->addConcatenation(input_tensor_116, 2);\n\n\n    IElementWiseLayer* conv117 = convBnSilu(network, weightMap, *concat116->getOutput(0), 384, 1, 1, 0, \"model.117\");\n    IElementWiseLayer* conv118 = convBnSilu(network, weightMap, *concat116->getOutput(0), 384, 1, 1, 0, \"model.118\");\n\n    IElementWiseLayer* conv119 = convBnSilu(network, weightMap, *conv118->getOutput(0), 192, 3, 1, 1, \"model.119\");\n    
IElementWiseLayer* conv120 = convBnSilu(network, weightMap, *conv119->getOutput(0), 192, 3, 1, 1, \"model.120\");\n    IElementWiseLayer* conv121 = convBnSilu(network, weightMap, *conv120->getOutput(0), 192, 3, 1, 1, \"model.121\");\n    IElementWiseLayer* conv122 = convBnSilu(network, weightMap, *conv121->getOutput(0), 192, 3, 1, 1, \"model.122\");\n    IElementWiseLayer* conv123 = convBnSilu(network, weightMap, *conv122->getOutput(0), 192, 3, 1, 1, \"model.123\");\n    IElementWiseLayer* conv124 = convBnSilu(network, weightMap, *conv123->getOutput(0), 192, 3, 1, 1, \"model.124\");\n    ITensor* input_tensor_125[] = { conv124->getOutput(0), conv123->getOutput(0),conv122->getOutput(0), conv121->getOutput(0),\n        conv120->getOutput(0), conv119->getOutput(0), conv118->getOutput(0), conv117->getOutput(0) };\n    IConcatenationLayer* concat125 = network->addConcatenation(input_tensor_125, 8);\n    IElementWiseLayer* conv126 = convBnSilu(network, weightMap, *concat125->getOutput(0), 480, 1, 1, 0, \"model.126\");\n\n    IElementWiseLayer* conv127 = convBnSilu(network, weightMap, *concat116->getOutput(0), 384, 1, 1, 0, \"model.127\");\n    IElementWiseLayer* conv128 = convBnSilu(network, weightMap, *concat116->getOutput(0), 384, 1, 1, 0, \"model.128\");\n\n    IElementWiseLayer* conv129 = convBnSilu(network, weightMap, *conv128->getOutput(0), 192, 3, 1, 1, \"model.129\");\n    IElementWiseLayer* conv130 = convBnSilu(network, weightMap, *conv129->getOutput(0), 192, 3, 1, 1, \"model.130\");\n    IElementWiseLayer* conv131 = convBnSilu(network, weightMap, *conv130->getOutput(0), 192, 3, 1, 1, \"model.131\");\n    IElementWiseLayer* conv132 = convBnSilu(network, weightMap, *conv131->getOutput(0), 192, 3, 1, 1, \"model.132\");\n    IElementWiseLayer* conv133 = convBnSilu(network, weightMap, *conv132->getOutput(0), 192, 3, 1, 1, \"model.133\");\n    IElementWiseLayer* conv134 = convBnSilu(network, weightMap, *conv133->getOutput(0), 192, 3, 1, 1, \"model.134\");\n    
ITensor* input_tensor_135[] = { conv134->getOutput(0), conv133->getOutput(0),conv132->getOutput(0), conv131->getOutput(0),\n        conv130->getOutput(0), conv129->getOutput(0), conv128->getOutput(0), conv127->getOutput(0) };\n    IConcatenationLayer* concat135 = network->addConcatenation(input_tensor_135, 8);\n    IElementWiseLayer* conv136 = convBnSilu(network, weightMap, *concat135->getOutput(0), 480, 1, 1, 0, \"model.136\");\n    auto conv137 = network->addElementWise(*conv136->getOutput(0), *conv126->getOutput(0), ElementWiseOperation::kSUM);\n\n    IElementWiseLayer* conv138 = convBnSilu(network, weightMap, *conv137->getOutput(0), 320, 1, 1, 0, \"model.138\");\n    IResizeLayer* re139 = network->addResize(*conv138->getOutput(0));\n    re139->setResizeMode(ResizeMode::kNEAREST);\n    re139->setScales(scale, 3);\n    IElementWiseLayer* conv140 = convBnSilu(network, weightMap, *conv67->getOutput(0), 320, 1, 1, 0, \"model.140\");\n    ITensor* input_tensor_141[] = { conv140->getOutput(0), re139->getOutput(0) };\n    IConcatenationLayer* concat141 = network->addConcatenation(input_tensor_141, 2);\n\n    IElementWiseLayer* conv142 = convBnSilu(network, weightMap, *concat141->getOutput(0), 256, 1, 1, 0, \"model.142\");\n    IElementWiseLayer* conv143 = convBnSilu(network, weightMap, *concat141->getOutput(0), 256, 1, 1, 0, \"model.143\");\n\n    IElementWiseLayer* conv144 = convBnSilu(network, weightMap, *conv143->getOutput(0), 128, 3, 1, 1, \"model.144\");\n    IElementWiseLayer* conv145 = convBnSilu(network, weightMap, *conv144->getOutput(0), 128, 3, 1, 1, \"model.145\");\n    IElementWiseLayer* conv146 = convBnSilu(network, weightMap, *conv145->getOutput(0), 128, 3, 1, 1, \"model.146\");\n    IElementWiseLayer* conv147 = convBnSilu(network, weightMap, *conv146->getOutput(0), 128, 3, 1, 1, \"model.147\");\n    IElementWiseLayer* conv148 = convBnSilu(network, weightMap, *conv147->getOutput(0), 128, 3, 1, 1, \"model.148\");\n    IElementWiseLayer* conv149 = 
convBnSilu(network, weightMap, *conv148->getOutput(0), 128, 3, 1, 1, \"model.149\");\n\n    ITensor* input_tensor_150[] = { conv149->getOutput(0), conv148->getOutput(0),conv147->getOutput(0), conv146->getOutput(0),\n        conv145->getOutput(0), conv144->getOutput(0), conv143->getOutput(0), conv142->getOutput(0) };\n    IConcatenationLayer* concat150 = network->addConcatenation(input_tensor_150, 8);\n\n    IElementWiseLayer* conv151 = convBnSilu(network, weightMap, *concat150->getOutput(0), 320, 1, 1, 0, \"model.151\");\n\n    IElementWiseLayer* conv152 = convBnSilu(network, weightMap, *concat141->getOutput(0), 256, 1, 1, 0, \"model.152\");\n    IElementWiseLayer* conv153 = convBnSilu(network, weightMap, *concat141->getOutput(0), 256, 1, 1, 0, \"model.153\");\n\n    IElementWiseLayer* conv154 = convBnSilu(network, weightMap, *conv153->getOutput(0), 128, 3, 1, 1, \"model.154\");\n    IElementWiseLayer* conv155 = convBnSilu(network, weightMap, *conv154->getOutput(0), 128, 3, 1, 1, \"model.155\");\n    IElementWiseLayer* conv156 = convBnSilu(network, weightMap, *conv155->getOutput(0), 128, 3, 1, 1, \"model.156\");\n    IElementWiseLayer* conv157 = convBnSilu(network, weightMap, *conv156->getOutput(0), 128, 3, 1, 1, \"model.157\");\n    IElementWiseLayer* conv158 = convBnSilu(network, weightMap, *conv157->getOutput(0), 128, 3, 1, 1, \"model.158\");\n    IElementWiseLayer* conv159 = convBnSilu(network, weightMap, *conv158->getOutput(0), 128, 3, 1, 1, \"model.159\");\n    ITensor* input_tensor_160[] = { conv159->getOutput(0), conv158->getOutput(0),conv157->getOutput(0), conv156->getOutput(0),\n        conv155->getOutput(0), conv154->getOutput(0), conv153->getOutput(0), conv152->getOutput(0) };\n    IConcatenationLayer* concat160 = network->addConcatenation(input_tensor_160, 8);\n    IElementWiseLayer* conv161 = convBnSilu(network, weightMap, *concat160->getOutput(0), 320, 1, 1, 0, \"model.161\");\n    auto conv162 = network->addElementWise(*conv161->getOutput(0), 
*conv151->getOutput(0), ElementWiseOperation::kSUM);\n\n    IElementWiseLayer* conv163 = convBnSilu(network, weightMap, *conv162->getOutput(0), 160, 1, 1, 0, \"model.163\");\n\n    IResizeLayer* re164 = network->addResize(*conv163->getOutput(0));\n    re164->setResizeMode(ResizeMode::kNEAREST);\n    re164->setScales(scale, 3);\n\n    IElementWiseLayer* conv165 = convBnSilu(network, weightMap, *conv45->getOutput(0), 160, 1, 1, 0, \"model.165\");\n    ITensor* input_tensor_166[] = { conv165->getOutput(0), re164->getOutput(0) };\n    IConcatenationLayer* concat166 = network->addConcatenation(input_tensor_166, 2);\n\n    IElementWiseLayer* conv167 = convBnSilu(network, weightMap, *concat166->getOutput(0), 128, 1, 1, 0, \"model.167\");\n    IElementWiseLayer* conv168 = convBnSilu(network, weightMap, *concat166->getOutput(0), 128, 1, 1, 0, \"model.168\");\n    IElementWiseLayer* conv169 = convBnSilu(network, weightMap, *conv168->getOutput(0), 64, 3, 1, 1, \"model.169\");\n    IElementWiseLayer* conv170 = convBnSilu(network, weightMap, *conv169->getOutput(0), 64, 3, 1, 1, \"model.170\");\n    IElementWiseLayer* conv171 = convBnSilu(network, weightMap, *conv170->getOutput(0), 64, 3, 1, 1, \"model.171\");\n    IElementWiseLayer* conv172 = convBnSilu(network, weightMap, *conv171->getOutput(0), 64, 3, 1, 1, \"model.172\");\n    IElementWiseLayer* conv173 = convBnSilu(network, weightMap, *conv172->getOutput(0), 64, 3, 1, 1, \"model.173\");\n    IElementWiseLayer* conv174 = convBnSilu(network, weightMap, *conv173->getOutput(0), 64, 3, 1, 1, \"model.174\");\n\n\n    ITensor* input_tensor_175[] = { conv174->getOutput(0), conv173->getOutput(0),conv172->getOutput(0), conv171->getOutput(0),\n       conv170->getOutput(0), conv169->getOutput(0), conv168->getOutput(0), conv167->getOutput(0) };\n    IConcatenationLayer* concat175 = network->addConcatenation(input_tensor_175, 8); \n    IElementWiseLayer* conv176 = convBnSilu(network, weightMap, *concat175->getOutput(0), 160, 1, 1, 0, 
\"model.176\");\n    IElementWiseLayer* conv177 = convBnSilu(network, weightMap, *concat166->getOutput(0), 128, 1, 1, 0, \"model.177\");\n    IElementWiseLayer* conv178 = convBnSilu(network, weightMap, *concat166->getOutput(0), 128, 1, 1, 0, \"model.178\");\n    \n    IElementWiseLayer* conv179 = convBnSilu(network, weightMap, *conv178->getOutput(0), 64, 3, 1, 1, \"model.179\");\n    IElementWiseLayer* conv180 = convBnSilu(network, weightMap, *conv179->getOutput(0), 64, 3, 1, 1, \"model.180\");\n    IElementWiseLayer* conv181 = convBnSilu(network, weightMap, *conv180->getOutput(0), 64, 3, 1, 1, \"model.181\");\n    IElementWiseLayer* conv182 = convBnSilu(network, weightMap, *conv181->getOutput(0), 64, 3, 1, 1, \"model.182\");\n    IElementWiseLayer* conv183 = convBnSilu(network, weightMap, *conv182->getOutput(0), 64, 3, 1, 1, \"model.183\");\n    IElementWiseLayer* conv184 = convBnSilu(network, weightMap, *conv183->getOutput(0), 64, 3, 1, 1, \"model.184\");\n    ITensor* input_tensor_185[] = { conv184->getOutput(0), conv183->getOutput(0),conv182->getOutput(0), conv181->getOutput(0),\n       conv180->getOutput(0), conv179->getOutput(0), conv178->getOutput(0), conv177->getOutput(0) };\n    IConcatenationLayer* concat185 = network->addConcatenation(input_tensor_185, 8);\n    IElementWiseLayer* conv186 = convBnSilu(network, weightMap, *concat185->getOutput(0), 160, 1, 1, 0, \"model.186\");\n    auto conv187 = network->addElementWise(*conv186->getOutput(0), *conv176->getOutput(0), ElementWiseOperation::kSUM);\n\n    auto conv188 = DownC(network, weightMap, *conv187->getOutput(0), 160, 320, \"model.188\");\n\n\n    ITensor* input_tensor_189[] = { conv188->getOutput(0), conv162->getOutput(0) };\n    IConcatenationLayer* concat189 = network->addConcatenation(input_tensor_189, 2);\n\n    IElementWiseLayer* conv190 = convBnSilu(network, weightMap, *concat189->getOutput(0), 256, 1, 1, 0, \"model.190\");\n    IElementWiseLayer* conv191 = convBnSilu(network, weightMap, 
*concat189->getOutput(0), 256, 1, 1, 0, \"model.191\");\n\n    IElementWiseLayer* conv192 = convBnSilu(network, weightMap, *conv191->getOutput(0), 128, 3, 1, 1, \"model.192\");\n    IElementWiseLayer* conv193 = convBnSilu(network, weightMap, *conv192->getOutput(0), 128, 3, 1, 1, \"model.193\");\n    IElementWiseLayer* conv194 = convBnSilu(network, weightMap, *conv193->getOutput(0), 128, 3, 1, 1, \"model.194\");\n    IElementWiseLayer* conv195 = convBnSilu(network, weightMap, *conv194->getOutput(0), 128, 3, 1, 1, \"model.195\");\n    IElementWiseLayer* conv196 = convBnSilu(network, weightMap, *conv195->getOutput(0), 128, 3, 1, 1, \"model.196\");\n    IElementWiseLayer* conv197 = convBnSilu(network, weightMap, *conv196->getOutput(0), 128, 3, 1, 1, \"model.197\");\n\n\n    ITensor* input_tensor_198[] = { conv197->getOutput(0), conv196->getOutput(0),conv195->getOutput(0), conv194->getOutput(0),\n       conv193->getOutput(0), conv192->getOutput(0), conv191->getOutput(0), conv190->getOutput(0) };\n    IConcatenationLayer* concat198 = network->addConcatenation(input_tensor_198, 8);\n    IElementWiseLayer* conv199 = convBnSilu(network, weightMap, *concat198->getOutput(0), 320, 1, 1, 0, \"model.199\");\n\n    IElementWiseLayer* conv200 = convBnSilu(network, weightMap, *concat189->getOutput(0), 256, 1, 1, 0, \"model.200\");\n    IElementWiseLayer* conv201 = convBnSilu(network, weightMap, *concat189->getOutput(0), 256, 1, 1, 0, \"model.201\");\n\n    IElementWiseLayer* conv202 = convBnSilu(network, weightMap, *conv201->getOutput(0), 128, 3, 1, 1, \"model.202\");\n    IElementWiseLayer* conv203 = convBnSilu(network, weightMap, *conv202->getOutput(0), 128, 3, 1, 1, \"model.203\");\n    IElementWiseLayer* conv204 = convBnSilu(network, weightMap, *conv203->getOutput(0), 128, 3, 1, 1, \"model.204\");\n    IElementWiseLayer* conv205 = convBnSilu(network, weightMap, *conv204->getOutput(0), 128, 3, 1, 1, \"model.205\");\n    IElementWiseLayer* conv206 = convBnSilu(network, weightMap, 
*conv205->getOutput(0), 128, 3, 1, 1, \"model.206\");\n    IElementWiseLayer* conv207 = convBnSilu(network, weightMap, *conv206->getOutput(0), 128, 3, 1, 1, \"model.207\");\n    ITensor* input_tensor_208[] = { conv207->getOutput(0), conv206->getOutput(0),conv205->getOutput(0), conv204->getOutput(0),\n      conv203->getOutput(0), conv202->getOutput(0), conv201->getOutput(0), conv200->getOutput(0) };\n    IConcatenationLayer* concat208 = network->addConcatenation(input_tensor_208, 8);\n    IElementWiseLayer* conv209 = convBnSilu(network, weightMap, *concat208->getOutput(0), 320, 1, 1, 0, \"model.209\");\n    auto conv210 = network->addElementWise(*conv209->getOutput(0), *conv199->getOutput(0), ElementWiseOperation::kSUM);\n\n\n    auto conv211 = DownC(network, weightMap, *conv210->getOutput(0), 320, 480, \"model.211\");\n    ITensor* input_tensor_212[] = { conv211->getOutput(0), conv137->getOutput(0) };\n    IConcatenationLayer* concat212 = network->addConcatenation(input_tensor_212, 2);\n\n    IElementWiseLayer* conv213 = convBnSilu(network, weightMap, *concat212->getOutput(0), 384, 1, 1, 0, \"model.213\");\n    IElementWiseLayer* conv214 = convBnSilu(network, weightMap, *concat212->getOutput(0), 384, 1, 1, 0, \"model.214\");\n\n    IElementWiseLayer* conv215 = convBnSilu(network, weightMap, *conv214->getOutput(0), 192, 3, 1, 1, \"model.215\");\n    IElementWiseLayer* conv216 = convBnSilu(network, weightMap, *conv215->getOutput(0), 192, 3, 1, 1, \"model.216\");\n    IElementWiseLayer* conv217 = convBnSilu(network, weightMap, *conv216->getOutput(0), 192, 3, 1, 1, \"model.217\");\n    IElementWiseLayer* conv218 = convBnSilu(network, weightMap, *conv217->getOutput(0), 192, 3, 1, 1, \"model.218\");\n    IElementWiseLayer* conv219 = convBnSilu(network, weightMap, *conv218->getOutput(0), 192, 3, 1, 1, \"model.219\");\n    IElementWiseLayer* conv220 = convBnSilu(network, weightMap, *conv219->getOutput(0), 192, 3, 1, 1, \"model.220\");\n\n    ITensor* input_tensor_221[] = { 
conv220->getOutput(0), conv219->getOutput(0),conv218->getOutput(0), conv217->getOutput(0),\n      conv216->getOutput(0), conv215->getOutput(0), conv214->getOutput(0), conv213->getOutput(0) };\n    IConcatenationLayer* concat221 = network->addConcatenation(input_tensor_221, 8);\n    IElementWiseLayer* conv222 = convBnSilu(network, weightMap, *concat221->getOutput(0), 480, 1, 1, 0, \"model.222\");\n\n    IElementWiseLayer* conv223 = convBnSilu(network, weightMap, *concat212->getOutput(0), 384, 1, 1, 0, \"model.223\");\n    IElementWiseLayer* conv224 = convBnSilu(network, weightMap, *concat212->getOutput(0), 384, 1, 1, 0, \"model.224\");\n\n    IElementWiseLayer* conv225 = convBnSilu(network, weightMap, *conv224->getOutput(0), 192, 3, 1, 1, \"model.225\");\n    IElementWiseLayer* conv226 = convBnSilu(network, weightMap, *conv225->getOutput(0), 192, 3, 1, 1, \"model.226\");\n    IElementWiseLayer* conv227 = convBnSilu(network, weightMap, *conv226->getOutput(0), 192, 3, 1, 1, \"model.227\");\n    IElementWiseLayer* conv228 = convBnSilu(network, weightMap, *conv227->getOutput(0), 192, 3, 1, 1, \"model.228\");\n    IElementWiseLayer* conv229 = convBnSilu(network, weightMap, *conv228->getOutput(0), 192, 3, 1, 1, \"model.229\");\n    IElementWiseLayer* conv230 = convBnSilu(network, weightMap, *conv229->getOutput(0), 192, 3, 1, 1, \"model.230\");\n    ITensor* input_tensor_231[] = { conv230->getOutput(0), conv229->getOutput(0),conv228->getOutput(0), conv227->getOutput(0),\n     conv226->getOutput(0), conv225->getOutput(0), conv224->getOutput(0), conv223->getOutput(0) };\n    IConcatenationLayer* concat231 = network->addConcatenation(input_tensor_231, 8);\n    IElementWiseLayer* conv232 = convBnSilu(network, weightMap, *concat231->getOutput(0), 480, 1, 1, 0, \"model.232\");\n\n    auto conv233 = network->addElementWise(*conv232->getOutput(0), *conv222->getOutput(0), ElementWiseOperation::kSUM);\n\n\n    auto conv234 = DownC(network, weightMap, *conv233->getOutput(0), 480, 
640, \"model.234\");\n    ITensor* input_tensor_235[] = { conv234->getOutput(0), conv112->getOutput(0) };\n    IConcatenationLayer* concat235 = network->addConcatenation(input_tensor_235, 2);\n\n\n    IElementWiseLayer* conv236 = convBnSilu(network, weightMap, *concat235->getOutput(0), 512, 1, 1, 0, \"model.236\");\n    IElementWiseLayer* conv237 = convBnSilu(network, weightMap, *concat235->getOutput(0), 512, 1, 1, 0, \"model.237\");\n\n    IElementWiseLayer* conv238 = convBnSilu(network, weightMap, *conv237->getOutput(0), 256, 3, 1, 1, \"model.238\");\n    IElementWiseLayer* conv239 = convBnSilu(network, weightMap, *conv238->getOutput(0), 256, 3, 1, 1, \"model.239\");\n    IElementWiseLayer* conv240 = convBnSilu(network, weightMap, *conv239->getOutput(0), 256, 3, 1, 1, \"model.240\");\n    IElementWiseLayer* conv241 = convBnSilu(network, weightMap, *conv240->getOutput(0), 256, 3, 1, 1, \"model.241\");\n    IElementWiseLayer* conv242 = convBnSilu(network, weightMap, *conv241->getOutput(0), 256, 3, 1, 1, \"model.242\");\n    IElementWiseLayer* conv243 = convBnSilu(network, weightMap, *conv242->getOutput(0), 256, 3, 1, 1, \"model.243\");\n  \n    ITensor* input_tensor_244[] = { conv243->getOutput(0), conv242->getOutput(0),conv241->getOutput(0), conv240->getOutput(0),\n     conv239->getOutput(0), conv238->getOutput(0), conv237->getOutput(0), conv236->getOutput(0) };\n    IConcatenationLayer* concat244 = network->addConcatenation(input_tensor_244, 8);\n    IElementWiseLayer* conv245 = convBnSilu(network, weightMap, *concat244->getOutput(0), 640, 1, 1, 0, \"model.245\");\n\n    IElementWiseLayer* conv246 = convBnSilu(network, weightMap, *concat235->getOutput(0), 512, 1, 1, 0, \"model.246\");\n    IElementWiseLayer* conv247 = convBnSilu(network, weightMap, *concat235->getOutput(0), 512, 1, 1, 0, \"model.247\");\n\n    IElementWiseLayer* conv248 = convBnSilu(network, weightMap, *conv247->getOutput(0), 256, 3, 1, 1, \"model.248\");\n    IElementWiseLayer* conv249 = 
convBnSilu(network, weightMap, *conv248->getOutput(0), 256, 3, 1, 1, \"model.249\");\n    IElementWiseLayer* conv250 = convBnSilu(network, weightMap, *conv249->getOutput(0), 256, 3, 1, 1, \"model.250\");\n    IElementWiseLayer* conv251 = convBnSilu(network, weightMap, *conv250->getOutput(0), 256, 3, 1, 1, \"model.251\");\n    IElementWiseLayer* conv252 = convBnSilu(network, weightMap, *conv251->getOutput(0), 256, 3, 1, 1, \"model.252\");\n    IElementWiseLayer* conv253 = convBnSilu(network, weightMap, *conv252->getOutput(0), 256, 3, 1, 1, \"model.253\");\n\n    ITensor* input_tensor_254[] = { conv253->getOutput(0), conv252->getOutput(0),conv251->getOutput(0), conv250->getOutput(0),\n    conv249->getOutput(0), conv248->getOutput(0), conv247->getOutput(0), conv246->getOutput(0) };\n    IConcatenationLayer* concat254 = network->addConcatenation(input_tensor_254, 8);\n\n    IElementWiseLayer* conv255= convBnSilu(network, weightMap, *concat254->getOutput(0), 640, 1, 1, 0, \"model.255\");\n    auto conv256 = network->addElementWise(*conv255->getOutput(0), *conv245->getOutput(0), ElementWiseOperation::kSUM);\n\n    IElementWiseLayer* conv257 = convBnSilu(network, weightMap, *conv187->getOutput(0), 320, 3, 1, 1, \"model.257\");\n    IElementWiseLayer* conv258 = convBnSilu(network, weightMap, *conv210->getOutput(0), 640, 3, 1, 1, \"model.258\");\n    IElementWiseLayer* conv259 = convBnSilu(network, weightMap, *conv233->getOutput(0), 960, 3, 1, 1, \"model.259\");\n    IElementWiseLayer* conv260 = convBnSilu(network, weightMap, *conv256->getOutput(0), 1280, 3, 1, 1, \"model.260\");\n\n\n\n    // out\n    IConvolutionLayer* cv105_0 = network->addConvolutionNd(*conv257->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.261.m.0.weight\"], weightMap[\"model.261.m.0.bias\"]);\n    assert(cv105_0);\n    cv105_0->setName(\"cv105.0\");\n    IConvolutionLayer* cv105_1 = network->addConvolutionNd(*conv258->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 
1, 1 }, weightMap[\"model.261.m.1.weight\"], weightMap[\"model.261.m.1.bias\"]);\n    assert(cv105_1);\n    cv105_1->setName(\"cv105.1\");\n    IConvolutionLayer* cv105_2 = network->addConvolutionNd(*conv259->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.261.m.2.weight\"], weightMap[\"model.261.m.2.bias\"]);\n    assert(cv105_2);\n    cv105_2->setName(\"cv105.2\");\n    IConvolutionLayer* cv105_3 = network->addConvolutionNd(*conv260->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.261.m.3.weight\"], weightMap[\"model.261.m.3.bias\"]);\n    assert(cv105_3);\n    cv105_3->setName(\"cv105.3\");\n\n    /*------------detect-----------*/\n    auto yolo = addYoLoLayer(network, weightMap, \"model.261\", std::vector<IConvolutionLayer*>{cv105_0, cv105_1, cv105_2, cv105_3});\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\n\nIHostMemory* build_engine_yolov7d6(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, const std::string& wts_path) {\n    std::map<std::string, Weights> weightMap = loadWeights(wts_path);\n\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kInputH, kInputW });\n    assert(data);\n\n    /*----------------------------------yolov7d6 backbone-----------------------------------------*/\n    auto* conv0 = ReOrg(network, weightMap, *data, 3);\n\n\n    IElementWiseLayer* conv1 = convBnSilu(network, weightMap, *conv0->getOutput(0), 96, 3, 1, 1, \"model.1\");\n    auto conv2 = DownC(network, weightMap, *conv1->getOutput(0), 96, 192, \"model.2\");\n\n    IElementWiseLayer* conv3 = convBnSilu(network, weightMap, *conv2->getOutput(0), 64, 1, 1, 0, \"model.3\");\n    IElementWiseLayer* conv4 = convBnSilu(network, weightMap, *conv2->getOutput(0), 64, 1, 1, 0, \"model.4\");\n\n    IElementWiseLayer* conv5 = convBnSilu(network, weightMap, *conv4->getOutput(0), 64, 3, 1, 1, \"model.5\");\n    IElementWiseLayer* conv6 = convBnSilu(network, weightMap, *conv5->getOutput(0), 64, 3, 1, 1, \"model.6\");\n    IElementWiseLayer* conv7 = 
convBnSilu(network, weightMap, *conv6->getOutput(0), 64, 3, 1, 1, \"model.7\");\n    IElementWiseLayer* conv8 = convBnSilu(network, weightMap, *conv7->getOutput(0), 64, 3, 1, 1, \"model.8\");\n    IElementWiseLayer* conv9 = convBnSilu(network, weightMap, *conv8->getOutput(0), 64, 3, 1, 1, \"model.9\");\n    IElementWiseLayer* conv10 = convBnSilu(network, weightMap, *conv9->getOutput(0), 64, 3, 1, 1, \"model.10\");\n    IElementWiseLayer* conv11 = convBnSilu(network, weightMap, *conv10->getOutput(0), 64, 3, 1, 1, \"model.11\");\n    IElementWiseLayer* conv12 = convBnSilu(network, weightMap, *conv11->getOutput(0), 64, 3, 1, 1, \"model.12\");\n\n    ITensor* input_tensor_13[] = { conv12->getOutput(0), conv10->getOutput(0),conv8->getOutput(0), conv6->getOutput(0),\n        conv4->getOutput(0),conv3->getOutput(0) };\n    IConcatenationLayer* concat13 = network->addConcatenation(input_tensor_13, 6);\n\n    IElementWiseLayer* conv14 = convBnSilu(network, weightMap, *concat13->getOutput(0), 192, 1, 1, 0, \"model.14\");\n\n\n    auto conv15 = DownC(network, weightMap, *conv14->getOutput(0), 192, 384, \"model.15\");\n    IElementWiseLayer* conv16 = convBnSilu(network, weightMap, *conv15->getOutput(0), 128, 1, 1, 0, \"model.16\");\n    IElementWiseLayer* conv17 = convBnSilu(network, weightMap, *conv15->getOutput(0), 128, 1, 1, 0, \"model.17\");\n\n    IElementWiseLayer* conv18 = convBnSilu(network, weightMap, *conv17->getOutput(0), 128, 3, 1, 1, \"model.18\");\n    IElementWiseLayer* conv19 = convBnSilu(network, weightMap, *conv18->getOutput(0), 128, 3, 1, 1, \"model.19\");\n    IElementWiseLayer* conv20 = convBnSilu(network, weightMap, *conv19->getOutput(0), 128, 3, 1, 1, \"model.20\");\n    IElementWiseLayer* conv21 = convBnSilu(network, weightMap, *conv20->getOutput(0), 128, 3, 1, 1, \"model.21\");\n    IElementWiseLayer* conv22 = convBnSilu(network, weightMap, *conv21->getOutput(0), 128, 3, 1, 1, \"model.22\");\n    IElementWiseLayer* conv23 = convBnSilu(network, 
weightMap, *conv22->getOutput(0), 128, 3, 1, 1, \"model.23\");\n    IElementWiseLayer* conv24 = convBnSilu(network, weightMap, *conv23->getOutput(0), 128, 3, 1, 1, \"model.24\");\n    IElementWiseLayer* conv25 = convBnSilu(network, weightMap, *conv24->getOutput(0), 128, 3, 1, 1, \"model.25\");\n    ITensor* input_tensor_26[] = { conv25->getOutput(0), conv23->getOutput(0),conv21->getOutput(0), conv19->getOutput(0),\n        conv17->getOutput(0),conv16->getOutput(0) };\n    IConcatenationLayer* concat26 = network->addConcatenation(input_tensor_26, 6);\n\n    IElementWiseLayer* conv27 = convBnSilu(network, weightMap, *concat26->getOutput(0), 384, 1, 1, 0, \"model.27\");\n\n\n    auto conv28 = DownC(network, weightMap, *conv27->getOutput(0), 384, 768, \"model.28\");\n    IElementWiseLayer* conv29 = convBnSilu(network, weightMap, *conv28->getOutput(0), 256, 1, 1, 0, \"model.29\");\n    IElementWiseLayer* conv30 = convBnSilu(network, weightMap, *conv28->getOutput(0), 256, 1, 1, 0, \"model.30\");\n\n    IElementWiseLayer* conv31 = convBnSilu(network, weightMap, *conv30->getOutput(0), 256, 3, 1, 1, \"model.31\");\n    IElementWiseLayer* conv32 = convBnSilu(network, weightMap, *conv31->getOutput(0), 256, 3, 1, 1, \"model.32\");\n    IElementWiseLayer* conv33 = convBnSilu(network, weightMap, *conv32->getOutput(0), 256, 3, 1, 1, \"model.33\");\n    IElementWiseLayer* conv34 = convBnSilu(network, weightMap, *conv33->getOutput(0), 256, 3, 1, 1, \"model.34\");\n    IElementWiseLayer* conv35 = convBnSilu(network, weightMap, *conv34->getOutput(0), 256, 3, 1, 1, \"model.35\");\n    IElementWiseLayer* conv36 = convBnSilu(network, weightMap, *conv35->getOutput(0), 256, 3, 1, 1, \"model.36\");\n    IElementWiseLayer* conv37 = convBnSilu(network, weightMap, *conv36->getOutput(0), 256, 3, 1, 1, \"model.37\");\n    IElementWiseLayer* conv38 = convBnSilu(network, weightMap, *conv37->getOutput(0), 256, 3, 1, 1, \"model.38\");\n    ITensor* input_tensor_39[] = { conv38->getOutput(0), 
conv36->getOutput(0),conv34->getOutput(0), conv32->getOutput(0),\n        conv30->getOutput(0), conv29 ->getOutput(0)};\n    IConcatenationLayer* concat39 = network->addConcatenation(input_tensor_39, 6);\n\n    IElementWiseLayer* conv40 = convBnSilu(network, weightMap, *concat39->getOutput(0), 768, 1, 1, 0, \"model.40\");\n    auto conv41 = DownC(network, weightMap, *conv40->getOutput(0), 768, 1152, \"model.41\");\n    IElementWiseLayer* conv42 = convBnSilu(network, weightMap, *conv41->getOutput(0), 384, 1, 1, 0, \"model.42\");\n    IElementWiseLayer* conv43 = convBnSilu(network, weightMap, *conv41->getOutput(0), 384, 1, 1, 0, \"model.43\");\n\n    IElementWiseLayer* conv44 = convBnSilu(network, weightMap, *conv43->getOutput(0), 384, 3, 1, 1, \"model.44\");\n    IElementWiseLayer* conv45 = convBnSilu(network, weightMap, *conv44->getOutput(0), 384, 3, 1, 1, \"model.45\");\n    IElementWiseLayer* conv46 = convBnSilu(network, weightMap, *conv45->getOutput(0), 384, 3, 1, 1, \"model.46\");\n    IElementWiseLayer* conv47 = convBnSilu(network, weightMap, *conv46->getOutput(0), 384, 3, 1, 1, \"model.47\");\n    IElementWiseLayer* conv48 = convBnSilu(network, weightMap, *conv47->getOutput(0), 384, 3, 1, 1, \"model.48\");\n    IElementWiseLayer* conv49 = convBnSilu(network, weightMap, *conv48->getOutput(0), 384, 3, 1, 1, \"model.49\");\n    IElementWiseLayer* conv50 = convBnSilu(network, weightMap, *conv49->getOutput(0), 384, 3, 1, 1, \"model.50\");\n    IElementWiseLayer* conv51 = convBnSilu(network, weightMap, *conv50->getOutput(0), 384, 3, 1, 1, \"model.51\");\n\n    ITensor* input_tensor_52[] = { conv51->getOutput(0), conv49->getOutput(0),conv47->getOutput(0), conv45->getOutput(0),\n        conv43->getOutput(0),conv42->getOutput(0) };\n    IConcatenationLayer* concat52 = network->addConcatenation(input_tensor_52, 6);\n    IElementWiseLayer* conv53 = convBnSilu(network, weightMap, *concat52->getOutput(0), 1152, 1, 1, 0, \"model.53\");\n\n    auto conv54 = DownC(network, 
weightMap, *conv53->getOutput(0), 1152, 1536, \"model.54\");//=====\n    IElementWiseLayer* conv55 = convBnSilu(network, weightMap, *conv54->getOutput(0), 512, 1, 1, 0, \"model.55\");\n    IElementWiseLayer* conv56 = convBnSilu(network, weightMap, *conv54->getOutput(0), 512, 1, 1, 0, \"model.56\");\n\n    IElementWiseLayer* conv57 = convBnSilu(network, weightMap, *conv56->getOutput(0), 512, 3, 1, 1, \"model.57\");\n    IElementWiseLayer* conv58 = convBnSilu(network, weightMap, *conv57->getOutput(0), 512, 3, 1, 1, \"model.58\");\n    IElementWiseLayer* conv59 = convBnSilu(network, weightMap, *conv58->getOutput(0), 512, 3, 1, 1, \"model.59\");\n    IElementWiseLayer* conv60 = convBnSilu(network, weightMap, *conv59->getOutput(0), 512, 3, 1, 1, \"model.60\");\n    IElementWiseLayer* conv61 = convBnSilu(network, weightMap, *conv60->getOutput(0), 512, 3, 1, 1, \"model.61\");\n    IElementWiseLayer* conv62 = convBnSilu(network, weightMap, *conv61->getOutput(0), 512, 3, 1, 1, \"model.62\");\n    IElementWiseLayer* conv63 = convBnSilu(network, weightMap, *conv62->getOutput(0), 512, 3, 1, 1, \"model.63\");\n    IElementWiseLayer* conv64 = convBnSilu(network, weightMap, *conv63->getOutput(0), 512, 3, 1, 1, \"model.64\");\n    ITensor* input_tensor_65[] = { conv64->getOutput(0), conv62->getOutput(0),conv60->getOutput(0), conv58->getOutput(0),\n        conv56->getOutput(0),conv55->getOutput(0) };\n    IConcatenationLayer* concat65 = network->addConcatenation(input_tensor_65, 6);\n    IElementWiseLayer* conv66 = convBnSilu(network, weightMap, *concat65->getOutput(0), 1536, 1, 1, 0, \"model.66\");\n\n    //------------------------yolov7d6 head-------------------------------\n    auto conv67 = SPPCSPC(network, weightMap, *conv66->getOutput(0), 768, \"model.67\");\n    IElementWiseLayer* conv68 = convBnSilu(network, weightMap, *conv67->getOutput(0), 576, 1, 1, 0, \"model.68\");\n\n\n    float scale[] = { 1.0, 2.0, 2.0 };\n    IResizeLayer* re69 = 
network->addResize(*conv68->getOutput(0));\n    re69->setResizeMode(ResizeMode::kNEAREST);\n    re69->setScales(scale, 3);\n\n    IElementWiseLayer* conv70 = convBnSilu(network, weightMap, *conv53->getOutput(0), 576, 1, 1, 0, \"model.70\");\n    ITensor* input_tensor_71[] = { conv70->getOutput(0), re69->getOutput(0) };\n    IConcatenationLayer* concat71 = network->addConcatenation(input_tensor_71, 2);\n    IElementWiseLayer* conv72 = convBnSilu(network, weightMap, *concat71->getOutput(0), 384, 1, 1, 0, \"model.72\");\n    IElementWiseLayer* conv73 = convBnSilu(network, weightMap, *concat71->getOutput(0), 384, 1, 1, 0, \"model.73\");\n\n    IElementWiseLayer* conv74 = convBnSilu(network, weightMap, *conv73->getOutput(0), 192, 3, 1, 1, \"model.74\");\n    IElementWiseLayer* conv75 = convBnSilu(network, weightMap, *conv74->getOutput(0), 192, 3, 1, 1, \"model.75\");\n    IElementWiseLayer* conv76 = convBnSilu(network, weightMap, *conv75->getOutput(0), 192, 3, 1, 1, \"model.76\");\n    IElementWiseLayer* conv77 = convBnSilu(network, weightMap, *conv76->getOutput(0), 192, 3, 1, 1, \"model.77\");\n    IElementWiseLayer* conv78 = convBnSilu(network, weightMap, *conv77->getOutput(0), 192, 3, 1, 1, \"model.78\");\n    IElementWiseLayer* conv79 = convBnSilu(network, weightMap, *conv78->getOutput(0), 192, 3, 1, 1, \"model.79\");\n    IElementWiseLayer* conv80 = convBnSilu(network, weightMap, *conv79->getOutput(0), 192, 3, 1, 1, \"model.80\");\n    IElementWiseLayer* conv81 = convBnSilu(network, weightMap, *conv80->getOutput(0), 192, 3, 1, 1, \"model.81\");\n\n    ITensor* input_tensor_82[] = { conv81->getOutput(0), conv80->getOutput(0),conv79->getOutput(0), conv78->getOutput(0),\n        conv77->getOutput(0), conv76->getOutput(0), conv75->getOutput(0), conv74->getOutput(0), conv73->getOutput(0),\n        conv72->getOutput(0) };\n    IConcatenationLayer* concat82 = network->addConcatenation(input_tensor_82, 10);\n    IElementWiseLayer* conv83 = convBnSilu(network, weightMap, 
*concat82->getOutput(0), 576, 1, 1, 0, \"model.83\");\n\n    IElementWiseLayer* conv84 = convBnSilu(network, weightMap, *conv83->getOutput(0), 384, 1, 1, 0, \"model.84\");\n    IResizeLayer* re85 = network->addResize(*conv84->getOutput(0));\n    re85->setResizeMode(ResizeMode::kNEAREST);\n    re85->setScales(scale, 3);\n    IElementWiseLayer* conv86 = convBnSilu(network, weightMap, *conv40->getOutput(0), 384, 1, 1, 0, \"model.86\");\n    ITensor* input_tensor_87[] = { conv86->getOutput(0), re85->getOutput(0) };\n    IConcatenationLayer* concat87 = network->addConcatenation(input_tensor_87, 2);\n\n    IElementWiseLayer* conv88 = convBnSilu(network, weightMap, *concat87->getOutput(0), 256, 1, 1, 0, \"model.88\");\n    IElementWiseLayer* conv89 = convBnSilu(network, weightMap, *concat87->getOutput(0), 256, 1, 1, 0, \"model.89\");\n\n    IElementWiseLayer* conv90 = convBnSilu(network, weightMap, *conv89->getOutput(0), 128, 3, 1, 1, \"model.90\");\n    IElementWiseLayer* conv91 = convBnSilu(network, weightMap, *conv90->getOutput(0), 128, 3, 1, 1, \"model.91\");\n    IElementWiseLayer* conv92 = convBnSilu(network, weightMap, *conv91->getOutput(0), 128, 3, 1, 1, \"model.92\");\n    IElementWiseLayer* conv93 = convBnSilu(network, weightMap, *conv92->getOutput(0), 128, 3, 1, 1, \"model.93\");\n    IElementWiseLayer* conv94 = convBnSilu(network, weightMap, *conv93->getOutput(0), 128, 3, 1, 1, \"model.94\");\n    IElementWiseLayer* conv95 = convBnSilu(network, weightMap, *conv94->getOutput(0), 128, 3, 1, 1, \"model.95\");\n    IElementWiseLayer* conv96 = convBnSilu(network, weightMap, *conv95->getOutput(0), 128, 3, 1, 1, \"model.96\");\n    IElementWiseLayer* conv97 = convBnSilu(network, weightMap, *conv96->getOutput(0), 128, 3, 1, 1, \"model.97\");\n\n    ITensor* input_tensor_98[] = { conv97->getOutput(0), conv96->getOutput(0),conv95->getOutput(0), conv94->getOutput(0),\n        conv93->getOutput(0), conv92->getOutput(0), conv91->getOutput(0), 
conv90->getOutput(0),conv89->getOutput(0), \n        conv88->getOutput(0) };\n    IConcatenationLayer* concat98 = network->addConcatenation(input_tensor_98, 10);\n\n    IElementWiseLayer* conv99 = convBnSilu(network, weightMap, *concat98->getOutput(0), 384, 1, 1, 0, \"model.99\");\n\n    IElementWiseLayer* conv100 = convBnSilu(network, weightMap, *conv99->getOutput(0), 192, 1, 1, 0, \"model.100\");\n    IResizeLayer* re101 = network->addResize(*conv100->getOutput(0));\n    re101->setResizeMode(ResizeMode::kNEAREST);\n    re101->setScales(scale, 3);\n    IElementWiseLayer* conv102 = convBnSilu(network, weightMap, *conv27->getOutput(0), 192, 1, 1, 0, \"model.102\");\n    ITensor* input_tensor_103[] = { conv102->getOutput(0), re101->getOutput(0) };\n    IConcatenationLayer* concat103 = network->addConcatenation(input_tensor_103, 2);\n\n    IElementWiseLayer* conv104 = convBnSilu(network, weightMap, *concat103->getOutput(0), 128, 1, 1, 0, \"model.104\");\n    IElementWiseLayer* conv105 = convBnSilu(network, weightMap, *concat103->getOutput(0), 128, 1, 1, 0, \"model.105\");\n    IElementWiseLayer* conv106 = convBnSilu(network, weightMap, *conv105->getOutput(0), 64, 3, 1, 1, \"model.106\");\n    IElementWiseLayer* conv107 = convBnSilu(network, weightMap, *conv106->getOutput(0), 64, 3, 1, 1, \"model.107\");\n    IElementWiseLayer* conv108 = convBnSilu(network, weightMap, *conv107->getOutput(0), 64, 3, 1, 1, \"model.108\");\n    IElementWiseLayer* conv109 = convBnSilu(network, weightMap, *conv108->getOutput(0), 64, 3, 1, 1, \"model.109\");\n    IElementWiseLayer* conv110 = convBnSilu(network, weightMap, *conv109->getOutput(0), 64, 3, 1, 1, \"model.110\");\n    IElementWiseLayer* conv111 = convBnSilu(network, weightMap, *conv110->getOutput(0), 64, 3, 1, 1, \"model.111\");\n    IElementWiseLayer* conv112 = convBnSilu(network, weightMap, *conv111->getOutput(0), 64, 3, 1, 1, \"model.112\");\n    IElementWiseLayer* conv113 = convBnSilu(network, weightMap, 
*conv112->getOutput(0), 64, 3, 1, 1, \"model.113\");\n\n    ITensor* input_tensor_114[] = { conv113->getOutput(0), conv112->getOutput(0),conv111->getOutput(0), conv110->getOutput(0),\n       conv109->getOutput(0), conv108->getOutput(0), conv107->getOutput(0), conv106->getOutput(0), conv105->getOutput(0),\n        conv104->getOutput(0) };\n    IConcatenationLayer* concat114 = network->addConcatenation(input_tensor_114, 10);\n\n    IElementWiseLayer* conv115 = convBnSilu(network, weightMap, *concat114->getOutput(0), 192, 1, 1, 0, \"model.115\");\n\n    auto conv116 = DownC(network, weightMap, *conv115->getOutput(0), 192, 384, \"model.116\");\n    ITensor* input_tensor_117[] = { conv116->getOutput(0), conv99->getOutput(0) };\n    IConcatenationLayer* concat117 = network->addConcatenation(input_tensor_117, 2);\n\n    IElementWiseLayer* conv118 = convBnSilu(network, weightMap, *concat117->getOutput(0), 256, 1, 1, 0, \"model.118\");\n    IElementWiseLayer* conv119 = convBnSilu(network, weightMap, *concat117->getOutput(0), 256, 1, 1, 0, \"model.119\");\n\n    IElementWiseLayer* conv120 = convBnSilu(network, weightMap, *conv119->getOutput(0), 128, 3, 1, 1, \"model.120\");\n    IElementWiseLayer* conv121 = convBnSilu(network, weightMap, *conv120->getOutput(0), 128, 3, 1, 1, \"model.121\");\n    IElementWiseLayer* conv122 = convBnSilu(network, weightMap, *conv121->getOutput(0), 128, 3, 1, 1, \"model.122\");\n    IElementWiseLayer* conv123 = convBnSilu(network, weightMap, *conv122->getOutput(0), 128, 3, 1, 1, \"model.123\");\n    IElementWiseLayer* conv124 = convBnSilu(network, weightMap, *conv123->getOutput(0), 128, 3, 1, 1, \"model.124\");\n    IElementWiseLayer* conv125 = convBnSilu(network, weightMap, *conv124->getOutput(0), 128, 3, 1, 1, \"model.125\");\n    IElementWiseLayer* conv126 = convBnSilu(network, weightMap, *conv125->getOutput(0), 128, 3, 1, 1, \"model.126\");\n    IElementWiseLayer* conv127 = convBnSilu(network, weightMap, *conv126->getOutput(0), 128, 3, 1, 1, 
\"model.127\");\n\n    ITensor* input_tensor_128[] = { conv127->getOutput(0), conv126->getOutput(0),conv125->getOutput(0), conv124->getOutput(0),\n       conv123->getOutput(0), conv122->getOutput(0), conv121->getOutput(0), conv120->getOutput(0), conv119->getOutput(0),\n       conv118->getOutput(0) };\n    IConcatenationLayer* concat128 = network->addConcatenation(input_tensor_128, 10);\n    IElementWiseLayer* conv129 = convBnSilu(network, weightMap, *concat128->getOutput(0), 384, 1, 1, 0, \"model.129\");\n\n    auto conv130 = DownC(network, weightMap, *conv129->getOutput(0), 384, 576, \"model.130\");\n    ITensor* input_tensor_131[] = { conv130->getOutput(0), conv83->getOutput(0) };\n    IConcatenationLayer* concat131 = network->addConcatenation(input_tensor_131, 2);\n\n    IElementWiseLayer* conv132 = convBnSilu(network, weightMap, *concat131->getOutput(0), 384, 1, 1, 0, \"model.132\");\n    IElementWiseLayer* conv133 = convBnSilu(network, weightMap, *concat131->getOutput(0), 384, 1, 1, 0, \"model.133\");\n\n    IElementWiseLayer* conv134 = convBnSilu(network, weightMap, *conv133->getOutput(0), 192, 3, 1, 1, \"model.134\");\n    IElementWiseLayer* conv135 = convBnSilu(network, weightMap, *conv134->getOutput(0), 192, 3, 1, 1, \"model.135\");\n    IElementWiseLayer* conv136 = convBnSilu(network, weightMap, *conv135->getOutput(0), 192, 3, 1, 1, \"model.136\");\n    IElementWiseLayer* conv137 = convBnSilu(network, weightMap, *conv136->getOutput(0), 192, 3, 1, 1, \"model.137\");\n    IElementWiseLayer* conv138 = convBnSilu(network, weightMap, *conv137->getOutput(0), 192, 3, 1, 1, \"model.138\");\n    IElementWiseLayer* conv139 = convBnSilu(network, weightMap, *conv138->getOutput(0), 192, 3, 1, 1, \"model.139\");\n    IElementWiseLayer* conv140 = convBnSilu(network, weightMap, *conv139->getOutput(0), 192, 3, 1, 1, \"model.140\");\n    IElementWiseLayer* conv141 = convBnSilu(network, weightMap, *conv140->getOutput(0), 192, 3, 1, 1, \"model.141\");\n    ITensor* 
input_tensor_142[] = { conv141->getOutput(0), conv140->getOutput(0),conv139->getOutput(0), conv138->getOutput(0),\n      conv137->getOutput(0), conv136->getOutput(0), conv135->getOutput(0), conv134->getOutput(0), conv133->getOutput(0), \n        conv132->getOutput(0) };\n    IConcatenationLayer* concat142 = network->addConcatenation(input_tensor_142, 10);\n    IElementWiseLayer* conv143 = convBnSilu(network, weightMap, *concat142->getOutput(0), 576, 1, 1, 0, \"model.143\");\n\n    auto conv144 = DownC(network, weightMap, *conv143->getOutput(0), 576, 768, \"model.144\");\n    ITensor* input_tensor_145[] = { conv144->getOutput(0), conv67->getOutput(0) };\n    IConcatenationLayer* concat145 = network->addConcatenation(input_tensor_145, 2);\n\n    IElementWiseLayer* conv146 = convBnSilu(network, weightMap, *concat145->getOutput(0), 512, 1, 1, 0, \"model.146\");\n    IElementWiseLayer* conv147 = convBnSilu(network, weightMap, *concat145->getOutput(0), 512, 1, 1, 0, \"model.147\");\n\n    IElementWiseLayer* conv148 = convBnSilu(network, weightMap, *conv147->getOutput(0), 256, 3, 1, 1, \"model.148\");\n    IElementWiseLayer* conv149 = convBnSilu(network, weightMap, *conv148->getOutput(0), 256, 3, 1, 1, \"model.149\");\n    IElementWiseLayer* conv150 = convBnSilu(network, weightMap, *conv149->getOutput(0), 256, 3, 1, 1, \"model.150\");\n    IElementWiseLayer* conv151 = convBnSilu(network, weightMap, *conv150->getOutput(0), 256, 3, 1, 1, \"model.151\");\n    IElementWiseLayer* conv152 = convBnSilu(network, weightMap, *conv151->getOutput(0), 256, 3, 1, 1, \"model.152\");\n    IElementWiseLayer* conv153 = convBnSilu(network, weightMap, *conv152->getOutput(0), 256, 3, 1, 1, \"model.153\");\n    IElementWiseLayer* conv154 = convBnSilu(network, weightMap, *conv153->getOutput(0), 256, 3, 1, 1, \"model.154\");\n    IElementWiseLayer* conv155 = convBnSilu(network, weightMap, *conv154->getOutput(0), 256, 3, 1, 1, \"model.155\");\n    ITensor* input_tensor_156[] = { 
conv155->getOutput(0), conv154->getOutput(0),conv153->getOutput(0), conv152->getOutput(0),\n     conv151->getOutput(0), conv150->getOutput(0), conv149->getOutput(0), conv148->getOutput(0),conv147->getOutput(0),\n        conv146->getOutput(0) };\n    IConcatenationLayer* concat156 = network->addConcatenation(input_tensor_156, 10);\n    IElementWiseLayer* conv157 = convBnSilu(network, weightMap, *concat156->getOutput(0), 768, 1, 1, 0, \"model.157\");\n\n    IElementWiseLayer* conv158= convBnSilu(network, weightMap, *conv115->getOutput(0), 384, 3, 1, 1, \"model.158\");\n    IElementWiseLayer* conv159 = convBnSilu(network, weightMap, *conv129->getOutput(0), 768, 3, 1, 1, \"model.159\");\n    IElementWiseLayer* conv160 = convBnSilu(network, weightMap, *conv143->getOutput(0), 1152, 3, 1, 1, \"model.160\");\n    IElementWiseLayer* conv161 = convBnSilu(network, weightMap, *conv157->getOutput(0), 1536, 3, 1, 1, \"model.161\");\n\n\n\n    // out\n    IConvolutionLayer* cv105_0 = network->addConvolutionNd(*conv158->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.162.m.0.weight\"], weightMap[\"model.162.m.0.bias\"]);\n    assert(cv105_0);\n    cv105_0->setName(\"cv105.0\");\n    IConvolutionLayer* cv105_1 = network->addConvolutionNd(*conv159->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.162.m.1.weight\"], weightMap[\"model.162.m.1.bias\"]);\n    assert(cv105_1);\n    cv105_1->setName(\"cv105.1\");\n    IConvolutionLayer* cv105_2 = network->addConvolutionNd(*conv160->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.162.m.2.weight\"], weightMap[\"model.162.m.2.bias\"]);\n    assert(cv105_2);\n    cv105_2->setName(\"cv105.2\");\n    IConvolutionLayer* cv105_3 = network->addConvolutionNd(*conv161->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.162.m.3.weight\"], weightMap[\"model.162.m.3.bias\"]);\n    assert(cv105_3);\n    cv105_3->setName(\"cv105.3\");\n\n   
 /*------------detect-----------*/\n    auto yolo = addYoLoLayer(network, weightMap, \"model.162\", std::vector<IConvolutionLayer*>{cv105_0, cv105_1, cv105_2, cv105_3});\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\n\nIHostMemory* build_engine_yolov7e6(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, const std::string& wts_path) {\n    std::map<std::string, Weights> weightMap = loadWeights(wts_path);\n\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kInputH, kInputW });\n    assert(data);\n\n    /*----------------------------------yolov7e6 backbone-----------------------------------------*/\n    auto* conv0 = ReOrg(network, weightMap, *data, 3);\n\n    IElementWiseLayer* conv1 = convBnSilu(network, weightMap, *conv0->getOutput(0), 80, 3, 1, 1, \"model.1\");\n    auto conv2 = 
DownC(network, weightMap, *conv1->getOutput(0), 80, 160, \"model.2\");\n\n    IElementWiseLayer* conv3 = convBnSilu(network, weightMap, *conv2->getOutput(0), 64, 1, 1, 0, \"model.3\");\n    IElementWiseLayer* conv4 = convBnSilu(network, weightMap, *conv2->getOutput(0), 64, 1, 1, 0, \"model.4\");\n\n    IElementWiseLayer* conv5 = convBnSilu(network, weightMap, *conv4->getOutput(0), 64, 3, 1, 1, \"model.5\");\n    IElementWiseLayer* conv6 = convBnSilu(network, weightMap, *conv5->getOutput(0), 64, 3, 1, 1, \"model.6\");\n    IElementWiseLayer* conv7 = convBnSilu(network, weightMap, *conv6->getOutput(0), 64, 3, 1, 1, \"model.7\");\n    IElementWiseLayer* conv8 = convBnSilu(network, weightMap, *conv7->getOutput(0), 64, 3, 1, 1, \"model.8\");\n    IElementWiseLayer* conv9 = convBnSilu(network, weightMap, *conv8->getOutput(0), 64, 3, 1, 1, \"model.9\");\n    IElementWiseLayer* conv10 = convBnSilu(network, weightMap, *conv9->getOutput(0), 64, 3, 1, 1, \"model.10\");\n\n    ITensor* input_tensor_11[] = { conv10->getOutput(0), conv8->getOutput(0),conv6->getOutput(0), conv4->getOutput(0),conv3->getOutput(0) };\n    IConcatenationLayer* concat11 = network->addConcatenation(input_tensor_11, 5);\n\n    IElementWiseLayer* conv12 = convBnSilu(network, weightMap, *concat11->getOutput(0), 160, 1, 1, 0, \"model.12\");\n\n\n    auto conv13 = DownC(network, weightMap, *conv12->getOutput(0), 160, 320, \"model.13\");\n    IElementWiseLayer* conv14 = convBnSilu(network, weightMap, *conv13->getOutput(0), 128, 1, 1, 0, \"model.14\");\n    IElementWiseLayer* conv15 = convBnSilu(network, weightMap, *conv13->getOutput(0), 128, 1, 1, 0, \"model.15\");\n    \n    IElementWiseLayer* conv16 = convBnSilu(network, weightMap, *conv15->getOutput(0), 128, 3, 1, 1, \"model.16\");\n    IElementWiseLayer* conv17 = convBnSilu(network, weightMap, *conv16->getOutput(0), 128, 3, 1, 1, \"model.17\");\n    IElementWiseLayer* conv18 = convBnSilu(network, weightMap, *conv17->getOutput(0), 128, 3, 1, 1, 
\"model.18\");\n    IElementWiseLayer* conv19 = convBnSilu(network, weightMap, *conv18->getOutput(0), 128, 3, 1, 1, \"model.19\");\n    IElementWiseLayer* conv20 = convBnSilu(network, weightMap, *conv19->getOutput(0), 128, 3, 1, 1, \"model.20\");\n    IElementWiseLayer* conv21 = convBnSilu(network, weightMap, *conv20->getOutput(0), 128, 3, 1, 1, \"model.21\");\n    ITensor* input_tensor_22[] = { conv21->getOutput(0), conv19->getOutput(0),conv17->getOutput(0), conv15->getOutput(0),conv14->getOutput(0) };\n    IConcatenationLayer* concat22 = network->addConcatenation(input_tensor_22, 5);\n\n    IElementWiseLayer* conv23 = convBnSilu(network, weightMap, *concat22->getOutput(0), 320, 1, 1, 0, \"model.23\");\n\n\n    auto conv24 = DownC(network, weightMap, *conv23->getOutput(0), 320, 640, \"model.24\");\n    IElementWiseLayer* conv25 = convBnSilu(network, weightMap, *conv24->getOutput(0), 256, 1, 1, 0, \"model.25\");\n    IElementWiseLayer* conv26 = convBnSilu(network, weightMap, *conv24->getOutput(0), 256, 1, 1, 0, \"model.26\");\n\n    IElementWiseLayer* conv27 = convBnSilu(network, weightMap, *conv26->getOutput(0), 256, 3, 1, 1, \"model.27\");\n    IElementWiseLayer* conv28 = convBnSilu(network, weightMap, *conv27->getOutput(0), 256, 3, 1, 1, \"model.28\");\n    IElementWiseLayer* conv29 = convBnSilu(network, weightMap, *conv28->getOutput(0), 256, 3, 1, 1, \"model.29\");\n    IElementWiseLayer* conv30 = convBnSilu(network, weightMap, *conv29->getOutput(0), 256, 3, 1, 1, \"model.30\");\n    IElementWiseLayer* conv31 = convBnSilu(network, weightMap, *conv30->getOutput(0), 256, 3, 1, 1, \"model.31\");\n    IElementWiseLayer* conv32 = convBnSilu(network, weightMap, *conv31->getOutput(0), 256, 3, 1, 1, \"model.32\");\n    ITensor* input_tensor_33[] = { conv32->getOutput(0), conv30->getOutput(0),conv28->getOutput(0), conv26->getOutput(0),conv25->getOutput(0) };\n    IConcatenationLayer* concat33 = network->addConcatenation(input_tensor_33, 5);\n\n    IElementWiseLayer* 
conv34 = convBnSilu(network, weightMap, *concat33->getOutput(0), 640, 1, 1, 0, \"model.34\");\n    auto conv35 = DownC(network, weightMap, *conv34->getOutput(0), 640, 960, \"model.35\");\n    IElementWiseLayer* conv36 = convBnSilu(network, weightMap, *conv35->getOutput(0), 384, 1, 1, 0, \"model.36\");\n    IElementWiseLayer* conv37 = convBnSilu(network, weightMap, *conv35->getOutput(0), 384, 1, 1, 0, \"model.37\");\n\n    IElementWiseLayer* conv38 = convBnSilu(network, weightMap, *conv37->getOutput(0), 384, 3, 1, 1, \"model.38\");\n    IElementWiseLayer* conv39 = convBnSilu(network, weightMap, *conv38->getOutput(0), 384, 3, 1, 1, \"model.39\");\n    IElementWiseLayer* conv40 = convBnSilu(network, weightMap, *conv39->getOutput(0), 384, 3, 1, 1, \"model.40\");\n    IElementWiseLayer* conv41 = convBnSilu(network, weightMap, *conv40->getOutput(0), 384, 3, 1, 1, \"model.41\");\n    IElementWiseLayer* conv42 = convBnSilu(network, weightMap, *conv41->getOutput(0), 384, 3, 1, 1, \"model.42\");\n    IElementWiseLayer* conv43 = convBnSilu(network, weightMap, *conv42->getOutput(0), 384, 3, 1, 1, \"model.43\");\n\n    ITensor* input_tensor_44[] = { conv43->getOutput(0), conv41->getOutput(0),conv39->getOutput(0), conv37->getOutput(0),conv36->getOutput(0) };\n    IConcatenationLayer* concat44 = network->addConcatenation(input_tensor_44, 5);\n    IElementWiseLayer* conv45 = convBnSilu(network, weightMap, *concat44->getOutput(0), 960, 1, 1, 0, \"model.45\");\n\n    auto conv46 = DownC(network, weightMap, *conv45->getOutput(0), 960, 1280, \"model.46\");\n    IElementWiseLayer* conv47 = convBnSilu(network, weightMap, *conv46->getOutput(0), 512, 1, 1, 0, \"model.47\");\n    IElementWiseLayer* conv48 = convBnSilu(network, weightMap, *conv46->getOutput(0), 512, 1, 1, 0, \"model.48\");\n\n    IElementWiseLayer* conv49 = convBnSilu(network, weightMap, *conv48->getOutput(0), 512, 3, 1, 1, \"model.49\");\n    IElementWiseLayer* conv50 = convBnSilu(network, weightMap, *conv49->getOutput(0), 
512, 3, 1, 1, \"model.50\");\n    IElementWiseLayer* conv51 = convBnSilu(network, weightMap, *conv50->getOutput(0), 512, 3, 1, 1, \"model.51\");\n    IElementWiseLayer* conv52 = convBnSilu(network, weightMap, *conv51->getOutput(0), 512, 3, 1, 1, \"model.52\");\n    IElementWiseLayer* conv53 = convBnSilu(network, weightMap, *conv52->getOutput(0), 512, 3, 1, 1, \"model.53\");\n    IElementWiseLayer* conv54 = convBnSilu(network, weightMap, *conv53->getOutput(0), 512, 3, 1, 1, \"model.54\");\n    ITensor* input_tensor_55[] = { conv54->getOutput(0), conv52->getOutput(0),conv50->getOutput(0), conv48->getOutput(0),conv47->getOutput(0) };\n    IConcatenationLayer* concat55 = network->addConcatenation(input_tensor_55, 5);\n    IElementWiseLayer* conv56 = convBnSilu(network, weightMap, *concat55->getOutput(0), 1280, 1, 1, 0, \"model.56\");\n\n    //------------------------yolov7e6 head-------------------------------\n    auto conv57 = SPPCSPC(network, weightMap, *conv56->getOutput(0), 640, \"model.57\");\n    IElementWiseLayer* conv58 = convBnSilu(network, weightMap, *conv57->getOutput(0), 480, 1, 1, 0, \"model.58\");\n\n\n    float scale[] = { 1.0, 2.0, 2.0 };\n    IResizeLayer* re59 = network->addResize(*conv58->getOutput(0));\n    re59->setResizeMode(ResizeMode::kNEAREST);\n    re59->setScales(scale, 3);\n\n    IElementWiseLayer* conv60 = convBnSilu(network, weightMap, *conv45->getOutput(0), 480, 1, 1, 0, \"model.60\");\n    ITensor* input_tensor_61[] = { conv60->getOutput(0), re59->getOutput(0) };\n    IConcatenationLayer* concat61 = network->addConcatenation(input_tensor_61, 2);\n    IElementWiseLayer* conv62 = convBnSilu(network, weightMap, *concat61->getOutput(0), 384, 1, 1, 0, \"model.62\");\n    IElementWiseLayer* conv63 = convBnSilu(network, weightMap, *concat61->getOutput(0), 384, 1, 1, 0, \"model.63\");\n\n    IElementWiseLayer* conv64 = convBnSilu(network, weightMap, *conv63->getOutput(0), 192, 3, 1, 1, \"model.64\");\n    IElementWiseLayer* conv65 = 
convBnSilu(network, weightMap, *conv64->getOutput(0), 192, 3, 1, 1, \"model.65\");\n    IElementWiseLayer* conv66 = convBnSilu(network, weightMap, *conv65->getOutput(0), 192, 3, 1, 1, \"model.66\");\n    IElementWiseLayer* conv67 = convBnSilu(network, weightMap, *conv66->getOutput(0), 192, 3, 1, 1, \"model.67\");\n    IElementWiseLayer* conv68 = convBnSilu(network, weightMap, *conv67->getOutput(0), 192, 3, 1, 1, \"model.68\");\n    IElementWiseLayer* conv69 = convBnSilu(network, weightMap, *conv68->getOutput(0), 192, 3, 1, 1, \"model.69\");\n\n    ITensor* input_tensor_70[] = { conv69->getOutput(0), conv68->getOutput(0),conv67->getOutput(0), conv66->getOutput(0),\n        conv65->getOutput(0), conv64->getOutput(0), conv63->getOutput(0), conv62->getOutput(0) };\n    IConcatenationLayer* concat70 = network->addConcatenation(input_tensor_70, 8);\n    IElementWiseLayer* conv71 = convBnSilu(network, weightMap, *concat70->getOutput(0), 480, 1, 1, 0, \"model.71\");\n\n    IElementWiseLayer* conv72 = convBnSilu(network, weightMap, *conv71->getOutput(0), 320, 1, 1, 0, \"model.72\");\n    IResizeLayer* re73 = network->addResize(*conv72->getOutput(0));\n    re73->setResizeMode(ResizeMode::kNEAREST);\n    re73->setScales(scale, 3);\n    IElementWiseLayer* conv74 = convBnSilu(network, weightMap, *conv34->getOutput(0), 320, 1, 1, 0, \"model.74\");\n    ITensor* input_tensor_75[] = { conv74->getOutput(0), re73->getOutput(0) };\n    IConcatenationLayer* concat75 = network->addConcatenation(input_tensor_75, 2);\n\n    IElementWiseLayer* conv76 = convBnSilu(network, weightMap, *concat75->getOutput(0), 256, 1, 1, 0, \"model.76\");\n    IElementWiseLayer* conv77 = convBnSilu(network, weightMap, *concat75->getOutput(0), 256, 1, 1, 0, \"model.77\");\n\n    IElementWiseLayer* conv78 = convBnSilu(network, weightMap, *conv77->getOutput(0), 128, 3, 1, 1, \"model.78\");\n    IElementWiseLayer* conv79 = convBnSilu(network, weightMap, *conv78->getOutput(0), 128, 3, 1, 1, \"model.79\");\n    
IElementWiseLayer* conv80 = convBnSilu(network, weightMap, *conv79->getOutput(0), 128, 3, 1, 1, \"model.80\");\n    IElementWiseLayer* conv81 = convBnSilu(network, weightMap, *conv80->getOutput(0), 128, 3, 1, 1, \"model.81\");\n    IElementWiseLayer* conv82 = convBnSilu(network, weightMap, *conv81->getOutput(0), 128, 3, 1, 1, \"model.82\");\n    IElementWiseLayer* conv83 = convBnSilu(network, weightMap, *conv82->getOutput(0), 128, 3, 1, 1, \"model.83\");\n\n    ITensor* input_tensor_84[] = { conv83->getOutput(0), conv82->getOutput(0),conv81->getOutput(0), conv80->getOutput(0),\n        conv79->getOutput(0), conv78->getOutput(0), conv77->getOutput(0), conv76->getOutput(0) };\n    IConcatenationLayer* concat84 = network->addConcatenation(input_tensor_84, 8);\n\n    IElementWiseLayer* conv85 = convBnSilu(network, weightMap, *concat84->getOutput(0), 320, 1, 1, 0, \"model.85\");\n\n    IElementWiseLayer* conv86 = convBnSilu(network, weightMap, *conv85->getOutput(0), 160, 1, 1, 0, \"model.86\");\n    IResizeLayer* re87 = network->addResize(*conv86->getOutput(0));\n    re87->setResizeMode(ResizeMode::kNEAREST);\n    re87->setScales(scale, 3);\n    IElementWiseLayer* conv88 = convBnSilu(network, weightMap, *conv23->getOutput(0), 160, 1, 1, 0, \"model.88\");\n    ITensor* input_tensor_89[] = { conv88->getOutput(0), re87->getOutput(0) };\n    IConcatenationLayer* concat89 = network->addConcatenation(input_tensor_89, 2);\n\n    IElementWiseLayer* conv90 = convBnSilu(network, weightMap, *concat89->getOutput(0), 128, 1, 1, 0, \"model.90\");\n    IElementWiseLayer* conv91 = convBnSilu(network, weightMap, *concat89->getOutput(0), 128, 1, 1, 0, \"model.91\");\n    IElementWiseLayer* conv92 = convBnSilu(network, weightMap, *conv91->getOutput(0), 64, 3, 1, 1, \"model.92\");\n    IElementWiseLayer* conv93 = convBnSilu(network, weightMap, *conv92->getOutput(0), 64, 3, 1, 1, \"model.93\");\n    IElementWiseLayer* conv94 = convBnSilu(network, weightMap, *conv93->getOutput(0), 64, 3, 1, 
1, \"model.94\");\n    IElementWiseLayer* conv95 = convBnSilu(network, weightMap, *conv94->getOutput(0), 64, 3, 1, 1, \"model.95\");\n    IElementWiseLayer* conv96 = convBnSilu(network, weightMap, *conv95->getOutput(0), 64, 3, 1, 1, \"model.96\");\n    IElementWiseLayer* conv97 = convBnSilu(network, weightMap, *conv96->getOutput(0), 64, 3, 1, 1, \"model.97\");\n\n    ITensor* input_tensor_98[] = { conv97->getOutput(0), conv96->getOutput(0),conv95->getOutput(0), conv94->getOutput(0),\n       conv93->getOutput(0), conv92->getOutput(0), conv91->getOutput(0), conv90->getOutput(0) };\n    IConcatenationLayer* concat98 = network->addConcatenation(input_tensor_98, 8);\n\n    IElementWiseLayer* conv99 = convBnSilu(network, weightMap, *concat98->getOutput(0), 160, 1, 1, 0, \"model.99\");\n\n    auto conv100 = DownC(network, weightMap, *conv99->getOutput(0), 160, 320, \"model.100\");\n    ITensor* input_tensor_101[] = { conv100->getOutput(0), conv85->getOutput(0) };\n    IConcatenationLayer* concat101 = network->addConcatenation(input_tensor_101, 2);\n\n    IElementWiseLayer* conv102 = convBnSilu(network, weightMap, *concat101->getOutput(0), 256, 1, 1, 0, \"model.102\");\n    IElementWiseLayer* conv103 = convBnSilu(network, weightMap, *concat101->getOutput(0), 256, 1, 1, 0, \"model.103\");\n\n    IElementWiseLayer* conv104 = convBnSilu(network, weightMap, *conv103->getOutput(0), 128, 3, 1, 1, \"model.104\");\n    IElementWiseLayer* conv105 = convBnSilu(network, weightMap, *conv104->getOutput(0), 128, 3, 1, 1, \"model.105\");\n    IElementWiseLayer* conv106 = convBnSilu(network, weightMap, *conv105->getOutput(0), 128, 3, 1, 1, \"model.106\");\n    IElementWiseLayer* conv107 = convBnSilu(network, weightMap, *conv106->getOutput(0), 128, 3, 1, 1, \"model.107\");\n    IElementWiseLayer* conv108 = convBnSilu(network, weightMap, *conv107->getOutput(0), 128, 3, 1, 1, \"model.108\");\n    IElementWiseLayer* conv109 = convBnSilu(network, weightMap, *conv108->getOutput(0), 128, 3, 1, 
1, \"model.109\");\n\n    ITensor* input_tensor_110[] = { conv109->getOutput(0), conv108->getOutput(0),conv107->getOutput(0), conv106->getOutput(0),\n       conv105->getOutput(0), conv104->getOutput(0), conv103->getOutput(0), conv102->getOutput(0) };\n    IConcatenationLayer* concat110 = network->addConcatenation(input_tensor_110, 8);\n    IElementWiseLayer* conv111 = convBnSilu(network, weightMap, *concat110->getOutput(0), 320, 1, 1, 0, \"model.111\");\n\n    auto conv112 = DownC(network, weightMap, *conv111->getOutput(0), 320, 480, \"model.112\");\n    ITensor* input_tensor_113[] = { conv112->getOutput(0), conv71->getOutput(0) };\n    IConcatenationLayer* concat113 = network->addConcatenation(input_tensor_113, 2);\n\n    IElementWiseLayer* conv114 = convBnSilu(network, weightMap, *concat113->getOutput(0), 384, 1, 1, 0, \"model.114\");\n    IElementWiseLayer* conv115 = convBnSilu(network, weightMap, *concat113->getOutput(0), 384, 1, 1, 0, \"model.115\");\n\n    IElementWiseLayer* conv116 = convBnSilu(network, weightMap, *conv115->getOutput(0), 192, 3, 1, 1, \"model.116\");\n    IElementWiseLayer* conv117 = convBnSilu(network, weightMap, *conv116->getOutput(0), 192, 3, 1, 1, \"model.117\");\n    IElementWiseLayer* conv118 = convBnSilu(network, weightMap, *conv117->getOutput(0), 192, 3, 1, 1, \"model.118\");\n    IElementWiseLayer* conv119 = convBnSilu(network, weightMap, *conv118->getOutput(0), 192, 3, 1, 1, \"model.119\");\n    IElementWiseLayer* conv120 = convBnSilu(network, weightMap, *conv119->getOutput(0), 192, 3, 1, 1, \"model.120\");\n    IElementWiseLayer* conv121 = convBnSilu(network, weightMap, *conv120->getOutput(0), 192, 3, 1, 1, \"model.121\");\n    ITensor* input_tensor_122[] = { conv121->getOutput(0), conv120->getOutput(0),conv119->getOutput(0), conv118->getOutput(0),\n      conv117->getOutput(0), conv116->getOutput(0), conv115->getOutput(0), conv114->getOutput(0) };\n    IConcatenationLayer* concat122 = network->addConcatenation(input_tensor_122, 
8);\n    IElementWiseLayer* conv123 = convBnSilu(network, weightMap, *concat122->getOutput(0), 480, 1, 1, 0, \"model.123\");\n\n    auto conv124 = DownC(network, weightMap, *conv123->getOutput(0), 480, 640, \"model.124\");\n    ITensor* input_tensor_125[] = { conv124->getOutput(0), conv57->getOutput(0) };\n    IConcatenationLayer* concat125 = network->addConcatenation(input_tensor_125, 2);\n\n    IElementWiseLayer* conv126 = convBnSilu(network, weightMap, *concat125->getOutput(0), 512, 1, 1, 0, \"model.126\");\n    IElementWiseLayer* conv127 = convBnSilu(network, weightMap, *concat125->getOutput(0), 512, 1, 1, 0, \"model.127\");\n\n    IElementWiseLayer* conv128 = convBnSilu(network, weightMap, *conv127->getOutput(0), 256, 3, 1, 1, \"model.128\");\n    IElementWiseLayer* conv129 = convBnSilu(network, weightMap, *conv128->getOutput(0), 256, 3, 1, 1, \"model.129\");\n    IElementWiseLayer* conv130 = convBnSilu(network, weightMap, *conv129->getOutput(0), 256, 3, 1, 1, \"model.130\");\n    IElementWiseLayer* conv131 = convBnSilu(network, weightMap, *conv130->getOutput(0), 256, 3, 1, 1, \"model.131\");\n    IElementWiseLayer* conv132 = convBnSilu(network, weightMap, *conv131->getOutput(0), 256, 3, 1, 1, \"model.132\");\n    IElementWiseLayer* conv133 = convBnSilu(network, weightMap, *conv132->getOutput(0), 256, 3, 1, 1, \"model.133\");\n    ITensor* input_tensor_134[] = { conv133->getOutput(0), conv132->getOutput(0),conv131->getOutput(0), conv130->getOutput(0),\n     conv129->getOutput(0), conv128->getOutput(0), conv127->getOutput(0), conv126->getOutput(0) };\n    IConcatenationLayer* concat134 = network->addConcatenation(input_tensor_134, 8);\n    IElementWiseLayer* conv135 = convBnSilu(network, weightMap, *concat134->getOutput(0), 640, 1, 1, 0, \"model.135\");\n\n    IElementWiseLayer* conv136 = convBnSilu(network, weightMap, *conv99->getOutput(0), 320, 3, 1, 1, \"model.136\");\n    IElementWiseLayer* conv137 = convBnSilu(network, weightMap, *conv111->getOutput(0), 
640, 3, 1, 1, \"model.137\");\n    IElementWiseLayer* conv138 = convBnSilu(network, weightMap, *conv123->getOutput(0), 960, 3, 1, 1, \"model.138\");\n    IElementWiseLayer* conv139 = convBnSilu(network, weightMap, *conv135->getOutput(0), 1280, 3, 1, 1, \"model.139\");\n\n    // out\n    IConvolutionLayer* cv105_0 = network->addConvolutionNd(*conv136->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.140.m.0.weight\"], weightMap[\"model.140.m.0.bias\"]);\n    assert(cv105_0);\n    cv105_0->setName(\"cv105.0\");\n    IConvolutionLayer* cv105_1 = network->addConvolutionNd(*conv137->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.140.m.1.weight\"], weightMap[\"model.140.m.1.bias\"]);\n    assert(cv105_1);\n    cv105_1->setName(\"cv105.1\");\n    IConvolutionLayer* cv105_2 = network->addConvolutionNd(*conv138->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.140.m.2.weight\"], weightMap[\"model.140.m.2.bias\"]);\n    assert(cv105_2);\n    cv105_2->setName(\"cv105.2\");\n    IConvolutionLayer* cv105_3 = network->addConvolutionNd(*conv139->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.140.m.3.weight\"], weightMap[\"model.140.m.3.bias\"]);\n    assert(cv105_3);\n    cv105_3->setName(\"cv105.3\");\n\n    /*------------detect-----------*/\n    auto yolo = addYoLoLayer(network, weightMap, \"model.140\", std::vector<IConvolutionLayer*>{cv105_0, cv105_1, cv105_2, cv105_3});\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\n\nIHostMemory* build_engine_yolov7w6(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, const std::string& wts_path) {\n    std::map<std::string, Weights> weightMap = loadWeights(wts_path);\n\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kInputH, kInputW });\n    assert(data);\n\n    /*----------------------------------yolov7w6 backbone-----------------------------------------*/\n    auto* conv0 = ReOrg(network, weightMap, *data, 3);\n\n    IElementWiseLayer* conv1 = convBnSilu(network, weightMap, *conv0->getOutput(0), 64, 3, 1, 1, \"model.1\");\n\n    IElementWiseLayer* conv2 = convBnSilu(network, weightMap, *conv1->getOutput(0), 128, 3, 2, 1, \"model.2\");\n\n    IElementWiseLayer* conv3 = convBnSilu(network, weightMap, *conv2->getOutput(0), 64, 1, 1, 0, \"model.3\");\n    IElementWiseLayer* conv4 = convBnSilu(network, weightMap, *conv2->getOutput(0), 64, 1, 1, 0, \"model.4\");\n\n    IElementWiseLayer* conv5 = convBnSilu(network, weightMap, *conv4->getOutput(0), 64, 3, 1, 1, \"model.5\");\n    IElementWiseLayer* conv6 = convBnSilu(network, weightMap, *conv5->getOutput(0), 64, 3, 1, 1, \"model.6\");\n    IElementWiseLayer* 
conv7 = convBnSilu(network, weightMap, *conv6->getOutput(0), 64, 3, 1, 1, \"model.7\");\n    IElementWiseLayer* conv8 = convBnSilu(network, weightMap, *conv7->getOutput(0), 64, 3, 1, 1, \"model.8\");\n\n    ITensor* input_tensor_9[] = { conv8->getOutput(0), conv6->getOutput(0), conv4->getOutput(0), conv3->getOutput(0) };\n    IConcatenationLayer* concat9 = network->addConcatenation(input_tensor_9, 4);\n    concat9->setAxis(0);\n    IElementWiseLayer* conv10 = convBnSilu(network, weightMap, *concat9->getOutput(0), 128, 1, 1, 0, \"model.10\");\n\n    IElementWiseLayer* conv11 = convBnSilu(network, weightMap, *conv10->getOutput(0), 256, 3, 2, 1, \"model.11\");\n\n    IElementWiseLayer* conv12 = convBnSilu(network, weightMap, *conv11->getOutput(0), 128, 1, 1, 0, \"model.12\");\n    IElementWiseLayer* conv13 = convBnSilu(network, weightMap, *conv11->getOutput(0), 128, 1, 1, 0, \"model.13\");\n    IElementWiseLayer* conv14 = convBnSilu(network, weightMap, *conv13->getOutput(0), 128, 3, 1, 1, \"model.14\");\n    IElementWiseLayer* conv15 = convBnSilu(network, weightMap, *conv14->getOutput(0), 128, 3, 1, 1, \"model.15\");\n    IElementWiseLayer* conv16 = convBnSilu(network, weightMap, *conv15->getOutput(0), 128, 3, 1, 1, \"model.16\");\n    IElementWiseLayer* conv17 = convBnSilu(network, weightMap, *conv16->getOutput(0), 128, 3, 1, 1, \"model.17\");\n    ITensor* input_tensor_18[] = { conv17->getOutput(0), conv15->getOutput(0), conv13->getOutput(0), conv12->getOutput(0) };\n    IConcatenationLayer* concat18 = network->addConcatenation(input_tensor_18, 4);\n    concat18->setAxis(0);\n    IElementWiseLayer* conv19 = convBnSilu(network, weightMap, *concat18->getOutput(0), 256, 1, 1, 0, \"model.19\");\n\n    IElementWiseLayer* conv20 = convBnSilu(network, weightMap, *conv19->getOutput(0), 512, 3, 2, 1, \"model.20\");\n\n    IElementWiseLayer* conv21 = convBnSilu(network, weightMap, *conv20->getOutput(0), 256, 1, 1, 0, \"model.21\");\n    IElementWiseLayer* conv22 = 
convBnSilu(network, weightMap, *conv20->getOutput(0), 256, 1, 1, 0, \"model.22\");\n    IElementWiseLayer* conv23 = convBnSilu(network, weightMap, *conv22->getOutput(0), 256, 3, 1, 1, \"model.23\");\n    IElementWiseLayer* conv24 = convBnSilu(network, weightMap, *conv23->getOutput(0), 256, 3, 1, 1, \"model.24\");\n    IElementWiseLayer* conv25 = convBnSilu(network, weightMap, *conv24->getOutput(0), 256, 3, 1, 1, \"model.25\");\n    IElementWiseLayer* conv26 = convBnSilu(network, weightMap, *conv25->getOutput(0), 256, 3, 1, 1, \"model.26\");\n    ITensor* input_tensor_27[] = { conv26->getOutput(0), conv24->getOutput(0), conv22->getOutput(0), conv21->getOutput(0) };\n    IConcatenationLayer* concat27 = network->addConcatenation(input_tensor_27, 4);\n    concat27->setAxis(0);\n\n    IElementWiseLayer* conv28 = convBnSilu(network, weightMap, *concat27->getOutput(0), 512, 1, 1, 0, \"model.28\");\n\n    IElementWiseLayer* conv29 = convBnSilu(network, weightMap, *conv28->getOutput(0), 768, 3, 2, 1, \"model.29\");\n\n    IElementWiseLayer* conv30 = convBnSilu(network, weightMap, *conv29->getOutput(0), 384, 1, 1, 0, \"model.30\");\n    IElementWiseLayer* conv31 = convBnSilu(network, weightMap, *conv29->getOutput(0), 384, 1, 1, 0, \"model.31\");\n    IElementWiseLayer* conv32 = convBnSilu(network, weightMap, *conv31->getOutput(0), 384, 3, 1, 1, \"model.32\");\n    IElementWiseLayer* conv33 = convBnSilu(network, weightMap, *conv32->getOutput(0), 384, 3, 1, 1, \"model.33\");\n    IElementWiseLayer* conv34 = convBnSilu(network, weightMap, *conv33->getOutput(0), 384, 3, 1, 1, \"model.34\");\n    IElementWiseLayer* conv35 = convBnSilu(network, weightMap, *conv34->getOutput(0), 384, 3, 1, 1, \"model.35\");\n    ITensor* input_tensor_36[] = { conv35->getOutput(0), conv33->getOutput(0), conv31->getOutput(0), conv30->getOutput(0) };\n    IConcatenationLayer* concat36 = network->addConcatenation(input_tensor_36, 4);\n    IElementWiseLayer* conv37 = convBnSilu(network, weightMap, 
*concat36->getOutput(0), 768, 1, 1, 0, \"model.37\");\n\n    IElementWiseLayer* conv38 = convBnSilu(network, weightMap, *conv37->getOutput(0), 1024, 3, 2, 1, \"model.38\");\n\n    IElementWiseLayer* conv39 = convBnSilu(network, weightMap, *conv38->getOutput(0), 512, 1, 1, 0, \"model.39\");\n    IElementWiseLayer* conv40 = convBnSilu(network, weightMap, *conv38->getOutput(0), 512, 1, 1, 0, \"model.40\");\n    IElementWiseLayer* conv41 = convBnSilu(network, weightMap, *conv40->getOutput(0), 512, 3, 1, 1, \"model.41\");\n    IElementWiseLayer* conv42 = convBnSilu(network, weightMap, *conv41->getOutput(0), 512, 3, 1, 1, \"model.42\");\n    IElementWiseLayer* conv43 = convBnSilu(network, weightMap, *conv42->getOutput(0), 512, 3, 1, 1, \"model.43\");\n    IElementWiseLayer* conv44 = convBnSilu(network, weightMap, *conv43->getOutput(0), 512, 3, 1, 1, \"model.44\");\n    ITensor* input_tensor_45[] = { conv44->getOutput(0), conv42->getOutput(0), conv40->getOutput(0), conv39->getOutput(0) };\n    IConcatenationLayer* concat45 = network->addConcatenation(input_tensor_45, 4);\n    IElementWiseLayer* conv46 = convBnSilu(network, weightMap, *concat45->getOutput(0), 1024, 1, 1, 0, \"model.46\");\n\n    auto conv47 = SPPCSPC(network, weightMap, *conv46->getOutput(0), 512, \"model.47\");\n    IElementWiseLayer* conv48 = convBnSilu(network, weightMap, *conv47->getOutput(0), 384, 1, 1, 0, \"model.48\");\n\n    float scale[] = { 1.0, 2.0, 2.0 };\n    IResizeLayer* re49 = network->addResize(*conv48->getOutput(0));\n    re49->setResizeMode(ResizeMode::kNEAREST);\n    re49->setScales(scale, 3);\n\n    IElementWiseLayer* conv50 = convBnSilu(network, weightMap, *conv37->getOutput(0), 384, 1, 1, 0, \"model.50\");\n    ITensor* input_tensor_51[] = { conv50->getOutput(0), re49->getOutput(0) };\n    IConcatenationLayer* concat51 = network->addConcatenation(input_tensor_51, 2);\n\n    IElementWiseLayer* conv52 = convBnSilu(network, weightMap, *concat51->getOutput(0), 384, 1, 1, 0, 
\"model.52\");\n    IElementWiseLayer* conv53 = convBnSilu(network, weightMap, *concat51->getOutput(0), 384, 1, 1, 0, \"model.53\");\n    IElementWiseLayer* conv54 = convBnSilu(network, weightMap, *conv53->getOutput(0), 192, 3, 1, 1, \"model.54\");\n    IElementWiseLayer* conv55 = convBnSilu(network, weightMap, *conv54->getOutput(0), 192, 3, 1, 1, \"model.55\");\n    IElementWiseLayer* conv56 = convBnSilu(network, weightMap, *conv55->getOutput(0), 192, 3, 1, 1, \"model.56\");\n    IElementWiseLayer* conv57 = convBnSilu(network, weightMap, *conv56->getOutput(0), 192, 3, 1, 1, \"model.57\");\n\n    ITensor* input_tensor_58[] = { conv57->getOutput(0), conv56->getOutput(0), conv55->getOutput(0), conv54->getOutput(0), conv53->getOutput(0), conv52->getOutput(0) };\n    IConcatenationLayer* concat58 = network->addConcatenation(input_tensor_58, 6);\n\n    IElementWiseLayer* conv59 = convBnSilu(network, weightMap, *concat58->getOutput(0), 384, 1, 1, 0, \"model.59\");\n\n    IElementWiseLayer* conv60 = convBnSilu(network, weightMap, *conv59->getOutput(0), 256, 1, 1, 0, \"model.60\");\n    IResizeLayer* re61 = network->addResize(*conv60->getOutput(0));\n    re61->setResizeMode(ResizeMode::kNEAREST);\n    re61->setScales(scale, 3);\n    IElementWiseLayer* conv62 = convBnSilu(network, weightMap, *conv28->getOutput(0), 256, 1, 1, 0, \"model.62\");\n    ITensor* input_tensor_63[] = { conv62->getOutput(0), re61->getOutput(0) };\n    IConcatenationLayer* concat63 = network->addConcatenation(input_tensor_63, 2);\n\n    IElementWiseLayer* conv64 = convBnSilu(network, weightMap, *concat63->getOutput(0), 256, 1, 1, 0, \"model.64\");\n    IElementWiseLayer* conv65 = convBnSilu(network, weightMap, *concat63->getOutput(0), 256, 1, 1, 0, \"model.65\");\n    IElementWiseLayer* conv66 = convBnSilu(network, weightMap, *conv65->getOutput(0), 128, 3, 1, 1, \"model.66\");\n    IElementWiseLayer* conv67 = convBnSilu(network, weightMap, *conv66->getOutput(0), 128, 3, 1, 1, \"model.67\");\n    
IElementWiseLayer* conv68 = convBnSilu(network, weightMap, *conv67->getOutput(0), 128, 3, 1, 1, \"model.68\");\n    IElementWiseLayer* conv69 = convBnSilu(network, weightMap, *conv68->getOutput(0), 128, 3, 1, 1, \"model.69\");\n\n    ITensor* input_tensor_70[] = { conv69->getOutput(0), conv68->getOutput(0), conv67->getOutput(0), conv66->getOutput(0), conv65->getOutput(0), conv64->getOutput(0) };\n    IConcatenationLayer* concat70 = network->addConcatenation(input_tensor_70, 6);\n\n    IElementWiseLayer* conv71 = convBnSilu(network, weightMap, *concat70->getOutput(0), 256, 1, 1, 0, \"model.71\");\n    IElementWiseLayer* conv72 = convBnSilu(network, weightMap, *conv71->getOutput(0), 128, 1, 1, 0, \"model.72\");\n    IResizeLayer* re73 = network->addResize(*conv72->getOutput(0));\n    re73->setResizeMode(ResizeMode::kNEAREST);\n    re73->setScales(scale, 3);\n\n    IElementWiseLayer* conv74 = convBnSilu(network, weightMap, *conv19->getOutput(0), 128, 1, 1, 0, \"model.74\");\n    ITensor* input_tensor_75[] = { conv74->getOutput(0), re73->getOutput(0) };\n    IConcatenationLayer* concat75 = network->addConcatenation(input_tensor_75, 2);\n    IElementWiseLayer* conv76 = convBnSilu(network, weightMap, *concat75->getOutput(0), 128, 1, 1, 0, \"model.76\");\n    IElementWiseLayer* conv77 = convBnSilu(network, weightMap, *concat75->getOutput(0), 128, 1, 1, 0, \"model.77\");\n\n    IElementWiseLayer* conv78 = convBnSilu(network, weightMap, *conv77->getOutput(0), 64, 3, 1, 1, \"model.78\");\n    IElementWiseLayer* conv79 = convBnSilu(network, weightMap, *conv78->getOutput(0), 64, 3, 1, 1, \"model.79\");\n    IElementWiseLayer* conv80 = convBnSilu(network, weightMap, *conv79->getOutput(0), 64, 3, 1, 1, \"model.80\");\n    IElementWiseLayer* conv81 = convBnSilu(network, weightMap, *conv80->getOutput(0), 64, 3, 1, 1, \"model.81\");\n    ITensor* input_tensor_82[] = { conv81->getOutput(0), conv80->getOutput(0), conv79->getOutput(0), conv78->getOutput(0), conv77->getOutput(0), 
conv76->getOutput(0) };\n    IConcatenationLayer* concat82 = network->addConcatenation(input_tensor_82, 6);\n\n    IElementWiseLayer* conv83 = convBnSilu(network, weightMap, *concat82->getOutput(0), 128, 1, 1, 0, \"model.83\");\n\n    IElementWiseLayer* conv84 = convBnSilu(network, weightMap, *conv83->getOutput(0), 256, 3, 2, 1, \"model.84\");\n    ITensor* input_tensor_85[] = { conv84->getOutput(0), conv71->getOutput(0) };\n    IConcatenationLayer* concat85 = network->addConcatenation(input_tensor_85, 2);\n\n    IElementWiseLayer* conv86 = convBnSilu(network, weightMap, *concat85->getOutput(0), 256, 1, 1, 0, \"model.86\");\n    IElementWiseLayer* conv87 = convBnSilu(network, weightMap, *concat85->getOutput(0), 256, 1, 1, 0, \"model.87\");\n    IElementWiseLayer* conv88 = convBnSilu(network, weightMap, *conv87->getOutput(0), 128, 3, 1, 1, \"model.88\");\n    IElementWiseLayer* conv89 = convBnSilu(network, weightMap, *conv88->getOutput(0), 128, 3, 1, 1, \"model.89\");\n    IElementWiseLayer* conv90 = convBnSilu(network, weightMap, *conv89->getOutput(0), 128, 3, 1, 1, \"model.90\");\n    IElementWiseLayer* conv91 = convBnSilu(network, weightMap, *conv90->getOutput(0), 128, 3, 1, 1, \"model.91\");\n\n    ITensor* input_tensor_92[] = { conv91->getOutput(0), conv90->getOutput(0), conv89->getOutput(0), conv88->getOutput(0), conv87->getOutput(0), conv86->getOutput(0) };\n    IConcatenationLayer* concat92 = network->addConcatenation(input_tensor_92, 6);\n\n    IElementWiseLayer* conv93 = convBnSilu(network, weightMap, *concat92->getOutput(0), 256, 1, 1, 0, \"model.93\");\n\n    IElementWiseLayer* conv94 = convBnSilu(network, weightMap, *conv93->getOutput(0), 384, 3, 2, 1, \"model.94\");\n    ITensor* input_tensor_95[] = { conv94->getOutput(0), conv59->getOutput(0) };\n    IConcatenationLayer* concat95 = network->addConcatenation(input_tensor_95, 2);\n\n    IElementWiseLayer* conv96 = convBnSilu(network, weightMap, *concat95->getOutput(0), 384, 1, 1, 0, \"model.96\");\n    
IElementWiseLayer* conv97 = convBnSilu(network, weightMap, *concat95->getOutput(0), 384, 1, 1, 0, \"model.97\");\n\n    IElementWiseLayer* conv98 = convBnSilu(network, weightMap, *conv97->getOutput(0), 192, 3, 1, 1, \"model.98\");\n    IElementWiseLayer* conv99 = convBnSilu(network, weightMap, *conv98->getOutput(0), 192, 3, 1, 1, \"model.99\");\n    IElementWiseLayer* conv100 = convBnSilu(network, weightMap, *conv99->getOutput(0), 192, 3, 1, 1, \"model.100\");\n    IElementWiseLayer* conv101 = convBnSilu(network, weightMap, *conv100->getOutput(0), 192, 3, 1, 1, \"model.101\");\n    ITensor* input_tensor_102[] = { conv101->getOutput(0), conv100->getOutput(0), conv99->getOutput(0), conv98->getOutput(0), conv97->getOutput(0), conv96->getOutput(0) };\n    IConcatenationLayer* concat102 = network->addConcatenation(input_tensor_102, 6);\n    IElementWiseLayer* conv103 = convBnSilu(network, weightMap, *concat102->getOutput(0), 384, 1, 1, 0, \"model.103\");\n\n    IElementWiseLayer* conv104 = convBnSilu(network, weightMap, *conv103->getOutput(0), 512, 3, 2, 1, \"model.104\");\n\n    ITensor* input_tensor_105[] = { conv104->getOutput(0), conv47->getOutput(0) };\n    IConcatenationLayer* concat105 = network->addConcatenation(input_tensor_105, 2);\n\n    IElementWiseLayer* conv106 = convBnSilu(network, weightMap, *concat105->getOutput(0), 512, 1, 1, 0, \"model.106\");\n    IElementWiseLayer* conv107 = convBnSilu(network, weightMap, *concat105->getOutput(0), 512, 1, 1, 0, \"model.107\");\n\n    IElementWiseLayer* conv108 = convBnSilu(network, weightMap, *conv107->getOutput(0), 256, 3, 1, 1, \"model.108\");\n    IElementWiseLayer* conv109 = convBnSilu(network, weightMap, *conv108->getOutput(0), 256, 3, 1, 1, \"model.109\");\n    IElementWiseLayer* conv110 = convBnSilu(network, weightMap, *conv109->getOutput(0), 256, 3, 1, 1, \"model.110\");\n    IElementWiseLayer* conv111 = convBnSilu(network, weightMap, *conv110->getOutput(0), 256, 3, 1, 1, \"model.111\");\n    ITensor* 
input_tensor_112[] = { conv111->getOutput(0), conv110->getOutput(0), conv109->getOutput(0), conv108->getOutput(0), conv107->getOutput(0), conv106->getOutput(0) };\n    IConcatenationLayer* concat112 = network->addConcatenation(input_tensor_112, 6);\n\n    IElementWiseLayer* conv113 = convBnSilu(network, weightMap, *concat112->getOutput(0), 512, 1, 1, 0, \"model.113\");\n    IElementWiseLayer* conv114 = convBnSilu(network, weightMap, *conv83->getOutput(0), 256, 3, 1, 1, \"model.114\");\n    IElementWiseLayer* conv115 = convBnSilu(network, weightMap, *conv93->getOutput(0), 512, 3, 1, 1, \"model.115\");\n    IElementWiseLayer* conv116 = convBnSilu(network, weightMap, *conv103->getOutput(0), 768, 3, 1, 1, \"model.116\");\n    IElementWiseLayer* conv117 = convBnSilu(network, weightMap, *conv113->getOutput(0), 1024, 3, 1, 1, \"model.117\");\n\n    // out\n    IConvolutionLayer* cv105_0 = network->addConvolutionNd(*conv114->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.118.m.0.weight\"], weightMap[\"model.118.m.0.bias\"]);\n    assert(cv105_0);\n    cv105_0->setName(\"cv105.0\");\n    IConvolutionLayer* cv105_1 = network->addConvolutionNd(*conv115->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.118.m.1.weight\"], weightMap[\"model.118.m.1.bias\"]);\n    assert(cv105_1);\n    cv105_1->setName(\"cv105.1\");\n    IConvolutionLayer* cv105_2 = network->addConvolutionNd(*conv116->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.118.m.2.weight\"], weightMap[\"model.118.m.2.bias\"]);\n    assert(cv105_2);\n    cv105_2->setName(\"cv105.2\");\n    IConvolutionLayer* cv105_3 = network->addConvolutionNd(*conv117->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.118.m.3.weight\"], weightMap[\"model.118.m.3.bias\"]);\n    assert(cv105_3);\n    cv105_3->setName(\"cv105.3\");\n\n    /*------------detect-----------*/\n    auto yolo = addYoLoLayer(network, weightMap, 
\"model.118\", std::vector<IConvolutionLayer*>{cv105_0, cv105_1, cv105_2, cv105_3});\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\n\nIHostMemory* build_engine_yolov7x(unsigned int maxBatchSize,IBuilder* builder, IBuilderConfig* config, DataType dt, const std::string& wts_path) {\n    std::map<std::string, Weights> weightMap = loadWeights(wts_path);\n\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kInputH, kInputW });\n    assert(data);\n\n    /*----------------------------------yolov7x backbone-----------------------------------------*/\n    IElementWiseLayer* conv0 = convBnSilu(network, weightMap, *data, 40, 3, 1, 1, \"model.0\");\n\n    IElementWiseLayer* conv1 = convBnSilu(network, weightMap, *conv0->getOutput(0), 80, 3, 2, 1, \"model.1\");\n    IElementWiseLayer* conv2 = convBnSilu(network, weightMap, 
*conv1->getOutput(0), 80, 3, 1, 1, \"model.2\");\n    IElementWiseLayer* conv3 = convBnSilu(network, weightMap, *conv2->getOutput(0), 160, 3, 2, 1, \"model.3\");\n\n    IElementWiseLayer* conv4 = convBnSilu(network, weightMap, *conv3->getOutput(0), 64, 1, 1, 0, \"model.4\");\n\n    IElementWiseLayer* conv5 = convBnSilu(network, weightMap, *conv3->getOutput(0), 64, 1, 1, 0, \"model.5\");\n    IElementWiseLayer* conv6 = convBnSilu(network, weightMap, *conv5->getOutput(0), 64, 3, 1, 1, \"model.6\");\n    IElementWiseLayer* conv7 = convBnSilu(network, weightMap, *conv6->getOutput(0), 64, 3, 1, 1, \"model.7\");\n    IElementWiseLayer* conv8 = convBnSilu(network, weightMap, *conv7->getOutput(0), 64, 3, 1, 1, \"model.8\");\n    IElementWiseLayer* conv9 = convBnSilu(network, weightMap, *conv8->getOutput(0), 64, 3, 1, 1, \"model.9\");\n    IElementWiseLayer* conv10 = convBnSilu(network, weightMap, *conv9->getOutput(0), 64, 3, 1, 1, \"model.10\");\n    IElementWiseLayer* conv11 = convBnSilu(network, weightMap, *conv10->getOutput(0), 64, 3, 1, 1, \"model.11\");\n\n    ITensor* input_tensor_12[] = { conv11->getOutput(0), conv9->getOutput(0), conv7->getOutput(0), conv5->getOutput(0), conv4->getOutput(0) };\n    IConcatenationLayer* concat12 = network->addConcatenation(input_tensor_12, 5);\n    //concat9->setAxis(0);\n    IElementWiseLayer* conv13 = convBnSilu(network, weightMap, *concat12->getOutput(0), 320, 1, 1, 0, \"model.13\");\n\n    IPoolingLayer* mp1 = network->addPoolingNd(*conv13->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    mp1->setStrideNd(DimsHW{ 2, 2 });\n    IElementWiseLayer* conv15 = convBnSilu(network, weightMap, *mp1->getOutput(0), 160, 1, 1, 0, \"model.15\");\n\n    IElementWiseLayer* conv16 = convBnSilu(network, weightMap, *conv13->getOutput(0), 160, 1, 1, 0, \"model.16\");\n    IElementWiseLayer* conv17 = convBnSilu(network, weightMap, *conv16->getOutput(0), 160, 3, 2, 1, \"model.17\");\n    ITensor* input_tensor_18[] = { conv17->getOutput(0), 
conv15->getOutput(0) };\n    IConcatenationLayer* concat18 = network->addConcatenation(input_tensor_18, 2);\n\n    //IConcatenationLayer* mp1 = MPC3(network, weightMap, *conv13->getOutput(0), 160, \"model.15\", \"model.16\", \"model.17\");\n\n\n    IElementWiseLayer* conv19 = convBnSilu(network, weightMap, *concat18->getOutput(0), 128, 1, 1, 0, \"model.19\");\n\n    IElementWiseLayer* conv20 = convBnSilu(network, weightMap, *concat18->getOutput(0), 128, 1, 1, 0, \"model.20\");\n    IElementWiseLayer* conv21 = convBnSilu(network, weightMap, *conv20->getOutput(0), 128, 3, 1, 1, \"model.21\");\n    IElementWiseLayer* conv22 = convBnSilu(network, weightMap, *conv21->getOutput(0), 128, 3, 1, 1, \"model.22\");\n    IElementWiseLayer* conv23 = convBnSilu(network, weightMap, *conv22->getOutput(0), 128, 3, 1, 1, \"model.23\");\n    IElementWiseLayer* conv24 = convBnSilu(network, weightMap, *conv23->getOutput(0), 128, 3, 1, 1, \"model.24\");\n    IElementWiseLayer* conv25 = convBnSilu(network, weightMap, *conv24->getOutput(0), 128, 3, 1, 1, \"model.25\");\n    IElementWiseLayer* conv26 = convBnSilu(network, weightMap, *conv25->getOutput(0), 128, 3, 1, 1, \"model.26\");\n\n    ITensor* input_tensor_27[] = { conv26->getOutput(0), conv24->getOutput(0), conv22->getOutput(0), conv20->getOutput(0), conv19->getOutput(0) };\n    IConcatenationLayer* concat27 = network->addConcatenation(input_tensor_27, 5);\n    //concat9->setAxis(0);\n    IElementWiseLayer* conv28 = convBnSilu(network, weightMap, *concat27->getOutput(0), 640, 1, 1, 0, \"model.28\");\n\n\n    IPoolingLayer* mp2 = network->addPoolingNd(*conv28->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    mp2->setStrideNd(DimsHW{ 2, 2 });\n    IElementWiseLayer* conv30 = convBnSilu(network, weightMap, *mp2->getOutput(0), 320, 1, 1, 0, \"model.30\");\n\n    IElementWiseLayer* conv31 = convBnSilu(network, weightMap, *conv28->getOutput(0), 320, 1, 1, 0, \"model.31\");\n    IElementWiseLayer* conv32 = convBnSilu(network, 
weightMap, *conv31->getOutput(0), 320, 3, 2, 1, \"model.32\");\n\n    ITensor* input_tensor_33[] = { conv32->getOutput(0), conv30->getOutput(0) };\n    IConcatenationLayer* concat33 = network->addConcatenation(input_tensor_33, 2);\n    //IConcatenationLayer* mp2 = MPC3(network, weightMap, *conv28->getOutput(0), 320, \"model.30\", \"model.31\", \"model.32\");\n\n\n    IElementWiseLayer* conv34 = convBnSilu(network, weightMap, *concat33->getOutput(0), 256, 1, 1, 0, \"model.34\");\n\n    IElementWiseLayer* conv35 = convBnSilu(network, weightMap, *concat33->getOutput(0), 256, 1, 1, 0, \"model.35\");\n    IElementWiseLayer* conv36 = convBnSilu(network, weightMap, *conv35->getOutput(0), 256, 3, 1, 1, \"model.36\");\n    IElementWiseLayer* conv37 = convBnSilu(network, weightMap, *conv36->getOutput(0), 256, 3, 1, 1, \"model.37\");\n    IElementWiseLayer* conv38 = convBnSilu(network, weightMap, *conv37->getOutput(0), 256, 3, 1, 1, \"model.38\");\n    IElementWiseLayer* conv39 = convBnSilu(network, weightMap, *conv38->getOutput(0), 256, 3, 1, 1, \"model.39\");\n    IElementWiseLayer* conv40 = convBnSilu(network, weightMap, *conv39->getOutput(0), 256, 3, 1, 1, \"model.40\");\n    IElementWiseLayer* conv41 = convBnSilu(network, weightMap, *conv40->getOutput(0), 256, 3, 1, 1, \"model.41\");\n\n    ITensor* input_tensor_42[] = { conv41->getOutput(0), conv39->getOutput(0), conv37->getOutput(0), conv35->getOutput(0),conv34->getOutput(0) };\n    IConcatenationLayer* concat42 = network->addConcatenation(input_tensor_42, 5);\n    //concat9->setAxis(0);\n    IElementWiseLayer* conv43 = convBnSilu(network, weightMap, *concat42->getOutput(0), 1280, 1, 1, 0, \"model.43\");\n\n\n    IPoolingLayer* mp3 = network->addPoolingNd(*conv43->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    mp3->setStrideNd(DimsHW{ 2, 2 });\n    IElementWiseLayer* conv45 = convBnSilu(network, weightMap, *mp3->getOutput(0), 640, 1, 1, 0, \"model.45\");\n\n    IElementWiseLayer* conv46 = convBnSilu(network, 
weightMap, *conv43->getOutput(0), 640, 1, 1, 0, \"model.46\");\n    IElementWiseLayer* conv47 = convBnSilu(network, weightMap, *conv46->getOutput(0), 640, 3, 2, 1, \"model.47\");\n    ITensor* input_tensor_48[] = { conv47->getOutput(0), conv45->getOutput(0) };\n    IConcatenationLayer* concat48 = network->addConcatenation(input_tensor_48, 2);\n\n    //IConcatenationLayer* mp3 = MPC3(network, weightMap, *conv43->getOutput(0), 640, \"model.45\", \"model.46\", \"model.47\");\n\n\n    IElementWiseLayer* conv49 = convBnSilu(network, weightMap, *concat48->getOutput(0), 256, 1, 1, 0, \"model.49\");\n\n    IElementWiseLayer* conv50 = convBnSilu(network, weightMap, *concat48->getOutput(0), 256, 1, 1, 0, \"model.50\");\n    IElementWiseLayer* conv51 = convBnSilu(network, weightMap, *conv50->getOutput(0), 256, 3, 1, 1, \"model.51\");\n    IElementWiseLayer* conv52 = convBnSilu(network, weightMap, *conv51->getOutput(0), 256, 3, 1, 1, \"model.52\");\n    IElementWiseLayer* conv53 = convBnSilu(network, weightMap, *conv52->getOutput(0), 256, 3, 1, 1, \"model.53\");\n    IElementWiseLayer* conv54 = convBnSilu(network, weightMap, *conv53->getOutput(0), 256, 3, 1, 1, \"model.54\");\n    IElementWiseLayer* conv55 = convBnSilu(network, weightMap, *conv54->getOutput(0), 256, 3, 1, 1, \"model.55\");\n    IElementWiseLayer* conv56 = convBnSilu(network, weightMap, *conv55->getOutput(0), 256, 3, 1, 1, \"model.56\");\n\n    ITensor* input_tensor_57[] = { conv56->getOutput(0), conv54->getOutput(0), conv52->getOutput(0), conv50->getOutput(0),conv49->getOutput(0) };\n    IConcatenationLayer* concat57 = network->addConcatenation(input_tensor_57, 5);\n    //concat9->setAxis(0);\n    IElementWiseLayer* conv58 = convBnSilu(network, weightMap, *concat57->getOutput(0), 1280, 1, 1, 0, \"model.58\");\n\n\n    //-----------------------yolov7 head---------------------------\n    //-----SPPCSPC-----------\n    IElementWiseLayer* conv59 = SPPCSPC(network, weightMap, *conv58->getOutput(0), 640, 
\"model.59\");\n\n    IElementWiseLayer* conv60 = convBnSilu(network, weightMap, *conv59->getOutput(0), 320, 1, 1, 0, \"model.60\");\n\n\n    float scale[] = { 1.0, 2.0, 2.0 };\n    IResizeLayer* re61 = network->addResize(*conv60->getOutput(0));\n    re61->setResizeMode(ResizeMode::kNEAREST);\n    re61->setScales(scale, 3);\n\n    IElementWiseLayer* conv62 = convBnSilu(network, weightMap, *conv43->getOutput(0), 320, 1, 1, 0, \"model.62\");\n\n\n    ITensor* input_tensor_63[] = { conv62->getOutput(0), re61->getOutput(0) };\n    IConcatenationLayer* concat63 = network->addConcatenation(input_tensor_63, 2);\n    //concat63->setAxis(0);\n\n\n    IElementWiseLayer* conv64 = convBnSilu(network, weightMap, *concat63->getOutput(0), 256, 1, 1, 0, \"model.64\");\n\n    IElementWiseLayer* conv65 = convBnSilu(network, weightMap, *concat63->getOutput(0), 256, 1, 1, 0, \"model.65\");\n    IElementWiseLayer* conv66 = convBnSilu(network, weightMap, *conv65->getOutput(0), 256, 3, 1, 1, \"model.66\");\n    IElementWiseLayer* conv67 = convBnSilu(network, weightMap, *conv66->getOutput(0), 256, 3, 1, 1, \"model.67\");\n    IElementWiseLayer* conv68 = convBnSilu(network, weightMap, *conv67->getOutput(0), 256, 3, 1, 1, \"model.68\");\n    IElementWiseLayer* conv69 = convBnSilu(network, weightMap, *conv68->getOutput(0), 256, 3, 1, 1, \"model.69\");\n    IElementWiseLayer* conv70 = convBnSilu(network, weightMap, *conv69->getOutput(0), 256, 3, 1, 1, \"model.70\");\n    IElementWiseLayer* conv71 = convBnSilu(network, weightMap, *conv70->getOutput(0), 256, 3, 1, 1, \"model.71\");\n\n    ITensor* input_tensor_72[] = { conv71->getOutput(0), conv69->getOutput(0), conv67->getOutput(0), conv65->getOutput(0),conv64->getOutput(0) };\n    IConcatenationLayer* concat72 = network->addConcatenation(input_tensor_72, 5);\n    //concat9->setAxis(0);\n    IElementWiseLayer* conv73 = convBnSilu(network, weightMap, *concat72->getOutput(0), 320, 1, 1, 0, \"model.73\");\n\n    IElementWiseLayer* conv74 = 
convBnSilu(network, weightMap, *conv73->getOutput(0), 160, 1, 1, 0, \"model.74\");\n\n    IResizeLayer* re75 = network->addResize(*conv74->getOutput(0));\n    re75->setResizeMode(ResizeMode::kNEAREST);\n    re75->setScales(scale, 3);\n\n\n    IElementWiseLayer* conv76 = convBnSilu(network, weightMap, *conv28->getOutput(0), 160, 1, 1, 0, \"model.76\");\n\n\n    ITensor* input_tensor_77[] = { conv76->getOutput(0), re75->getOutput(0) };\n    IConcatenationLayer* concat77 = network->addConcatenation(input_tensor_77, 2);\n\n\n\n    IElementWiseLayer* conv78 = convBnSilu(network, weightMap, *concat77->getOutput(0), 128, 1, 1, 0, \"model.78\");\n\n    IElementWiseLayer* conv79 = convBnSilu(network, weightMap, *concat77->getOutput(0), 128, 1, 1, 0, \"model.79\");\n    IElementWiseLayer* conv80 = convBnSilu(network, weightMap, *conv79->getOutput(0), 128, 3, 1, 1, \"model.80\");\n    IElementWiseLayer* conv81 = convBnSilu(network, weightMap, *conv80->getOutput(0), 128, 3, 1, 1, \"model.81\");\n    IElementWiseLayer* conv82 = convBnSilu(network, weightMap, *conv81->getOutput(0), 128, 3, 1, 1, \"model.82\");\n    IElementWiseLayer* conv83 = convBnSilu(network, weightMap, *conv82->getOutput(0), 128, 3, 1, 1, \"model.83\");\n    IElementWiseLayer* conv84 = convBnSilu(network, weightMap, *conv83->getOutput(0), 128, 3, 1, 1, \"model.84\");\n    IElementWiseLayer* conv85 = convBnSilu(network, weightMap, *conv84->getOutput(0), 128, 3, 1, 1, \"model.85\");\n\n\n    ITensor* input_tensor_86[] = { conv85->getOutput(0), conv83->getOutput(0), conv81->getOutput(0), conv79->getOutput(0),conv78->getOutput(0) };\n    IConcatenationLayer* concat86 = network->addConcatenation(input_tensor_86, 5);\n    //concat9->setAxis(0);\n    IElementWiseLayer* conv87 = convBnSilu(network, weightMap, *concat86->getOutput(0), 160, 1, 1, 0, \"model.87\");\n\n\n    IPoolingLayer* mp88 = network->addPoolingNd(*conv87->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    mp88->setStrideNd(DimsHW{ 2, 2 });\n    
IElementWiseLayer* conv89 = convBnSilu(network, weightMap, *mp88->getOutput(0), 160, 1, 1, 0, \"model.89\");\n\n    IElementWiseLayer* conv90 = convBnSilu(network, weightMap, *conv87->getOutput(0), 160, 1, 1, 0, \"model.90\");\n    IElementWiseLayer* conv91 = convBnSilu(network, weightMap, *conv90->getOutput(0), 160, 3, 2, 1, \"model.91\");\n\n\n    ITensor* input_tensor_92[] = { conv91->getOutput(0), conv89->getOutput(0),conv73->getOutput(0) };\n    IConcatenationLayer* concat92 = network->addConcatenation(input_tensor_92, 3);\n\n\n    IElementWiseLayer* conv93 = convBnSilu(network, weightMap, *concat92->getOutput(0), 256, 1, 1, 0, \"model.93\");\n\n    IElementWiseLayer* conv94 = convBnSilu(network, weightMap, *concat92->getOutput(0), 256, 1, 1, 0, \"model.94\");\n    IElementWiseLayer* conv95 = convBnSilu(network, weightMap, *conv94->getOutput(0), 256, 3, 1, 1, \"model.95\");\n    IElementWiseLayer* conv96 = convBnSilu(network, weightMap, *conv95->getOutput(0), 256, 3, 1, 1, \"model.96\");\n    IElementWiseLayer* conv97 = convBnSilu(network, weightMap, *conv96->getOutput(0), 256, 3, 1, 1, \"model.97\");\n    IElementWiseLayer* conv98 = convBnSilu(network, weightMap, *conv97->getOutput(0), 256, 3, 1, 1, \"model.98\");\n    IElementWiseLayer* conv99 = convBnSilu(network, weightMap, *conv98->getOutput(0), 256, 3, 1, 1, \"model.99\");\n    IElementWiseLayer* conv100 = convBnSilu(network, weightMap, *conv99->getOutput(0), 256, 3, 1, 1, \"model.100\");\n\n\n    ITensor* input_tensor_101[] = { conv100->getOutput(0), conv98->getOutput(0), conv96->getOutput(0), conv94->getOutput(0),conv93->getOutput(0) };\n    IConcatenationLayer* concat101 = network->addConcatenation(input_tensor_101, 5);\n    //concat9->setAxis(0);\n    IElementWiseLayer* conv102 = convBnSilu(network, weightMap, *concat101->getOutput(0), 320, 1, 1, 0, \"model.102\");\n\n    IPoolingLayer* mp103 = network->addPoolingNd(*conv102->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    
mp103->setStrideNd(DimsHW{ 2, 2 });\n    IElementWiseLayer* conv104 = convBnSilu(network, weightMap, *mp103->getOutput(0), 320, 1, 1, 0, \"model.104\");\n\n    IElementWiseLayer* conv105 = convBnSilu(network, weightMap, *conv102->getOutput(0), 320, 1, 1, 0, \"model.105\");\n    IElementWiseLayer* conv106 = convBnSilu(network, weightMap, *conv105->getOutput(0), 320, 3, 2, 1, \"model.106\");\n\n\n    ITensor* input_tensor_107[] = { conv106->getOutput(0), conv104->getOutput(0),conv59->getOutput(0) };\n    IConcatenationLayer* concat107 = network->addConcatenation(input_tensor_107, 3);\n\n\n\n    IElementWiseLayer* conv108 = convBnSilu(network, weightMap, *concat107->getOutput(0), 512, 1, 1, 0, \"model.108\");\n\n    IElementWiseLayer* conv109 = convBnSilu(network, weightMap, *concat107->getOutput(0), 512, 1, 1, 0, \"model.109\");\n    IElementWiseLayer* conv110 = convBnSilu(network, weightMap, *conv109->getOutput(0), 512, 3, 1, 1, \"model.110\");\n    IElementWiseLayer* conv111 = convBnSilu(network, weightMap, *conv110->getOutput(0), 512, 3, 1, 1, \"model.111\");\n    IElementWiseLayer* conv112 = convBnSilu(network, weightMap, *conv111->getOutput(0), 512, 3, 1, 1, \"model.112\");\n    IElementWiseLayer* conv113 = convBnSilu(network, weightMap, *conv112->getOutput(0), 512, 3, 1, 1, \"model.113\");\n    IElementWiseLayer* conv114 = convBnSilu(network, weightMap, *conv113->getOutput(0), 512, 3, 1, 1, \"model.114\");\n    IElementWiseLayer* conv115 = convBnSilu(network, weightMap, *conv114->getOutput(0), 512, 3, 1, 1, \"model.115\");\n\n    ITensor* input_tensor_116[] = { conv115->getOutput(0), conv113->getOutput(0), conv111->getOutput(0), conv109->getOutput(0),conv108->getOutput(0) };\n    IConcatenationLayer* concat116 = network->addConcatenation(input_tensor_116, 5);\n    //concat9->setAxis(0);\n    IElementWiseLayer* conv117 = convBnSilu(network, weightMap, *concat116->getOutput(0), 640, 1, 1, 0, \"model.117\");\n\n\n    IElementWiseLayer* con_0 = convBnSilu(network, 
weightMap, *conv87->getOutput(0), 320, 3, 1, 1, \"model.118\");\n    IElementWiseLayer* con_1 = convBnSilu(network, weightMap, *conv102->getOutput(0), 640, 3, 1, 1, \"model.119\");\n    IElementWiseLayer* con_2 = convBnSilu(network, weightMap, *conv117->getOutput(0), 1280, 3, 1, 1, \"model.120\");\n\n\n    /*----------------------------------yolov7x out-----------------------------------------*/\n    IConvolutionLayer* det0 = network->addConvolutionNd(*con_0->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.121.m.0.weight\"], weightMap[\"model.121.m.0.bias\"]);\n    assert(det0);\n    det0->setName(\"det0\");\n    IConvolutionLayer* det1 = network->addConvolutionNd(*con_1->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.121.m.1.weight\"], weightMap[\"model.121.m.1.bias\"]);\n    assert(det1);\n    det1->setName(\"det1\");\n    IConvolutionLayer* det2 = network->addConvolutionNd(*con_2->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.121.m.2.weight\"], weightMap[\"model.121.m.2.bias\"]);\n    assert(det2);\n    det2->setName(\"det2\");\n\n    auto yolo = addYoLoLayer(network, weightMap, \"model.121\", std::vector<IConvolutionLayer*>{det0, det1, det2});\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Don't need the network any more\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nIHostMemory* build_engine_yolov7(unsigned int maxBatchSize,IBuilder* builder, IBuilderConfig* config, DataType dt, const std::string& wts_path) {\n    std::map<std::string, Weights> weightMap = loadWeights(wts_path);\n\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kInputH, kInputW });\n    assert(data);\n    /*----------------------------------yolov7 backbone-----------------------------------------*/\n    IElementWiseLayer* conv0 = convBnSilu(network, weightMap, *data, 32, 3, 1, 1, \"model.0\");\n\n    IElementWiseLayer* conv1 = convBnSilu(network, weightMap, *conv0->getOutput(0), 64, 3, 2, 1, \"model.1\");\n    IElementWiseLayer* conv2 = convBnSilu(network, weightMap, *conv1->getOutput(0), 64, 3, 1, 1, \"model.2\");\n\n    IElementWiseLayer* conv3 = convBnSilu(network, weightMap, *conv2->getOutput(0), 128, 3, 2, 1, \"model.3\");\n    IElementWiseLayer* conv4 = convBnSilu(network, weightMap, *conv3->getOutput(0), 64, 1, 1, 0, \"model.4\");\n    IElementWiseLayer* conv5 = convBnSilu(network, weightMap, *conv3->getOutput(0), 64, 1, 1, 0, \"model.5\");\n    IElementWiseLayer* conv6 = convBnSilu(network, weightMap, 
*conv5->getOutput(0), 64, 3, 1, 1, \"model.6\");\n    IElementWiseLayer* conv7 = convBnSilu(network, weightMap, *conv6->getOutput(0), 64, 3, 1, 1, \"model.7\");\n    IElementWiseLayer* conv8 = convBnSilu(network, weightMap, *conv7->getOutput(0), 64, 3, 1, 1, \"model.8\");\n    IElementWiseLayer* conv9 = convBnSilu(network, weightMap, *conv8->getOutput(0), 64, 3, 1, 1, \"model.9\");\n    ITensor* input_tensor_10[] = { conv9->getOutput(0), conv7->getOutput(0), conv5->getOutput(0), conv4->getOutput(0) };\n    IConcatenationLayer* concat10 = network->addConcatenation(input_tensor_10, 4);\n    concat10->setAxis(0);\n    IElementWiseLayer* conv11 = convBnSilu(network, weightMap, *concat10->getOutput(0), 256, 1, 1, 0, \"model.11\");\n\n    IPoolingLayer* mp12 = network->addPoolingNd(*conv11->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    mp12->setStrideNd(DimsHW{ 2, 2 });\n    IElementWiseLayer* conv13 = convBnSilu(network, weightMap, *mp12->getOutput(0), 128, 1, 1, 0, \"model.13\");\n    IElementWiseLayer* conv14 = convBnSilu(network, weightMap, *conv11->getOutput(0), 128, 1, 1, 0, \"model.14\");\n    IElementWiseLayer* conv15 = convBnSilu(network, weightMap, *conv14->getOutput(0), 128, 3, 2, 1, \"model.15\");\n    ITensor* input_tensor_16[] = { conv15->getOutput(0), conv13->getOutput(0) };\n    IConcatenationLayer* concat16 = network->addConcatenation(input_tensor_16, 2);\n    IElementWiseLayer* conv17 = convBnSilu(network, weightMap, *concat16->getOutput(0), 128, 1, 1, 0, \"model.17\");\n    IElementWiseLayer* conv18 = convBnSilu(network, weightMap, *concat16->getOutput(0), 128, 1, 1, 0, \"model.18\");\n    IElementWiseLayer* conv19 = convBnSilu(network, weightMap, *conv18->getOutput(0), 128, 3, 1, 1, \"model.19\");\n    IElementWiseLayer* conv20 = convBnSilu(network, weightMap, *conv19->getOutput(0), 128, 3, 1, 1, \"model.20\");\n    IElementWiseLayer* conv21 = convBnSilu(network, weightMap, *conv20->getOutput(0), 128, 3, 1, 1, \"model.21\");\n    
IElementWiseLayer* conv22 = convBnSilu(network, weightMap, *conv21->getOutput(0), 128, 3, 1, 1, \"model.22\");\n    ITensor* input_tensor_23[] = { conv22->getOutput(0), conv20->getOutput(0), conv18->getOutput(0), conv17->getOutput(0) };\n    IConcatenationLayer* concat23 = network->addConcatenation(input_tensor_23, 4);\n    concat23->setAxis(0);\n    IElementWiseLayer* conv24 = convBnSilu(network, weightMap, *concat23->getOutput(0), 512, 1, 1, 0, \"model.24\");\n\n    IPoolingLayer* mp25 = network->addPoolingNd(*conv24->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    mp25->setStrideNd(DimsHW{ 2, 2 });\n    IElementWiseLayer* conv26 = convBnSilu(network, weightMap, *mp25->getOutput(0), 256, 1, 1, 0, \"model.26\");\n    IElementWiseLayer* conv27 = convBnSilu(network, weightMap, *conv24->getOutput(0), 256, 1, 1, 0, \"model.27\");\n    IElementWiseLayer* conv28 = convBnSilu(network, weightMap, *conv27->getOutput(0), 256, 3, 2, 1, \"model.28\");\n    ITensor* input_tensor_29[] = { conv28->getOutput(0), conv26->getOutput(0) };\n    IConcatenationLayer* concat29 = network->addConcatenation(input_tensor_29, 2);\n    IElementWiseLayer* conv30 = convBnSilu(network, weightMap, *concat29->getOutput(0), 256, 1, 1, 0, \"model.30\");\n    IElementWiseLayer* conv31 = convBnSilu(network, weightMap, *concat29->getOutput(0), 256, 1, 1, 0, \"model.31\");\n    IElementWiseLayer* conv32 = convBnSilu(network, weightMap, *conv31->getOutput(0), 256, 3, 1, 1, \"model.32\");\n    IElementWiseLayer* conv33 = convBnSilu(network, weightMap, *conv32->getOutput(0), 256, 3, 1, 1, \"model.33\");\n    IElementWiseLayer* conv34 = convBnSilu(network, weightMap, *conv33->getOutput(0), 256, 3, 1, 1, \"model.34\");\n    IElementWiseLayer* conv35 = convBnSilu(network, weightMap, *conv34->getOutput(0), 256, 3, 1, 1, \"model.35\");\n    ITensor* input_tensor_36[] = { conv35->getOutput(0), conv33->getOutput(0), conv31->getOutput(0), conv30->getOutput(0) };\n    IConcatenationLayer* concat36 = 
network->addConcatenation(input_tensor_36, 4);\n    concat36->setAxis(0);\n    IElementWiseLayer* conv37 = convBnSilu(network, weightMap, *concat36->getOutput(0), 1024, 1, 1, 0, \"model.37\");\n\n    IPoolingLayer* mp38 = network->addPoolingNd(*conv37->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    mp38->setStrideNd(DimsHW{ 2, 2 });\n    IElementWiseLayer* conv39 = convBnSilu(network, weightMap, *mp38->getOutput(0), 512, 1, 1, 0, \"model.39\");\n    IElementWiseLayer* conv40 = convBnSilu(network, weightMap, *conv37->getOutput(0), 512, 1, 1, 0, \"model.40\");\n    IElementWiseLayer* conv41 = convBnSilu(network, weightMap, *conv40->getOutput(0), 512, 3, 2, 1, \"model.41\");\n    ITensor* input_tensor_42[] = { conv41->getOutput(0), conv39->getOutput(0) };\n    IConcatenationLayer* concat42 = network->addConcatenation(input_tensor_42, 2);\n    concat42->setAxis(0);\n    IElementWiseLayer* conv43 = convBnSilu(network, weightMap, *concat42->getOutput(0), 256, 1, 1, 0, \"model.43\");\n    IElementWiseLayer* conv44 = convBnSilu(network, weightMap, *concat42->getOutput(0), 256, 1, 1, 0, \"model.44\");\n    IElementWiseLayer* conv45 = convBnSilu(network, weightMap, *conv44->getOutput(0), 256, 3, 1, 1, \"model.45\");\n    IElementWiseLayer* conv46 = convBnSilu(network, weightMap, *conv45->getOutput(0), 256, 3, 1, 1, \"model.46\");\n    IElementWiseLayer* conv47 = convBnSilu(network, weightMap, *conv46->getOutput(0), 256, 3, 1, 1, \"model.47\");\n    IElementWiseLayer* conv48 = convBnSilu(network, weightMap, *conv47->getOutput(0), 256, 3, 1, 1, \"model.48\");\n    ITensor* input_tensor_49[] = { conv48->getOutput(0), conv46->getOutput(0), conv44->getOutput(0), conv43->getOutput(0) };\n    IConcatenationLayer* concat49 = network->addConcatenation(input_tensor_49, 4);\n    concat49->setAxis(0);\n    IElementWiseLayer* conv50 = convBnSilu(network, weightMap, *concat49->getOutput(0), 1024, 1, 1, 0, \"model.50\");\n\n    /*----------------------------------yolov7 
head-----------------------------------------*/\n    IElementWiseLayer* conv51 = SPPCSPC(network, weightMap, *conv50->getOutput(0), 512, \"model.51\");\n\n    IElementWiseLayer* conv52 = convBnSilu(network, weightMap, *conv51->getOutput(0), 256, 1, 1, 0, \"model.52\");\n    float scale[] = { 1.0, 2.0, 2.0 };\n    IResizeLayer* re53 = network->addResize(*conv52->getOutput(0));\n    re53->setResizeMode(ResizeMode::kNEAREST);\n    re53->setScales(scale, 3);\n    IElementWiseLayer* conv54 = convBnSilu(network, weightMap, *conv37->getOutput(0), 256, 1, 1, 0, \"model.54\");\n    ITensor* input_tensor_55[] = { conv54->getOutput(0), re53->getOutput(0) };\n    IConcatenationLayer* concat55 = network->addConcatenation(input_tensor_55, 2);\n    concat55->setAxis(0);\n\n    IElementWiseLayer* conv56 = convBnSilu(network, weightMap, *concat55->getOutput(0), 256, 1, 1, 0, \"model.56\");\n    IElementWiseLayer* conv57 = convBnSilu(network, weightMap, *concat55->getOutput(0), 256, 1, 1, 0, \"model.57\");\n    IElementWiseLayer* conv58 = convBnSilu(network, weightMap, *conv57->getOutput(0), 128, 3, 1, 1, \"model.58\");\n    IElementWiseLayer* conv59 = convBnSilu(network, weightMap, *conv58->getOutput(0), 128, 3, 1, 1, \"model.59\");\n    IElementWiseLayer* conv60 = convBnSilu(network, weightMap, *conv59->getOutput(0), 128, 3, 1, 1, \"model.60\");\n    IElementWiseLayer* conv61 = convBnSilu(network, weightMap, *conv60->getOutput(0), 128, 3, 1, 1, \"model.61\");\n    ITensor* input_tensor_62[] = { conv61->getOutput(0), conv60->getOutput(0), conv59->getOutput(0), conv58->getOutput(0), conv57->getOutput(0), conv56->getOutput(0) };\n    IConcatenationLayer* concat62 = network->addConcatenation(input_tensor_62, 6);\n    concat62->setAxis(0);\n    IElementWiseLayer* conv63 = convBnSilu(network, weightMap, *concat62->getOutput(0), 256, 1, 1, 0, \"model.63\");\n\n    IElementWiseLayer* conv64 = convBnSilu(network, weightMap, *conv63->getOutput(0), 128, 1, 1, 0, \"model.64\");\n    
IResizeLayer* re65 = network->addResize(*conv64->getOutput(0));\n    re65->setResizeMode(ResizeMode::kNEAREST);\n    re65->setScales(scale, 3);\n    IElementWiseLayer* conv66 = convBnSilu(network, weightMap, *conv24->getOutput(0), 128, 1, 1, 0, \"model.66\");\n    ITensor* input_tensor_67[] = { conv66->getOutput(0), re65->getOutput(0) };\n    IConcatenationLayer* concat67 = network->addConcatenation(input_tensor_67, 2);\n    concat67->setAxis(0);\n\n    IElementWiseLayer* conv68 = convBnSilu(network, weightMap, *concat67->getOutput(0), 128, 1, 1, 0, \"model.68\");\n    IElementWiseLayer* conv69 = convBnSilu(network, weightMap, *concat67->getOutput(0), 128, 1, 1, 0, \"model.69\");\n    IElementWiseLayer* conv70 = convBnSilu(network, weightMap, *conv69->getOutput(0), 64, 3, 1, 1, \"model.70\");\n    IElementWiseLayer* conv71 = convBnSilu(network, weightMap, *conv70->getOutput(0), 64, 3, 1, 1, \"model.71\");\n    IElementWiseLayer* conv72 = convBnSilu(network, weightMap, *conv71->getOutput(0), 64, 3, 1, 1, \"model.72\");\n    IElementWiseLayer* conv73 = convBnSilu(network, weightMap, *conv72->getOutput(0), 64, 3, 1, 1, \"model.73\");\n    ITensor* input_tensor_74[] = { conv73->getOutput(0), conv72->getOutput(0), conv71->getOutput(0), conv70->getOutput(0), conv69->getOutput(0), conv68->getOutput(0) };\n    IConcatenationLayer* concat74 = network->addConcatenation(input_tensor_74, 6);\n    concat74->setAxis(0);\n    IElementWiseLayer* conv75 = convBnSilu(network, weightMap, *concat74->getOutput(0), 128, 1, 1, 0, \"model.75\");\n\n    IPoolingLayer* mp76 = network->addPoolingNd(*conv75->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    mp76->setStrideNd(DimsHW{ 2, 2 });\n    IElementWiseLayer* conv77 = convBnSilu(network, weightMap, *mp76->getOutput(0), 128, 1, 1, 0, \"model.77\");\n    IElementWiseLayer* conv78 = convBnSilu(network, weightMap, *conv75->getOutput(0), 128, 1, 1, 0, \"model.78\");\n    IElementWiseLayer* conv79 = convBnSilu(network, weightMap, 
*conv78->getOutput(0), 128, 3, 2, 1, \"model.79\");\n    ITensor* input_tensor_80[] = { conv79->getOutput(0), conv77->getOutput(0), conv63->getOutput(0) };\n    IConcatenationLayer* concat80 = network->addConcatenation(input_tensor_80, 3);\n    concat80->setAxis(0);\n\n    IElementWiseLayer* conv81 = convBnSilu(network, weightMap, *concat80->getOutput(0), 256, 1, 1, 0, \"model.81\");\n    IElementWiseLayer* conv82 = convBnSilu(network, weightMap, *concat80->getOutput(0), 256, 1, 1, 0, \"model.82\");\n    IElementWiseLayer* conv83 = convBnSilu(network, weightMap, *conv82->getOutput(0), 128, 3, 1, 1, \"model.83\");\n    IElementWiseLayer* conv84 = convBnSilu(network, weightMap, *conv83->getOutput(0), 128, 3, 1, 1, \"model.84\");\n    IElementWiseLayer* conv85 = convBnSilu(network, weightMap, *conv84->getOutput(0), 128, 3, 1, 1, \"model.85\");\n    IElementWiseLayer* conv86 = convBnSilu(network, weightMap, *conv85->getOutput(0), 128, 3, 1, 1, \"model.86\");\n    ITensor* input_tensor_87[] = { conv86->getOutput(0), conv85->getOutput(0), conv84->getOutput(0), conv83->getOutput(0), conv82->getOutput(0), conv81->getOutput(0) };\n    IConcatenationLayer* concat87 = network->addConcatenation(input_tensor_87, 6);\n    concat87->setAxis(0);\n    IElementWiseLayer* conv88 = convBnSilu(network, weightMap, *concat87->getOutput(0), 256, 1, 1, 0, \"model.88\");\n\n    IPoolingLayer* mp89 = network->addPoolingNd(*conv88->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    mp89->setStrideNd(DimsHW{ 2, 2 });\n    IElementWiseLayer* conv90 = convBnSilu(network, weightMap, *mp89->getOutput(0), 256, 1, 1, 0, \"model.90\");\n    IElementWiseLayer* conv91 = convBnSilu(network, weightMap, *conv88->getOutput(0), 256, 1, 1, 0, \"model.91\");\n    IElementWiseLayer* conv92 = convBnSilu(network, weightMap, *conv91->getOutput(0), 256, 3, 2, 1, \"model.92\");\n    ITensor* input_tensor_93[] = { conv92->getOutput(0), conv90->getOutput(0), conv51->getOutput(0) };\n    IConcatenationLayer* 
concat93 = network->addConcatenation(input_tensor_93, 3);\n    concat93->setAxis(0);\n\n    IElementWiseLayer* conv94 = convBnSilu(network, weightMap, *concat93->getOutput(0), 512, 1, 1, 0, \"model.94\");\n    IElementWiseLayer* conv95 = convBnSilu(network, weightMap, *concat93->getOutput(0), 512, 1, 1, 0, \"model.95\");\n    IElementWiseLayer* conv96 = convBnSilu(network, weightMap, *conv95->getOutput(0), 256, 3, 1, 1, \"model.96\");\n    IElementWiseLayer* conv97 = convBnSilu(network, weightMap, *conv96->getOutput(0), 256, 3, 1, 1, \"model.97\");\n    IElementWiseLayer* conv98 = convBnSilu(network, weightMap, *conv97->getOutput(0), 256, 3, 1, 1, \"model.98\");\n    IElementWiseLayer* conv99 = convBnSilu(network, weightMap, *conv98->getOutput(0), 256, 3, 1, 1, \"model.99\");\n    ITensor* input_tensor_100[] = { conv99->getOutput(0), conv98->getOutput(0), conv97->getOutput(0), conv96->getOutput(0), conv95->getOutput(0), conv94->getOutput(0) };\n    IConcatenationLayer* concat100 = network->addConcatenation(input_tensor_100, 6);\n    concat100->setAxis(0);\n    IElementWiseLayer* conv101 = convBnSilu(network, weightMap, *concat100->getOutput(0), 512, 1, 1, 0, \"model.101\");\n\n    IElementWiseLayer* conv102 = RepConv(network, weightMap, *conv75->getOutput(0), 256, 3, 1, \"model.102\");\n    IElementWiseLayer* conv103 = RepConv(network, weightMap, *conv88->getOutput(0), 512, 3, 1, \"model.103\");\n    IElementWiseLayer* conv104 = RepConv(network, weightMap, *conv101->getOutput(0), 1024, 3, 1, \"model.104\");\n\n    /*----------------------------------yolov7 out-----------------------------------------*/\n    IConvolutionLayer* cv105_0 = network->addConvolutionNd(*conv102->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.105.m.0.weight\"], weightMap[\"model.105.m.0.bias\"]);\n    assert(cv105_0);\n    cv105_0->setName(\"cv105.0\");\n    IConvolutionLayer* cv105_1 = network->addConvolutionNd(*conv103->getOutput(0), kNumAnchor * (kNumClass 
+ 5), DimsHW{ 1, 1 }, weightMap[\"model.105.m.1.weight\"], weightMap[\"model.105.m.1.bias\"]);\n    assert(cv105_1);\n    cv105_1->setName(\"cv105.1\");\n    IConvolutionLayer* cv105_2 = network->addConvolutionNd(*conv104->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.105.m.2.weight\"], weightMap[\"model.105.m.2.bias\"]);\n    assert(cv105_2);\n    cv105_2->setName(\"cv105.2\");\n\n    auto yolo = addYoLoLayer(network, weightMap, \"model.105\", std::vector<IConvolutionLayer*>{cv105_0, cv105_1, cv105_2});\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Engine built successfully!\" << std::endl;\n\n    // Don't need the network any more\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nIHostMemory* build_engine_yolov7_tiny(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt, std::string& wts_name) {\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{ 3, kInputH, kInputW });\n    
assert(data);\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n    /* ------ yolov7-tiny backbone------ */\n    // [32, 3, 2, None, 1, nn.LeakyReLU(0.1)] ---> out channels, kernel size, stride, padding, groups\n    auto conv0 = convBlockLeakRelu(network, weightMap, *data, 32, 3, 2, 1, \"model.0\");\n    assert(conv0);\n\n    // [-1, 1, Conv, [64, 3, 2, None, 1, nn.LeakyReLU(0.1)]],  # 1-P2/4\n    auto conv1 = convBlockLeakRelu(network, weightMap, *conv0->getOutput(0), 64, 3, 2, 1, \"model.1\");\n    assert(conv1);\n\n    //  [-1, 1, Conv, [32, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv2 = convBlockLeakRelu(network, weightMap, *conv1->getOutput(0), 32, 1, 1, 0, \"model.2\");\n    assert(conv2);\n\n    // [-2, 1, Conv, [32, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv3 = convBlockLeakRelu(network, weightMap, *conv1->getOutput(0), 32, 1, 1, 0, \"model.3\");\n    assert(conv3);\n\n    // [-1, 1, Conv, [32, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv4 = convBlockLeakRelu(network, weightMap, *conv3->getOutput(0), 32, 3, 1, 1, \"model.4\");\n    assert(conv4);\n\n    // [-1, 1, Conv, [32, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv5 = convBlockLeakRelu(network, weightMap, *conv4->getOutput(0), 32, 3, 1, 1, \"model.5\");\n    assert(conv5);\n\n    ITensor* input_tensor_6[] = { conv5->getOutput(0), conv4->getOutput(0), conv3->getOutput(0), conv2->getOutput(0) };\n    auto cat6 = network->addConcatenation(input_tensor_6, 4);\n    //cat6->setAxis(0);\n\n    // [-1, 1, Conv, [64, 1, 1, None, 1, nn.LeakyReLU(0.1)]],  # 7\n    auto conv7 = convBlockLeakRelu(network, weightMap, *cat6->getOutput(0), 64, 1, 1, 0, \"model.7\");\n    assert(conv7);\n\n    auto* pool8 = network->addPoolingNd(*conv7->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    assert(pool8);\n    pool8->setStrideNd(DimsHW{ 2, 2 });\n\n    //[-1, 1, Conv, [64, 1, 1, None, 1, nn.LeakyReLU(0.1)]] ,\n    auto conv9 = convBlockLeakRelu(network, weightMap, 
*pool8->getOutput(0), 64, 1, 1, 0, \"model.9\");\n    assert(conv9);\n\n    // [-2, 1, Conv, [64, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv10 = convBlockLeakRelu(network, weightMap, *pool8->getOutput(0), 64, 1, 1, 0, \"model.10\");\n    assert(conv10);\n    //[-1, 1, Conv, [64, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv11 = convBlockLeakRelu(network, weightMap, *conv10->getOutput(0), 64, 3, 1, 1, \"model.11\");\n    assert(conv11);\n    //[-1, 1, Conv, [64, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv12 = convBlockLeakRelu(network, weightMap, *conv11->getOutput(0), 64, 3, 1, 1, \"model.12\");\n    assert(conv12);\n\n    ITensor* input_tensor_13[] = { conv12->getOutput(0), conv11->getOutput(0), conv10->getOutput(0), conv9->getOutput(0) };\n    auto cat13 = network->addConcatenation(input_tensor_13, 4);\n    //cat2->setAxis(0);\n    \n    // [-1, 1, Conv, [128, 1, 1, None, 1, nn.LeakyReLU(0.1)]],  # 14\n    auto conv14 = convBlockLeakRelu(network, weightMap, *cat13->getOutput(0), 128, 1, 1, 0, \"model.14\");\n    assert(conv14);\n\n    auto* pool15 = network->addPoolingNd(*conv14->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    assert(pool15);\n    pool15->setStrideNd(DimsHW{ 2, 2 });\n\n    // [-1, 1, Conv, [128, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv16 = convBlockLeakRelu(network, weightMap, *pool15->getOutput(0), 128, 1, 1, 0, \"model.16\");\n    assert(conv16);\n    //[-2, 1, Conv, [128, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv17 = convBlockLeakRelu(network, weightMap, *pool15->getOutput(0), 128, 1, 1, 0, \"model.17\");\n    assert(conv17);\n    //[-1, 1, Conv, [128, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv18 = convBlockLeakRelu(network, weightMap, *conv17->getOutput(0), 128, 3, 1, 1, \"model.18\");\n    assert(conv18);\n    // [-1, 1, Conv, [128, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv19 = convBlockLeakRelu(network, weightMap, *conv18->getOutput(0), 128, 3, 1, 1, \"model.19\");\n    
assert(conv19);\n\n    ITensor* input_tensor_20[] = { conv19->getOutput(0), conv18->getOutput(0), conv17->getOutput(0), conv16->getOutput(0) };\n    auto cat20 = network->addConcatenation(input_tensor_20, 4);\n    //cat20->setAxis(0);\n    //[-1, 1, Conv, [256, 1, 1, None, 1, nn.LeakyReLU(0.1)]],  # 21\n    auto conv21 = convBlockLeakRelu(network, weightMap, *cat20->getOutput(0), 256, 1, 1, 0, \"model.21\");\n    assert(conv21);\n\n    auto* pool22 = network->addPoolingNd(*conv21->getOutput(0), PoolingType::kMAX, DimsHW{ 2, 2 });\n    assert(pool22);\n    pool22->setStrideNd(DimsHW{ 2, 2 });\n\n    // [-1, 1, Conv, [256, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv23 = convBlockLeakRelu(network, weightMap, *pool22->getOutput(0), 256, 1, 1, 0, \"model.23\");\n    assert(conv23);\n\n    // [-2, 1, Conv, [256, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv24 = convBlockLeakRelu(network, weightMap, *pool22->getOutput(0), 256, 1, 1, 0, \"model.24\");\n    assert(conv24);\n\n    // [-1, 1, Conv, [256, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv25 = convBlockLeakRelu(network, weightMap, *conv24->getOutput(0), 256, 3, 1, 1, \"model.25\");\n    assert(conv25);\n\n    // [-1, 1, Conv, [256, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv26 = convBlockLeakRelu(network, weightMap, *conv25->getOutput(0), 256, 3, 1, 1, \"model.26\");\n    assert(conv26);\n\n\n    ITensor* input_tensor_27[] = { conv26->getOutput(0), conv25->getOutput(0), conv24->getOutput(0), conv23->getOutput(0) };\n    auto cat27 = network->addConcatenation(input_tensor_27, 4);\n    //cat27->setAxis(0);\n\n    // [-1, 1, Conv, [512, 1, 1, None, 1, nn.LeakyReLU(0.1)]],  # 28\n    auto conv28 = convBlockLeakRelu(network, weightMap, *cat27->getOutput(0), 512, 1, 1, 0, \"model.28\");\n    assert(conv28);\n\n    /*===============================yolov7-tiny head======================================*/\n\n    // [-1, 1, Conv, [256, 1, 1, None, 1, nn.LeakyReLU(0.1)]]\n    auto conv29 = 
convBlockLeakRelu(network, weightMap, *conv28->getOutput(0), 256, 1, 1, 0, \"model.29\");\n    assert(conv29);\n\n    // [-2, 1, Conv, [256, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv30 = convBlockLeakRelu(network, weightMap, *conv28->getOutput(0), 256, 1, 1, 0, \"model.30\");\n    assert(conv30);\n\n    //[-1, 1, SP, [5]],\n    auto* pool31 = network->addPoolingNd(*conv30->getOutput(0), PoolingType::kMAX, DimsHW{ 5, 5 });\n    assert(pool31);\n    pool31->setStrideNd(DimsHW{ 1, 1 });\n    pool31->setPaddingNd(DimsHW{ 2, 2 });\n    // [-2, 1, SP, [9]],\n    auto* pool32 = network->addPoolingNd(*conv30->getOutput(0), PoolingType::kMAX, DimsHW{ 9, 9 });\n    assert(pool32);\n    pool32->setStrideNd(DimsHW{ 1, 1 });\n    pool32->setPaddingNd(DimsHW{ 4, 4 });\n\n    // [-3, 1, SP, [13]],\n    auto* pool33 = network->addPoolingNd(*conv30->getOutput(0), PoolingType::kMAX, DimsHW{ 13, 13 });\n    assert(pool33);\n    pool33->setStrideNd(DimsHW{ 1, 1 });\n    pool33->setPaddingNd(DimsHW{ 6, 6 });\n\n    ITensor* input_tensor_34[] = { pool33->getOutput(0), pool32->getOutput(0), pool31->getOutput(0), conv30->getOutput(0) };\n    auto cat34 = network->addConcatenation(input_tensor_34, 4);\n    //cat34->setAxis(0);\n\n    // [-1, 1, Conv, [256, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv35 = convBlockLeakRelu(network, weightMap, *cat34->getOutput(0), 256, 1, 1, 0, \"model.35\");\n    assert(conv35);\n\n    ITensor* input_tensor_36[] = { conv35->getOutput(0), conv29->getOutput(0) };\n    auto cat36 = network->addConcatenation(input_tensor_36, 2);\n    //cat36->setAxis(0);\n\n    // [-1, 1, Conv, [256, 1, 1, None, 1, nn.LeakyReLU(0.1)]],  # 37\n    auto conv37 = convBlockLeakRelu(network, weightMap, *cat36->getOutput(0), 256, 1, 1, 0, \"model.37\");\n    assert(conv37);\n\n    // [-1, 1, Conv, [128, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv38 = convBlockLeakRelu(network, weightMap, *conv37->getOutput(0), 128, 1, 1, 0, \"model.38\");\n    
assert(conv38);\n\n    float scale[] = { 1.0, 2.0, 2.0 };\n    IResizeLayer* resize39 = network->addResize(*conv38->getOutput(0));\n    resize39->setResizeMode(ResizeMode::kNEAREST);\n    resize39->setScales(scale, 3);\n\n    //    [21, 1, Conv, [128, 1, 1, None, 1, nn.LeakyReLU(0.1)]], # route backbone P4 ---->conv16\n    auto conv40 = convBlockLeakRelu(network, weightMap, *conv21->getOutput(0), 128, 1, 1, 0, \"model.40\");\n    assert(conv40);\n\n    ITensor* input_tensor_41[] = { conv40->getOutput(0), resize39->getOutput(0) };\n    auto cat41 = network->addConcatenation(input_tensor_41, 2);\n    //cat41->setAxis(0);\n\n    //   [-1, 1, Conv, [64, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv42 = convBlockLeakRelu(network, weightMap, *cat41->getOutput(0), 64, 1, 1, 0, \"model.42\");\n    assert(conv42);\n\n    //[-2, 1, Conv, [64, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv43 = convBlockLeakRelu(network, weightMap, *cat41->getOutput(0), 64, 1, 1, 0, \"model.43\");\n    assert(conv43);\n\n    // [-1, 1, Conv, [64, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv44 = convBlockLeakRelu(network, weightMap, *conv43->getOutput(0), 64, 3, 1, 1, \"model.44\");\n    assert(conv44);\n\n    // [-1, 1, Conv, [64, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv45 = convBlockLeakRelu(network, weightMap, *conv44->getOutput(0), 64, 3, 1, 1, \"model.45\");\n    assert(conv45);\n\n    ITensor* input_tensor_46[] = { conv45->getOutput(0), conv44->getOutput(0), conv43->getOutput(0), conv42->getOutput(0) };\n    auto cat46 = network->addConcatenation(input_tensor_46, 4);\n    //cat46->setAxis(0);\n\n    //  [-1, 1, Conv, [128, 1, 1, None, 1, nn.LeakyReLU(0.1)]],  # 47\n    auto conv47 = convBlockLeakRelu(network, weightMap, *cat46->getOutput(0), 128, 1, 1, 0, \"model.47\");\n    assert(conv47);\n\n    //    [-1, 1, Conv, [64, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv48 = convBlockLeakRelu(network, weightMap, *conv47->getOutput(0), 64, 1, 1, 0, 
\"model.48\");\n    assert(conv48);\n\n    IResizeLayer* resize49 = network->addResize(*conv48->getOutput(0));\n    resize49->setResizeMode(ResizeMode::kNEAREST);\n    resize49->setScales(scale, 3);\n\n    // [14, 1, Conv, [64, 1, 1, None, 1, nn.LeakyReLU(0.1)]], # route backbone P3 conv11\n    auto conv50 = convBlockLeakRelu(network, weightMap, *conv14->getOutput(0), 64, 1, 1, 0, \"model.50\");\n    assert(conv50);\n\n    ITensor* input_tensor_51[] = { conv50->getOutput(0), resize49->getOutput(0) };\n    IConcatenationLayer* cat51 = network->addConcatenation(input_tensor_51, 2);\n    //cat51->setAxis(0);\n\n    // [-1, 1, Conv, [32, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv52 = convBlockLeakRelu(network, weightMap, *cat51->getOutput(0), 32, 1, 1, 0, \"model.52\");\n    assert(conv52);\n    // [-2, 1, Conv, [32, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv53 = convBlockLeakRelu(network, weightMap, *cat51->getOutput(0), 32, 1, 1, 0, \"model.53\");\n    assert(conv53);\n\n    // [-1, 1, Conv, [32, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv54 = convBlockLeakRelu(network, weightMap, *conv53->getOutput(0), 32, 3, 1, 1, \"model.54\");\n    assert(conv54);\n    // [-1, 1, Conv, [32, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv55 = convBlockLeakRelu(network, weightMap, *conv54->getOutput(0), 32, 3, 1, 1, \"model.55\");\n    assert(conv55);\n\n    ITensor* input_tensor_56[] = { conv55->getOutput(0), conv54->getOutput(0), conv53->getOutput(0),conv52->getOutput(0) };\n    IConcatenationLayer* cat56 = network->addConcatenation(input_tensor_56, 4);\n    //cat56->setAxis(0);\n\n    // [-1, 1, Conv, [64, 1, 1, None, 1, nn.LeakyReLU(0.1)]],  # 57\n    auto conv57 = convBlockLeakRelu(network, weightMap, *cat56->getOutput(0), 64, 1, 1, 0, \"model.57\");\n    assert(conv57);\n\n    // [-1, 1, Conv, [128, 3, 2, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv58 = convBlockLeakRelu(network, weightMap, *conv57->getOutput(0), 128, 3, 2, 1, \"model.58\");\n    
assert(conv58);\n\n    // conv32 [[-1, 47], 1, Concat, [1]],\n    ITensor* input_tensor_59[] = { conv58->getOutput(0), conv47->getOutput(0) };\n    IConcatenationLayer* cat59 = network->addConcatenation(input_tensor_59, 2);\n    //cat59->setAxis(0);\n\n    // [-1, 1, Conv, [64, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv60 = convBlockLeakRelu(network, weightMap, *cat59->getOutput(0), 64, 1, 1, 0, \"model.60\");\n    assert(conv60);\n    // [-2, 1, Conv, [64, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv61 = convBlockLeakRelu(network, weightMap, *cat59->getOutput(0), 64, 1, 1, 0, \"model.61\");\n    assert(conv61);\n\n    // [-1, 1, Conv, [64, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv62 = convBlockLeakRelu(network, weightMap, *conv61->getOutput(0), 64, 3, 1, 1, \"model.62\");\n    assert(conv62);\n    // [-1, 1, Conv, [64, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv63 = convBlockLeakRelu(network, weightMap, *conv62->getOutput(0), 64, 3, 1, 1, \"model.63\");\n    assert(conv63);\n\n    ITensor* input_tensor_64[] = { conv63->getOutput(0), conv62->getOutput(0), conv61->getOutput(0), conv60->getOutput(0) };\n    IConcatenationLayer* cat64 = network->addConcatenation(input_tensor_64, 4);\n    //cat64->setAxis(0);\n\n    // [-1, 1, Conv, [128, 1, 1, None, 1, nn.LeakyReLU(0.1)]] , # 65\n    auto conv65 = convBlockLeakRelu(network, weightMap, *cat64->getOutput(0), 128, 1, 1, 0, \"model.65\");\n    assert(conv65);\n\n    // [-1, 1, Conv, [256, 3, 2, None, 1, nn.LeakyReLU(0.1)]] ,\n    auto conv66 = convBlockLeakRelu(network, weightMap, *conv65->getOutput(0), 256, 3, 2, 1, \"model.66\");\n    assert(conv66);\n\n    ITensor* input_tensor_67[] = { conv66->getOutput(0), conv37->getOutput(0) };\n    IConcatenationLayer* cat67 = network->addConcatenation(input_tensor_67, 2);\n    //cat67->setAxis(0);\n\n    // [-1, 1, Conv, [128, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv68 = convBlockLeakRelu(network, weightMap, *cat67->getOutput(0), 128, 1, 1, 
0, \"model.68\");\n    assert(conv68);\n    // [-2, 1, Conv, [128, 1, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv69 = convBlockLeakRelu(network, weightMap, *cat67->getOutput(0), 128, 1, 1, 0, \"model.69\");\n    assert(conv69);\n\n    // [-1, 1, Conv, [128, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv70 = convBlockLeakRelu(network, weightMap, *conv69->getOutput(0), 128, 3, 1, 1, \"model.70\");\n    assert(conv70);\n\n    // [-1, 1, Conv, [128, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv71 = convBlockLeakRelu(network, weightMap, *conv70->getOutput(0), 128, 3, 1, 1, \"model.71\");\n    assert(conv71);\n\n    ITensor* input_tensor_72[] = { conv71->getOutput(0), conv70->getOutput(0), conv69->getOutput(0), conv68->getOutput(0) };\n    IConcatenationLayer* cat72 = network->addConcatenation(input_tensor_72, 4);\n    //cat72->setAxis(0);\n\n    // [-1, 1, Conv, [256, 1, 1, None, 1, nn.LeakyReLU(0.1)]],  # 73\n    auto conv73 = convBlockLeakRelu(network, weightMap, *cat72->getOutput(0), 256, 1, 1, 0, \"model.73\");\n    assert(conv73);\n\n\n    // [57, 1, Conv, [128, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv74 = convBlockLeakRelu(network, weightMap, *conv57->getOutput(0), 128, 3, 1, 1, \"model.74\");\n    assert(conv74);\n    // [65, 1, Conv, [256, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv75 = convBlockLeakRelu(network, weightMap, *conv65->getOutput(0), 256, 3, 1, 1, \"model.75\");\n    assert(conv75);\n    // [73, 1, Conv, [512, 3, 1, None, 1, nn.LeakyReLU(0.1)]],\n    auto conv76 = convBlockLeakRelu(network, weightMap, *conv73->getOutput(0), 512, 3, 1, 1, \"model.76\");\n    assert(conv76);\n\n    /* ------ detect ------ */\n    IConvolutionLayer* det0 = network->addConvolutionNd(*conv74->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.77.m.0.weight\"], weightMap[\"model.77.m.0.bias\"]);\n   \n    IConvolutionLayer* det1 = network->addConvolutionNd(*conv75->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 
1, 1 }, weightMap[\"model.77.m.1.weight\"], weightMap[\"model.77.m.1.bias\"]);\n\n    IConvolutionLayer* det2 = network->addConvolutionNd(*conv76->getOutput(0), kNumAnchor * (kNumClass + 5), DimsHW{ 1, 1 }, weightMap[\"model.77.m.2.weight\"], weightMap[\"model.77.m.2.bias\"]);\n    // Check all three detect convolutions before handing them to the YOLO layer\n    assert(det0 && det1 && det2);\n\n    auto yolo = addYoLoLayer(network, weightMap, \"model.77\", std::vector<IConvolutionLayer*>{det0, det1, det2});\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n    // Build engine\n    builder->setMaxBatchSize(maxBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));  // 16MB\n#if defined(USE_FP16)\n    config->setFlag(BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, \"./coco_calib/\", \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Engine built successfully!\" << std::endl;\n\n    // Don't need the network any more\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\n"
  },
  {
    "path": "yolov7/src/postprocess.cpp",
    "content": "#include \"postprocess.h\"\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n  float l, r, t, b;\n  float r_w = kInputW / (img.cols * 1.0);\n  float r_h = kInputH / (img.rows * 1.0);\n  if (r_h > r_w) {\n    l = bbox[0] - bbox[2] / 2.f;\n    r = bbox[0] + bbox[2] / 2.f;\n    t = bbox[1] - bbox[3] / 2.f - (kInputH - r_w * img.rows) / 2;\n    b = bbox[1] + bbox[3] / 2.f - (kInputH - r_w * img.rows) / 2;\n    l = l / r_w;\n    r = r / r_w;\n    t = t / r_w;\n    b = b / r_w;\n  } else {\n    l = bbox[0] - bbox[2] / 2.f - (kInputW - r_h * img.cols) / 2;\n    r = bbox[0] + bbox[2] / 2.f - (kInputW - r_h * img.cols) / 2;\n    t = bbox[1] - bbox[3] / 2.f;\n    b = bbox[1] + bbox[3] / 2.f;\n    l = l / r_h;\n    r = r / r_h;\n    t = t / r_h;\n    b = b / r_h;\n  }\n  return cv::Rect(round(l), round(t), round(r - l), round(b - t));\n}\n\nstatic float iou(float lbox[4], float rbox[4]) {\n  float interBox[] = {\n    (std::max)(lbox[0] - lbox[2] / 2.f , rbox[0] - rbox[2] / 2.f), //left\n    (std::min)(lbox[0] + lbox[2] / 2.f , rbox[0] + rbox[2] / 2.f), //right\n    (std::max)(lbox[1] - lbox[3] / 2.f , rbox[1] - rbox[3] / 2.f), //top\n    (std::min)(lbox[1] + lbox[3] / 2.f , rbox[1] + rbox[3] / 2.f), //bottom\n  };\n\n  if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n    return 0.0f;\n\n  float interBoxS = (interBox[1] - interBox[0])*(interBox[3] - interBox[2]);\n  return interBoxS / (lbox[2] * lbox[3] + rbox[2] * rbox[3] - interBoxS);\n}\n\nstatic bool cmp(const Detection& a, const Detection& b) {\n  return a.conf > b.conf;\n}\n\nvoid nms(std::vector<Detection>& res, float *output, float conf_thresh, float nms_thresh) {\n  int det_size = sizeof(Detection) / sizeof(float);\n  std::map<float, std::vector<Detection>> m;\n  for (int i = 0; i < output[0] && i < kMaxNumOutputBbox; i++) {\n    if (output[1 + det_size * i + 4] <= conf_thresh) continue;\n    Detection det;\n    memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n    if 
(m.count(det.class_id) == 0) m.emplace(det.class_id, std::vector<Detection>());\n    m[det.class_id].push_back(det);\n  }\n  // Per-class greedy NMS: keep the highest-confidence box, then drop lower-ranked boxes that overlap it too much\n  for (auto it = m.begin(); it != m.end(); it++) {\n    auto& dets = it->second;\n    std::sort(dets.begin(), dets.end(), cmp);\n    for (size_t i = 0; i < dets.size(); ++i) {\n      auto& item = dets[i];\n      res.push_back(item);\n      for (size_t n = i + 1; n < dets.size(); ++n) {\n        if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n          dets.erase(dets.begin() + n);\n          --n;  // re-check the element that shifted into slot n\n        }\n      }\n    }\n  }\n}\n\nvoid batch_nms(std::vector<std::vector<Detection>>& res_batch, float *output, int batch_size, int output_size, float conf_thresh, float nms_thresh) {\n  res_batch.resize(batch_size);\n  for (int i = 0; i < batch_size; i++) {\n    nms(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n  }\n}\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n  for (size_t i = 0; i < img_batch.size(); i++) {\n    auto& res = res_batch[i];\n    cv::Mat img = img_batch[i];\n    for (size_t j = 0; j < res.size(); j++) {\n      cv::Rect r = get_rect(img, res[j].bbox);\n      cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n      cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n    }\n  }\n}\n\n"
  },
  {
    "path": "yolov7/src/preprocess.cu",
    "content": "#include \"preprocess.h\"\n#include \"cuda_utils.h\"\n\nstatic uint8_t* img_buffer_host = nullptr;\nstatic uint8_t* img_buffer_device = nullptr;\n\nstruct AffineMatrix{\n  float value[6];\n};\n\n__global__ void warpaffine_kernel(\n    uint8_t* src, int src_line_size, int src_width,\n    int src_height, float* dst, int dst_width,\n    int dst_height, uint8_t const_value_st,\n    AffineMatrix d2s, int edge) {\n  int position = blockDim.x * blockIdx.x + threadIdx.x;\n  if (position >= edge) return;\n\n  float m_x1 = d2s.value[0];\n  float m_y1 = d2s.value[1];\n  float m_z1 = d2s.value[2];\n  float m_x2 = d2s.value[3];\n  float m_y2 = d2s.value[4];\n  float m_z2 = d2s.value[5];\n\n  int dx = position % dst_width;\n  int dy = position / dst_width;\n  float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;\n  float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;\n  float c0, c1, c2;\n\n  if (src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height) {\n    // out of range\n    c0 = const_value_st;\n    c1 = const_value_st;\n    c2 = const_value_st;\n  } else {\n    int y_low = floorf(src_y);\n    int x_low = floorf(src_x);\n    int y_high = y_low + 1;\n    int x_high = x_low + 1;\n\n    uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};\n    float ly = src_y - y_low;\n    float lx = src_x - x_low;\n    float hy = 1 - ly;\n    float hx = 1 - lx;\n    float w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n    uint8_t* v1 = const_value;\n    uint8_t* v2 = const_value;\n    uint8_t* v3 = const_value;\n    uint8_t* v4 = const_value;\n\n    if (y_low >= 0) {\n      if (x_low >= 0)\n        v1 = src + y_low * src_line_size + x_low * 3;\n\n      if (x_high < src_width)\n        v2 = src + y_low * src_line_size + x_high * 3;\n    }\n\n    if (y_high < src_height) {\n      if (x_low >= 0)\n        v3 = src + y_high * src_line_size + x_low * 3;\n\n      if (x_high < src_width)\n        v4 = src + y_high * src_line_size + x_high * 
3;\n    }\n\n    c0 = w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0];\n    c1 = w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1];\n    c2 = w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2];\n  }\n\n  // bgr to rgb \n  float t = c2;\n  c2 = c0;\n  c0 = t;\n\n  // normalization\n  c0 = c0 / 255.0f;\n  c1 = c1 / 255.0f;\n  c2 = c2 / 255.0f;\n\n  // rgbrgbrgb to rrrgggbbb\n  int area = dst_width * dst_height;\n  float* pdst_c0 = dst + dy * dst_width + dx;\n  float* pdst_c1 = pdst_c0 + area;\n  float* pdst_c2 = pdst_c1 + area;\n  *pdst_c0 = c0;\n  *pdst_c1 = c1;\n  *pdst_c2 = c2;\n}\n\nvoid cuda_preprocess(\n    uint8_t* src, int src_width, int src_height,\n    float* dst, int dst_width, int dst_height,\n    cudaStream_t stream) {\n  int img_size = src_width * src_height * 3;\n  // copy data to pinned memory\n  memcpy(img_buffer_host, src, img_size);\n  // copy data to device memory\n  CUDA_CHECK(cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size, cudaMemcpyHostToDevice, stream));\n\n  AffineMatrix s2d, d2s;\n  float scale = std::min(dst_height / (float)src_height, dst_width / (float)src_width);\n\n  s2d.value[0] = scale;\n  s2d.value[1] = 0;\n  s2d.value[2] = -scale * src_width * 0.5 + dst_width * 0.5;\n  s2d.value[3] = 0;\n  s2d.value[4] = scale;\n  s2d.value[5] = -scale * src_height * 0.5 + dst_height * 0.5;\n  cv::Mat m2x3_s2d(2, 3, CV_32F, s2d.value);\n  cv::Mat m2x3_d2s(2, 3, CV_32F, d2s.value);\n  cv::invertAffineTransform(m2x3_s2d, m2x3_d2s);\n\n  memcpy(d2s.value, m2x3_d2s.ptr<float>(0), sizeof(d2s.value));\n\n  int jobs = dst_height * dst_width;\n  int threads = 256;\n  int blocks = ceil(jobs / (float)threads);\n  warpaffine_kernel<<<blocks, threads, 0, stream>>>(\n      img_buffer_device, src_width * 3, src_width,\n      src_height, dst, dst_width,\n      dst_height, 128, d2s, jobs);\n}\n\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch,\n                           float* dst, int dst_width, int dst_height,\n                        
   cudaStream_t stream) {\n  int dst_size = dst_width * dst_height * 3;\n  for (size_t i = 0; i < img_batch.size(); i++) {\n    cuda_preprocess(img_batch[i].ptr(), img_batch[i].cols, img_batch[i].rows, &dst[dst_size * i], dst_width, dst_height, stream);\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n  }\n}\n\nvoid cuda_preprocess_init(int max_image_size) {\n  // prepare input data in pinned memory\n  CUDA_CHECK(cudaMallocHost((void**)&img_buffer_host, max_image_size * 3));\n  // prepare input data in device memory\n  CUDA_CHECK(cudaMalloc((void**)&img_buffer_device, max_image_size * 3));\n}\n\nvoid cuda_preprocess_destroy() {\n  CUDA_CHECK(cudaFree(img_buffer_device));\n  CUDA_CHECK(cudaFreeHost(img_buffer_host));\n}\n\n"
  },
  {
    "path": "yolov7/yolov7_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov7 project.\n    param: \n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n        line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n        
)\n\n\nclass YoLov7TRT(object):\n    \"\"\"\n    description: A YOLOv7 class that wraps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = 
cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        engine = self.engine\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size 
= 1\n        output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid = self.post_process(\n                output[i * 6001: (i + 1) * 6001], batch_origin_h[i], batch_origin_w[i]\n            )\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        \n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n        \n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            input_image_path: str, image path\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # 
Calculate width and height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image with long side while maintaining ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A numpy array of boxes, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A numpy array of boxes, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 
2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1] - x[:, 3] / 2\n            y[:, 3] = x[:, 1] + x[:, 3] / 2\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A numpy array like [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...] \n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a numpy array, each element is the score corresponding to a box\n            result_classid: final class ids, a numpy array, each element is the class id corresponding to a box\n        \"\"\"\n        # Get the num of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, 6))[:num, :]\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))            \n            
x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None) * \\\n                     np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n     
       nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Get the boxes whose score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # clip the coordinates\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w -1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w -1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h -1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h -1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, -1] == boxes[:, -1]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov7_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov7_wrapper = yolov7_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov7_wrapper.infer(self.yolov7_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n      
      cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov7_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov7_wrapper = yolov7_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov7_wrapper.infer(self.yolov7_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\n    engine_file_path = \"yolov7.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n            \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n            \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n            \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n            \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n            \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n            \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n            \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", 
\"teddy bear\",\n            \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov7TRT instance\n    yolov7_wrapper = YoLov7TRT(engine_file_path)\n    try:\n        print('batch size is', yolov7_wrapper.batch_size)\n        \n        image_dir = \"samples/\"\n        image_path_batches = get_img_path_batches(yolov7_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov7_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov7_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov7_wrapper.destroy()\n"
  },
  {
    "path": "yolov8/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(yolov8)\n\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\n\nset(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)\nenable_language(CUDA)\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include)\ninclude_directories(${PROJECT_SOURCE_DIR}/plugin)\n\n# include and link dirs of cuda and tensorrt, you need adapt them if yours are different\nif (CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n  message(\"embed_platform on\")\n  include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n  link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nelse()\n  message(\"embed_platform off\")\n  # cuda\n  include_directories(/usr/local/cuda/include)\n  link_directories(/usr/local/cuda/lib64)\n\n  # tensorrt\n  include_directories(/home/lindsay/TensorRT-8.6.1.6/include)\n  link_directories(/home/lindsay/TensorRT-8.6.1.6/lib)\n  #  include_directories(/home/lindsay/TensorRT-7.2.3.4/include)\n  #  link_directories(/home/lindsay/TensorRT-7.2.3.4/lib)\n\n\nendif()\n\nadd_library(myplugins SHARED ${PROJECT_SOURCE_DIR}/plugin/yololayer.cu)\ntarget_link_libraries(myplugins nvinfer cudart)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\n\nfile(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/src/*.cpp ${PROJECT_SOURCE_DIR}/src/*.cu)\nadd_executable(yolov8_det ${PROJECT_SOURCE_DIR}/yolov8_det.cpp ${SRCS})\n\ntarget_link_libraries(yolov8_det nvinfer)\ntarget_link_libraries(yolov8_det cudart)\ntarget_link_libraries(yolov8_det myplugins)\ntarget_link_libraries(yolov8_det ${OpenCV_LIBS})\n\nadd_executable(yolov8_seg ${PROJECT_SOURCE_DIR}/yolov8_seg.cpp ${SRCS})\ntarget_link_libraries(yolov8_seg nvinfer cudart myplugins ${OpenCV_LIBS})\n\n\nadd_executable(yolov8_pose ${PROJECT_SOURCE_DIR}/yolov8_pose.cpp ${SRCS})\ntarget_link_libraries(yolov8_pose nvinfer cudart myplugins ${OpenCV_LIBS})\n\nadd_executable(yolov8_cls 
${PROJECT_SOURCE_DIR}/yolov8_cls.cpp ${SRCS})\ntarget_link_libraries(yolov8_cls nvinfer cudart myplugins ${OpenCV_LIBS})\n\nadd_executable(yolov8_5u_det ${PROJECT_SOURCE_DIR}/yolov8_5u_det.cpp ${SRCS})\ntarget_link_libraries(yolov8_5u_det nvinfer cudart myplugins ${OpenCV_LIBS})\n\nadd_executable(yolov8_obb ${PROJECT_SOURCE_DIR}/yolov8_obb.cpp ${SRCS})\ntarget_link_libraries(yolov8_obb nvinfer cudart myplugins ${OpenCV_LIBS})\n"
  },
  {
    "path": "yolov8/README.md",
    "content": "# YOLOv8\n\nThe Pytorch implementation is [ultralytics/yolov8](https://github.com/ultralytics/ultralytics/tree/main/ultralytics).\n\nThe tensorrt code is derived from [xiaocao-tian/yolov8_tensorrt](https://github.com/xiaocao-tian/yolov8_tensorrt)\n\n## Contributors\n\n<a href=\"https://github.com/xiaocao-tian\"><img src=\"https://avatars.githubusercontent.com/u/65889782?v=4?s=48\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/lindsayshuo\"><img src=\"https://avatars.githubusercontent.com/u/45239466?v=4?s=48\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/xinsuinizhuan\"><img src=\"https://avatars.githubusercontent.com/u/40679769?v=4?s=48\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/Rex-LK\"><img src=\"https://avatars.githubusercontent.com/u/74702576?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/emptysoal\"><img src=\"https://avatars.githubusercontent.com/u/57931586?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n<a href=\"https://github.com/ChangjunDAI\"><img src=\"https://avatars.githubusercontent.com/u/65420228?s=48&v=4\" width=\"40px;\" alt=\"\"/></a>\n\n## Requirements\n\n- TensorRT 8.0+\n- OpenCV 3.4.0+\n- ultralytics<=8.2.103\n\n## Different versions of yolov8\n\nCurrently, we support yolov8\n\n- For yolov8 , download .pt from [https://github.com/ultralytics/assets/releases](https://github.com/ultralytics/assets/releases), then follow how-to-run in current page.\n\n## Config\n\n- Choose the model n/s/m/l/x/n2/s2/m2/l2/x2/n6/s6/m6/l6/x6 from command line arguments.\n- Check more configs in [include/config.h](./include/config.h)\n\n## How to Run, yolov8n as example\n\n1. 
generate .wts from pytorch with .pt, or download .wts from model zoo\n\n```\n// download https://github.com/ultralytics/assets/releases/yolov8n.pt\n// download https://github.com/lindsayshuo/yolov8-p2/releases/download/VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last.pt (only for 10 cls p2 model)\ncp {tensorrtx}/yolov8/gen_wts.py {ultralytics}/ultralytics\ncd {ultralytics}/ultralytics\npython gen_wts.py -w yolov8n.pt -o yolov8n.wts -t detect\n// a file 'yolov8n.wts' will be generated.\n\n\n// For p2 model\n// download https://github.com/lindsayshuo/yolov8_p2_tensorrtx/releases/download/VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last/VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last.pt (only for 10 cls p2 model)\ncd {ultralytics}/ultralytics\npython gen_wts.py -w VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last.pt -o VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last.wts -t detect (only for  10 cls p2 model)\n// a file 'VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last.wts' will be generated.\n\n// For yolov8_5u_det model\n// download https://github.com/ultralytics/assets/releases/yolov5nu.pt\ncd {ultralytics}/ultralytics\npython gen_wts.py -w yolov5nu.pt -o yolov5nu.wts -t detect\n// a file 'yolov5nu.wts' will be generated.\n\n```\n\n2. 
build tensorrtx/yolov8 and run\n\n### Detection\n```\ncd {tensorrtx}/yolov8/\nmkdir build\ncd build\ncp {ultralytics}/ultralytics/yolov8n.wts {tensorrtx}/yolov8/build\ncmake ..\nmake\nsudo ./yolov8_det -s [.wts] [.engine] [n/s/m/l/x/n2/s2/m2/l2/x2/n6/s6/m6/l6/x6]  // serialize model to plan file\nsudo ./yolov8_det -d [.engine] [image folder]  [c/g] // deserialize and run inference, the images in [image folder] will be processed.\n\n// For example yolov8n\nsudo ./yolov8_det -s yolov8n.wts yolov8n.engine n\nsudo ./yolov8_det -d yolov8n.engine ../images c //cpu postprocess\nsudo ./yolov8_det -d yolov8n.engine ../images g //gpu postprocess\n\n\n// For p2 model:\n// change the \"const static int kNumClass\" in config.h to 10;\nsudo ./yolov8_det -s VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last.wts VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last.engine x2\nwget https://github.com/lindsayshuo/yolov8-p2/releases/download/VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last/0000008_01999_d_0000040.jpg\ncp -r 0000008_01999_d_0000040.jpg ../images\nsudo ./yolov8_det -d VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last.engine ../images c //cpu postprocess\nsudo ./yolov8_det -d VisDrone_train_yolov8x_p2_bs1_epochs_100_imgsz_1280_last.engine ../images g //gpu postprocess\n\n// For yolov8_5u_det(YOLOv5u with the anchor-free, objectness-free split head structure based on YOLOv8 features) model:\nsudo ./yolov8_5u_det -s [.wts] [.engine] [n/s/m/l/x/n6/s6/m6/l6/x6]\nsudo ./yolov8_5u_det -d yolov5xu.engine ../images c //cpu postprocess\nsudo ./yolov8_5u_det -d yolov5xu.engine ../images g //gpu postprocess\n```\n\n### Instance Segmentation\n```\n# Build and serialize TensorRT engine\n./yolov8_seg -s yolov8s-seg.wts yolov8s-seg.engine s\n\n# Download the labels file\nwget -O coco.txt https://raw.githubusercontent.com/amikelive/coco-labels/master/coco-labels-2014_2017.txt\n\n# Run inference with labels file\n./yolov8_seg -d yolov8s-seg.engine ../images c 
coco.txt\n```\n\n### Classification\n```\ncd {tensorrtx}/yolov8/\n// Download inference images\nwget  https://github.com/lindsayshuo/infer_pic/releases/download/pics/1709970363.6990473rescls.jpg\nmkdir samples\ncp -r  1709970363.6990473rescls.jpg samples\n// Download ImageNet labels\nwget https://github.com/joannzhang00/ImageNet-dataset-classes-labels/blob/main/imagenet_classes.txt\n\n// update kClsNumClass in config.h if your model is trained on custom dataset\nmkdir build\ncd build\ncp {ultralytics}/ultralytics/yolov8n-cls.wts {tensorrtx}/yolov8/build\ncmake ..\nmake\nsudo ./yolov8_cls -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to plan file\nsudo ./yolov8_cls -d [.engine] [image folder]  // deserialize and run inference, the images in [image folder] will be processed.\n\n// For example yolov8n\nsudo ./yolov8_cls -s yolov8n-cls.wts yolov8n-cls.engine n\nsudo ./yolov8_cls -d yolov8n-cls.engine ../samples\n```\n\n\n### Pose Estimation\n```\ncd {tensorrtx}/yolov8/\n// update \"kPoseNumClass = 1\" in config.h\nmkdir build\ncd build\ncp {ultralytics}/ultralytics/yolov8n-pose.wts {tensorrtx}/yolov8/build\ncmake ..\nmake\nsudo ./yolov8_pose -s [.wts] [.engine] [n/s/m/l/x/n2/s2/m2/l2/x2/n6/s6/m6/l6/x6]  // serialize model to plan file\nsudo ./yolov8_pose -d [.engine] [image folder]  [c/g] // deserialize and run inference, the images in [image folder] will be processed.\n\n// For example yolov8n-pose\nsudo ./yolov8_pose -s yolov8n-pose.wts yolov8n-pose.engine n\nsudo ./yolov8_pose -d yolov8n-pose.engine ../images c //cpu postprocess\nsudo ./yolov8_pose -d yolov8n-pose.engine ../images g //gpu postprocess\n```\n\n\n### Oriented Bounding Boxes (OBB) Estimation\n```\ncd {tensorrtx}/yolov8/\n// update \"kObbNumClass = 15\" \"kInputH = 1024\" \"kInputW = 1024\" in config.h\nwget https://github.com/lindsayshuo/infer_pic/releases/download/pics/obb.png\nmkdir images\nmv obb.png ./images\nmkdir build\ncd build\ncp {ultralytics}/ultralytics/yolov8n-obb.wts 
{tensorrtx}/yolov8/build\ncmake ..\nmake\nsudo ./yolov8_obb -s [.wts] [.engine] [n/s/m/l/x/n2/s2/m2/l2/x2/n6/s6/m6/l6/x6]  // serialize model to plan file\nsudo ./yolov8_obb -d [.engine] [image folder]  [c/g] // deserialize and run inference, the images in [image folder] will be processed.\n\n// For example yolov8n-obb\nsudo ./yolov8_obb -s yolov8n-obb.wts yolov8n-obb.engine n\nsudo ./yolov8_obb -d yolov8n-obb.engine ../images c //cpu postprocess\nsudo ./yolov8_obb -d yolov8n-obb.engine ../images g //gpu postprocess\n```\n\n\n3. optional, load and run the tensorrt model in python\n\n```\n// install python-tensorrt, pycuda, etc.\n// ensure the yolov8n.engine and libmyplugins.so have been built\npython yolov8_det_trt.py  # Detection\npython yolov8_seg_trt.py  # Segmentation\npython yolov8_cls_trt.py  # Classification\npython yolov8_pose_trt.py  # Pose Estimation\npython yolov8_5u_det_trt.py  # yolov8_5u_det(YOLOv5u with the anchor-free, objectness-free split head structure based on YOLOv8 features) model\npython yolov8_obb_trt.py  # Oriented Bounding Boxes (OBB) Estimation\n```\n\n## INT8 Quantization\n\n1. Prepare calibration images; you can randomly select about 1000 images from your training set. For coco, you can also download my calibration images `coco_calib` from [GoogleDrive](https://drive.google.com/drive/folders/1s7jE9DtOngZMzJC1uL307J2MiaGwdRSI?usp=sharing) or [BaiduPan](https://pan.baidu.com/s/1GOm_-JobpyLMAqZWCDUhKg) pwd: a9wh\n\n2. unzip it in yolov8/build\n\n3. set the macro `USE_INT8` in config.h, change `kInputQuantizationFolder` to your image folder path, and run make\n\n4. serialize the model and test\n\n<p align=\"center\">\n<img src=\"https://user-images.githubusercontent.com/15235574/78247927-4d9fac00-751e-11ea-8b1b-704a0aeb3fcf.jpg\" height=\"360px;\">\n</p>\n\n## More Information\n\nSee the readme in [home page.](https://github.com/wang-xinyu/tensorrtx)\n"
  },
  {
    "path": "yolov8/gen_wts.py",
    "content": "import sys  # noqa: F401\nimport argparse\nimport os\nimport struct\nimport torch\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\n    parser.add_argument('-w', '--weights', required=True,\n                        help='Input weights (.pt) file path (required)')\n    parser.add_argument(\n        '-o', '--output', help='Output (.wts) file path (optional)')\n    parser.add_argument(\n        '-t', '--type', type=str, default='detect', choices=['detect', 'cls', 'seg', 'pose', 'obb'],\n        help='determines the model is detection/classification')\n    args = parser.parse_args()\n    if not os.path.isfile(args.weights):\n        raise SystemExit('Invalid input file')\n    if not args.output:\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\n    elif os.path.isdir(args.output):\n        args.output = os.path.join(\n            args.output,\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\n    return args.weights, args.output, args.type\n\n\npt_file, wts_file, m_type = parse_args()\n\nprint(f'Generating .wts for {m_type} model')\n\n# Load model\nprint(f'Loading {pt_file}')\n\n# Initialize\ndevice = 'cpu'\n\n# Load model\nmodel = torch.load(pt_file, map_location=device, weights_only=False)  # Load FP32 weights\nmodel = model['ema' if model.get('ema') else 'model'].float()\n\nif m_type in ['detect', 'seg', 'pose', 'obb']:\n    anchor_grid = model.model[-1].anchors * model.model[-1].stride[..., None, None]\n\n    delattr(model.model[-1], 'anchors')\n\nmodel.to(device).eval()\n\nwith open(wts_file, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\n"
  },
  {
    "path": "yolov8/include/block.h",
    "content": "#pragma once\n#include <map>\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n\nint calculateP(int ksize);\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file);\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, int k, int s, int p, std::string lname);\n\nnvinfer1::IElementWiseLayer* C2F(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                 int c2, int n, bool shortcut, float e, std::string lname);\n\nnvinfer1::IElementWiseLayer* C2(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                int c2, int n, bool shortcut, float e, std::string lname);\n\nnvinfer1::IElementWiseLayer* C3(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                int c2, int n, bool shortcut, float e, std::string lname);\n\nnvinfer1::IElementWiseLayer* SPPF(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int k, std::string lname);\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname);\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       
std::vector<nvinfer1::IConcatenationLayer*> dets, const int* px_arry,\n                                       int px_arry_num, int num_class, bool is_segmentation, bool is_pose, bool is_obb);\n"
  },
  {
    "path": "yolov8/include/calibrator.h",
    "content": "#ifndef ENTROPY_CALIBRATOR_H\n#define ENTROPY_CALIBRATOR_H\n\n#include <NvInfer.h>\n#include <string>\n#include <vector>\n#include \"macros.h\"\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2 {\n   public:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name,\n                           const char* input_blob_name, bool read_cache = true);\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\n   private:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\n#endif  // ENTROPY_CALIBRATOR_H\n"
  },
  {
    "path": "yolov8/include/config.h",
    "content": "#define USE_FP16\n//#define USE_FP32\n//#define USE_INT8\n\nconst static char* kInputTensorName = \"images\";\nconst static char* kOutputTensorName = \"output\";\nconst static int kNumClass = 80;\nconst static int kBatchSize = 1;\nconst static int kGpuId = 0;\nconst static int kInputH = 640;\nconst static int kInputW = 640;\nconst static float kNmsThresh = 0.45f;\nconst static float kConfThresh = 0.5f;\nconst static float kConfThreshKeypoints = 0.5f;  // keypoints confidence\nconst static int kMaxInputImageSize = 3000 * 3000;\nconst static int kMaxNumOutputBbox = 1000;\n//Quantization input image folder path\nconst static char* kInputQuantizationFolder = \"./coco_calib\";\n\n// Classfication model's number of classes\nconstexpr static int kClsNumClass = 1000;\n// Classfication model's input shape\nconstexpr static int kClsInputH = 224;\nconstexpr static int kClsInputW = 224;\n\n// pose model's number of classes\nconstexpr static int kPoseNumClass = 1;\nconst static int kNumberOfPoints = 17;  // number of keypoints total\n\n// obb model's number of classes\nconstexpr static int kObbNumClass = 15;\n"
  },
  {
    "path": "yolov8/include/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)                                                                    \\\n    {                                                                                          \\\n        cudaError_t error_code = callstr;                                                      \\\n        if (error_code != cudaSuccess) {                                                       \\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__; \\\n            assert(0);                                                                         \\\n        }                                                                                      \\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n"
  },
  {
    "path": "yolov8/include/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"NvInferRuntimeCommon.h\"\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf {\n   public:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream), mPrefix(prefix), mShouldLog(shouldLog) {}\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other) : mOutput(other.mOutput) {}\n\n    ~LogStreamConsumerBuffer() {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr()) {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the stream,\n    // resetting the buffer and flushing the stream\n 
   virtual int sync() {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput() {\n        if (mShouldLog) {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog) { mShouldLog = shouldLog; }\n\n   private:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase {\n   public:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog) {}\n\n   protected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! \\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  
Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream {\n   public:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(severity <= reportableSeverity),\n          mSeverity(severity) {}\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog),\n          std::ostream(&mBuffer)  // links the stream buffer with the stream\n          ,\n          mShouldLog(other.mShouldLog),\n          mSeverity(other.mSeverity) {}\n\n    void setReportableSeverity(Severity reportableSeverity) {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\n   private:\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! 
class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! object.\n\nclass Logger : public nvinfer1::ILogger {\n   public:\n    Logger(Severity severity = Severity::kWARNING) : mReportableSeverity(severity) {}\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult {\n        kRUNNING,  //!< The test is running\n        kPASSED,   //!< The test passed\n        kFAILED,   //!< The test failed\n        kWAIVED    //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger() { return *this; }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity) { mReportableSeverity = severity; }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom {\n       public:\n        TestAtom(TestAtom&&) = default;\n\n       private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started), mName(name), mCmdline(cmdline) {}\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline) {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv) {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom) {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result) {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom) {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass) {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const { return mReportableSeverity; }\n\n   private:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity) {\n        switch (severity) {\n            case Severity::kINTERNAL_ERROR:\n                return \"[F] \";\n            case Severity::kERROR:\n                return \"[E] \";\n            case Severity::kWARNING:\n                return \"[W] \";\n            case Severity::kINFO:\n                return \"[I] \";\n            case Severity::kVERBOSE:\n                return \"[V] \";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result) {\n        switch (result) {\n            case TestResult::kRUNNING:\n                return \"RUNNING\";\n            case TestResult::kPASSED:\n                return \"PASSED\";\n            case TestResult::kFAILED:\n                return \"FAILED\";\n            case TestResult::kWAIVED:\n                return \"WAIVED\";\n            default:\n                assert(0);\n                return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity) {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result) {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv) {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++) {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace {\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//         (\"fatal\" severity)\n//!\n//! Example usage:\n//!\n//!     
LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger) {\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n}  // anonymous namespace\n\n#endif  // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov8/include/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include \"NvInfer.h\"\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "yolov8/include/model.h",
    "content": "#pragma once\n#include <assert.h>\n#include <string>\n#include \"NvInfer.h\"\n\nnvinfer1::IHostMemory* buildEngineYolov8Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov8DetP6(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov8DetP2(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov8Cls(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw);\n\nnvinfer1::IHostMemory* buildEngineYolov8Seg(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov8Pose(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov8PoseP6(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                               nvinfer1::DataType dt, const 
std::string& wts_path, float& gd, float& gw,\n                                               int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov8_5uDet(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                               nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                               int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov8_5uDetP6(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                                 nvinfer1::DataType dt, const std::string& wts_path, float& gd,\n                                                 float& gw, int& max_channels);\n\nnvinfer1::IHostMemory* buildEngineYolov8Obb(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels);\n"
  },
  {
    "path": "yolov8/include/postprocess.h",
    "content": "#pragma once\n\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\n// Preprocessing functions\ncv::Rect get_rect(cv::Mat& img, float bbox[4]);\n\n// Processing functions\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch);\nvoid batch_process_obb(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                       int bbox_element, const std::vector<cv::Mat>& img_batch);\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count);\nvoid process_decode_ptr_host_obb(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element,\n                                 cv::Mat& img, int count);\n\n// NMS functions\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh = 0.5);\nvoid batch_nms(std::vector<std::vector<Detection>>& batch_res, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh = 0.5);\nvoid nms_obb(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh = 0.5);\nvoid batch_nms_obb(std::vector<std::vector<Detection>>& batch_res, float* output, int batch_size, int output_size,\n                   float conf_thresh, float nms_thresh = 0.5);\n\n// CUDA-related functions\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 cudaStream_t stream);\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream);\nvoid cuda_decode_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                     cudaStream_t stream);\nvoid cuda_nms_obb(float* parray, float nms_threshold, int 
max_objects, cudaStream_t stream);\n\n// Drawing functions\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\nvoid draw_bbox_obb(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\nvoid draw_bbox_keypoints_line(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks,\n                    std::unordered_map<int, std::string>& labels_map);\n"
  },
  {
    "path": "yolov8/include/preprocess.h",
    "content": "#pragma once\n\n#include <map>\n#include <opencv2/opencv.hpp>\n#include \"NvInfer.h\"\n#include \"types.h\"\n\nvoid cuda_preprocess_init(int max_image_size);\n\nvoid cuda_preprocess_destroy();\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream);\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream);\n"
  },
  {
    "path": "yolov8/include/types.h",
    "content": "#pragma once\n#include \"config.h\"\n\nstruct alignas(float) Detection {\n    //center_x center_y w h\n    float bbox[4];\n    float conf;  // bbox_conf * cls_conf\n    float class_id;\n    float mask[32];\n    float keypoints[kNumberOfPoints * 3];  // keypoints array with dynamic size based on kNumberOfPoints\n    float angle;                           // obb angle\n};\n\nstruct AffineMatrix {\n    float value[6];\n};\n\nconst int bbox_element =\n        sizeof(AffineMatrix) / sizeof(float) + 1;  // left, top, right, bottom, confidence, class, keepflag\n"
  },
  {
    "path": "yolov8/include/utils.h",
    "content": "#pragma once\n#include <cstring>\n#include <dirent.h>\n#include <fstream>\n#include <opencv2/opencv.hpp>\n#include <sstream>\n#include <string>\n#include <unordered_map>\n#include <vector>\n\n// Letterbox: resize while keeping the aspect ratio, then pad with gray (128) to input_w x input_h\nstatic inline cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols * 1.0);\n    float r_h = input_h / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR* p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n\n// Trims leading and trailing spaces (' ' only) from a string; an all-space string is returned unchanged\nstatic inline std::string trim_leading_whitespace(const std::string& str) {\n    size_t first = str.find_first_not_of(' ');\n    if (std::string::npos == first) {\n        return str;\n    }\n    size_t last = str.find_last_not_of(' ');\n    return str.substr(first, (last - first + 1));\n}\n\n// Src: https://stackoverflow.com/questions/16605967\nstatic inline std::string to_string_with_precision(const float a_value, const int n = 2) {\n    std::ostringstream 
out;\n    out.precision(n);\n    out << std::fixed << a_value;\n    return out.str();\n}\n\nstatic inline int read_labels(const std::string labels_filename, std::unordered_map<int, std::string>& labels_map) {\n    std::ifstream file(labels_filename);\n    // Read each line of the file\n    std::string line;\n    int index = 0;\n    while (std::getline(file, line)) {\n        // Strip the line of any leading or trailing whitespace\n        line = trim_leading_whitespace(line);\n\n        // Add the stripped line to the labels_map, using the loop index as the key\n        labels_map[index] = line;\n        index++;\n    }\n    // Close the file\n    file.close();\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov8/plugin/yololayer.cu",
    "content": "#include <assert.h>\n#include <math.h>\n#include <iostream>\n#include <vector>\n#include \"cuda_utils.h\"\n#include \"types.h\"\n#include \"yololayer.h\"\n\nnamespace Tn {\ntemplate <typename T>\nvoid write(char*& buffer, const T& val) {\n    *reinterpret_cast<T*>(buffer) = val;\n    buffer += sizeof(T);\n}\n\ntemplate <typename T>\nvoid read(const char*& buffer, T& val) {\n    val = *reinterpret_cast<const T*>(buffer);\n    buffer += sizeof(T);\n}\n}  // namespace Tn\n\n__device__ float sigmoid(float x) {\n    return 1.0f / (1.0f + exp(-x));\n}\n\nnamespace nvinfer1 {\nYoloLayerPlugin::YoloLayerPlugin(int classCount, int numberofpoints, float confthreshkeypoints, int netWidth,\n                                 int netHeight, int maxOut, bool is_segmentation, bool is_pose, bool is_obb,\n                                 const int* strides, int stridesLength) {\n\n    mClassCount = classCount;\n    mNumberofpoints = numberofpoints;\n    mConfthreshkeypoints = confthreshkeypoints;\n    mYoloV8NetWidth = netWidth;\n    mYoloV8netHeight = netHeight;\n    mMaxOutObject = maxOut;\n    mStridesLength = stridesLength;\n    mStrides = new int[stridesLength];\n    memcpy(mStrides, strides, stridesLength * sizeof(int));\n    is_segmentation_ = is_segmentation;\n    is_pose_ = is_pose;\n    is_obb_ = is_obb;\n}\n\nYoloLayerPlugin::~YoloLayerPlugin() {\n    if (mStrides != nullptr) {\n        delete[] mStrides;\n        mStrides = nullptr;\n    }\n}\n\nYoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length) {\n    using namespace Tn;\n    const char *d = reinterpret_cast<const char*>(data), *a = d;\n    read(d, mClassCount);\n    read(d, mNumberofpoints);\n    read(d, mConfthreshkeypoints);\n    read(d, mThreadCount);\n    read(d, mYoloV8NetWidth);\n    read(d, mYoloV8netHeight);\n    read(d, mMaxOutObject);\n    read(d, mStridesLength);\n    mStrides = new int[mStridesLength];\n    for (int i = 0; i < mStridesLength; ++i) {\n        read(d, 
mStrides[i]);\n    }\n    read(d, is_segmentation_);\n    read(d, is_pose_);\n    read(d, is_obb_);\n\n    assert(d == a + length);\n}\n\nvoid YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT {\n\n    using namespace Tn;\n    char *d = static_cast<char*>(buffer), *a = d;\n    write(d, mClassCount);\n    write(d, mNumberofpoints);\n    write(d, mConfthreshkeypoints);\n    write(d, mThreadCount);\n    write(d, mYoloV8NetWidth);\n    write(d, mYoloV8netHeight);\n    write(d, mMaxOutObject);\n    write(d, mStridesLength);\n    for (int i = 0; i < mStridesLength; ++i) {\n        write(d, mStrides[i]);\n    }\n    write(d, is_segmentation_);\n    write(d, is_pose_);\n    write(d, is_obb_);\n\n    assert(d == a + getSerializationSize());\n}\n\nsize_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT {\n    return sizeof(mClassCount) + sizeof(mNumberofpoints) + sizeof(mConfthreshkeypoints) + sizeof(mThreadCount) +\n           sizeof(mYoloV8netHeight) + sizeof(mYoloV8NetWidth) + sizeof(mMaxOutObject) + sizeof(mStridesLength) +\n           sizeof(int) * mStridesLength + sizeof(is_segmentation_) + sizeof(is_pose_) + sizeof(is_obb_);\n}\n\nint YoloLayerPlugin::initialize() TRT_NOEXCEPT {\n    return 0;\n}\n\nnvinfer1::Dims YoloLayerPlugin::getOutputDimensions(int index, const nvinfer1::Dims* inputs,\n                                                    int nbInputDims) TRT_NOEXCEPT {\n    int total_size = mMaxOutObject * sizeof(Detection) / sizeof(float);\n    return nvinfer1::Dims3(total_size + 1, 1, 1);\n}\n\nvoid YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT {\n    mPluginNamespace = pluginNamespace;\n}\n\nconst char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT {\n    return mPluginNamespace;\n}\n\nnvinfer1::DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes,\n                                                      int nbInputs) const TRT_NOEXCEPT {\n    return 
nvinfer1::DataType::kFLOAT;\n}\n\nbool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                                   int nbInputs) const TRT_NOEXCEPT {\n\n    return false;\n}\n\nbool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT {\n\n    return false;\n}\n\nvoid YoloLayerPlugin::configurePlugin(nvinfer1::PluginTensorDesc const* in, int nbInput,\n                                      nvinfer1::PluginTensorDesc const* out, int nbOutput) TRT_NOEXCEPT{};\n\nvoid YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                                      IGpuAllocator* gpuAllocator) TRT_NOEXCEPT{};\n\nvoid YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\nconst char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT {\n\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nvoid YoloLayerPlugin::destroy() TRT_NOEXCEPT {\n    delete this;\n}\n\nnvinfer1::IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT {\n\n    YoloLayerPlugin* p =\n            new YoloLayerPlugin(mClassCount, mNumberofpoints, mConfthreshkeypoints, mYoloV8NetWidth, mYoloV8netHeight,\n                                mMaxOutObject, is_segmentation_, is_pose_, is_obb_, mStrides, mStridesLength);\n    p->setPluginNamespace(mPluginNamespace);\n    return p;\n}\n\nint YoloLayerPlugin::enqueue(int batchSize, const void* TRT_CONST_ENQUEUE* inputs, void* const* outputs,\n                             void* workspace, cudaStream_t stream) TRT_NOEXCEPT {\n\n    forwardGpu((const float* const*)inputs, (float*)outputs[0], stream, mYoloV8netHeight, mYoloV8NetWidth, batchSize);\n    return 0;\n}\n\n__device__ float Logist(float data) {\n    return 1.0f / (1.0f + expf(-data));\n};\n\n__global__ void CalDetection(const float* input, float* output, int numElements, int maxoutobject, const 
int grid_h,\n                             int grid_w, const int stride, int classes, int nk, float confkeypoints, int outputElem,\n                             bool is_segmentation, bool is_pose, bool is_obb) {\n    int idx = threadIdx.x + blockDim.x * blockIdx.x;\n    if (idx >= numElements)\n        return;\n\n    const int N_kpts = nk;\n    int total_grid = grid_h * grid_w;\n    int info_len = 4 + classes + (is_segmentation ? 32 : 0) + (is_pose ? N_kpts * 3 : 0) + (is_obb ? 1 : 0);\n    int batchIdx = idx / total_grid;\n    int elemIdx = idx % total_grid;\n    const float* curInput = input + batchIdx * total_grid * info_len;\n    int outputIdx = batchIdx * outputElem;\n\n    int class_id = 0;\n    float max_cls_prob = 0.0;\n    for (int i = 4; i < 4 + classes; i++) {\n        float p = Logist(curInput[elemIdx + i * total_grid]);\n        if (p > max_cls_prob) {\n            max_cls_prob = p;\n            class_id = i - 4;\n        }\n    }\n\n    if (max_cls_prob < 0.1)\n        return;\n\n    int count = (int)atomicAdd(output + outputIdx, 1);\n    if (count >= maxoutobject)\n        return;\n    char* data = (char*)(output + outputIdx) + sizeof(float) + count * sizeof(Detection);\n    Detection* det = (Detection*)(data);\n\n    int row = elemIdx / grid_w;\n    int col = elemIdx % grid_w;\n\n    det->conf = max_cls_prob;\n    det->class_id = class_id;\n    det->bbox[0] = (col + 0.5f - curInput[elemIdx + 0 * total_grid]) * stride;\n    det->bbox[1] = (row + 0.5f - curInput[elemIdx + 1 * total_grid]) * stride;\n    det->bbox[2] = (col + 0.5f + curInput[elemIdx + 2 * total_grid]) * stride;\n    det->bbox[3] = (row + 0.5f + curInput[elemIdx + 3 * total_grid]) * stride;\n\n    if (is_segmentation) {\n        for (int k = 0; k < 32; ++k) {\n            det->mask[k] =\n                    curInput[elemIdx + (4 + classes + (is_pose ? N_kpts * 3 : 0) + (is_obb ? 
1 : 0) + k) * total_grid];\n        }\n    }\n\n    if (is_pose) {\n        for (int kpt = 0; kpt < N_kpts; kpt++) {\n            int kpt_x_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3) * total_grid;\n            int kpt_y_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3 + 1) * total_grid;\n            int kpt_conf_idx = (4 + classes + (is_segmentation ? 32 : 0) + (is_obb ? 1 : 0) + kpt * 3 + 2) * total_grid;\n\n            float kpt_confidence = sigmoid(curInput[elemIdx + kpt_conf_idx]);\n\n            float kpt_x = (curInput[elemIdx + kpt_x_idx] * 2.0 + col) * stride;\n            float kpt_y = (curInput[elemIdx + kpt_y_idx] * 2.0 + row) * stride;\n\n            bool is_within_bbox =\n                    kpt_x >= det->bbox[0] && kpt_x <= det->bbox[2] && kpt_y >= det->bbox[1] && kpt_y <= det->bbox[3];\n\n            if (kpt_confidence < confkeypoints || !is_within_bbox) {\n                det->keypoints[kpt * 3] = -1;\n                det->keypoints[kpt * 3 + 1] = -1;\n                det->keypoints[kpt * 3 + 2] = -1;\n            } else {\n                det->keypoints[kpt * 3] = kpt_x;\n                det->keypoints[kpt * 3 + 1] = kpt_y;\n                det->keypoints[kpt * 3 + 2] = kpt_confidence;\n            }\n        }\n    }\n\n    if (is_obb) {\n        double pi = M_PI;\n        auto angle_inx = curInput[elemIdx + (4 + classes + (is_segmentation ? 32 : 0) + (is_pose ? 
N_kpts * 3 : 0) +\n                                             0) * total_grid];\n        auto angle = (sigmoid(angle_inx) - 0.25f) * pi;\n\n        auto cos1 = cos(angle);\n        auto sin1 = sin(angle);\n        auto xf = (curInput[elemIdx + 2 * total_grid] - curInput[elemIdx + 0 * total_grid]) / 2;\n        auto yf = (curInput[elemIdx + 3 * total_grid] - curInput[elemIdx + 1 * total_grid]) / 2;\n\n        auto x = xf * cos1 - yf * sin1;\n        auto y = xf * sin1 + yf * cos1;\n\n        float cx = (col + 0.5f + x) * stride;\n        float cy = (row + 0.5f + y) * stride;\n\n        float w1 = (curInput[elemIdx + 0 * total_grid] + curInput[elemIdx + 2 * total_grid]) * stride;\n        float h1 = (curInput[elemIdx + 1 * total_grid] + curInput[elemIdx + 3 * total_grid]) * stride;\n        det->bbox[0] = cx;\n        det->bbox[1] = cy;\n        det->bbox[2] = w1;\n        det->bbox[3] = h1;\n        det->angle = angle;\n    }\n}\n\nvoid YoloLayerPlugin::forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV8netHeight,\n                                 int mYoloV8NetWidth, int batchSize) {\n\n    int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n    cudaMemsetAsync(output, 0, sizeof(float), stream);\n    for (int idx = 0; idx < batchSize; ++idx) {\n        CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n    }\n    int numElem = 0;\n    int maxGrids = mStridesLength;\n    int flatGridsLen = 2 * maxGrids;\n    int* flatGrids = new int[flatGridsLen];\n\n    for (int i = 0; i < maxGrids; ++i) {\n        flatGrids[2 * i] = mYoloV8netHeight / mStrides[i];\n        flatGrids[2 * i + 1] = mYoloV8NetWidth / mStrides[i];\n    }\n\n    for (unsigned int i = 0; i < maxGrids; i++) {\n        // Access the elements of the original 2D array from the flattened 1D array\n        int grid_h = flatGrids[2 * i];      // Corresponds to the access of grids[i][0]\n        int grid_w = flatGrids[2 * i 
+ 1];  // Corresponds to the access of grids[i][1]\n        int stride = mStrides[i];\n        numElem = grid_h * grid_w * batchSize;  // Calculate the total number of elements\n        if (numElem < mThreadCount)             // Adjust the thread count if needed\n            mThreadCount = numElem;\n\n        // The CUDA kernel call remains unchanged\n        CalDetection<<<(numElem + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>>(\n                inputs[i], output, numElem, mMaxOutObject, grid_h, grid_w, stride, mClassCount, mNumberofpoints,\n                mConfthreshkeypoints, outputElem, is_segmentation_, is_pose_, is_obb_);\n    }\n\n    delete[] flatGrids;\n}\n\nPluginFieldCollection YoloPluginCreator::mFC{};\nstd::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\nYoloPluginCreator::YoloPluginCreator() {\n    mPluginAttributes.clear();\n    mFC.nbFields = mPluginAttributes.size();\n    mFC.fields = mPluginAttributes.data();\n}\n\nconst char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT {\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nconst PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT {\n    return &mFC;\n}\n\nIPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT {\n    assert(fc->nbFields == 1);\n    assert(strcmp(fc->fields[0].name, \"combinedInfo\") == 0);\n    const int* combinedInfo = static_cast<const int*>(fc->fields[0].data);\n    int netinfo_count = 9;\n    int class_count = combinedInfo[0];\n    int numberofpoints = combinedInfo[1];\n    float confthreshkeypoints = combinedInfo[2];\n    int input_w = combinedInfo[3];\n    int input_h = combinedInfo[4];\n    int max_output_object_count = combinedInfo[5];\n    bool is_segmentation = combinedInfo[6];\n    bool is_pose = combinedInfo[7];\n    bool is_obb = combinedInfo[8];\n    const int* px_arry = 
combinedInfo + netinfo_count;\n    int px_arry_length = fc->fields[0].length - netinfo_count;\n    YoloLayerPlugin* obj =\n            new YoloLayerPlugin(class_count, numberofpoints, confthreshkeypoints, input_w, input_h,\n                                max_output_object_count, is_segmentation, is_pose, is_obb, px_arry, px_arry_length);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\nIPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData,\n                                                     size_t serialLength) TRT_NOEXCEPT {\n    // This object will be deleted when the network is destroyed, which will\n    // call YoloLayerPlugin::destroy()\n    YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolov8/plugin/yololayer.h",
    "content": "#pragma once\n#include <string>\n#include <vector>\n#include \"NvInfer.h\"\n#include \"macros.h\"\nnamespace nvinfer1 {\nclass API YoloLayerPlugin : public IPluginV2IOExt {\n   public:\n    YoloLayerPlugin(int classCount, int numberofpoints, float confthreshkeypoints, int netWidth, int netHeight,\n                    int maxOut, bool is_segmentation, bool is_pose, bool is_obb, const int* strides, int stridesLength);\n\n    YoloLayerPlugin(const void* data, size_t length);\n    ~YoloLayerPlugin();\n\n    int getNbOutputs() const TRT_NOEXCEPT override { return 1; }\n\n    nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n    int initialize() TRT_NOEXCEPT override;\n\n    virtual void terminate() TRT_NOEXCEPT override {}\n\n    virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n    virtual int enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace,\n                        cudaStream_t stream) TRT_NOEXCEPT override;\n\n    virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n    virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n    bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs,\n                                   int nbOutputs) const TRT_NOEXCEPT override {\n        return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n    }\n\n    const char* getPluginType() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    void destroy() TRT_NOEXCEPT override;\n\n    IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n    void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n    nvinfer1::DataType getOutputDataType(int32_t index, nvinfer1::DataType const* 
inputTypes,\n                                         int32_t nbInputs) const TRT_NOEXCEPT;\n\n    bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted,\n                                      int nbInputs) const TRT_NOEXCEPT override;\n\n    bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n    void attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext,\n                         IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n    void configurePlugin(PluginTensorDesc const* in, int32_t nbInput, PluginTensorDesc const* out,\n                         int32_t nbOutput) TRT_NOEXCEPT override;\n\n    void detachFromContext() TRT_NOEXCEPT override;\n\n   private:\n    void forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV8netHeight,\n                    int mYoloV8NetWidth, int batchSize);\n    int mThreadCount = 256;\n    const char* mPluginNamespace;\n    int mClassCount;\n    int mNumberofpoints;\n    float mConfthreshkeypoints;\n    int mYoloV8NetWidth;\n    int mYoloV8netHeight;\n    int mMaxOutObject;\n    bool is_segmentation_;\n    bool is_pose_;\n    bool is_obb_;\n    int* mStrides;\n    int mStridesLength;\n};\n\nclass API YoloPluginCreator : public IPluginCreator {\n   public:\n    YoloPluginCreator();\n    ~YoloPluginCreator() override = default;\n\n    const char* getPluginName() const TRT_NOEXCEPT override;\n\n    const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n    const nvinfer1::PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* createPlugin(const char* name,\n                                           const nvinfer1::PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n    nvinfer1::IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData,\n                                                size_t serialLength) TRT_NOEXCEPT override;\n\n    void 
setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override { mNamespace = libNamespace; }\n\n    const char* getPluginNamespace() const TRT_NOEXCEPT override { return mNamespace.c_str(); }\n\n   private:\n    std::string mNamespace;\n    static PluginFieldCollection mFC;\n    static std::vector<PluginField> mPluginAttributes;\n};\nREGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n}  // namespace nvinfer1\n"
  },
  {
    "path": "yolov8/src/block.cpp",
    "content": "#include \"block.h\"\n#include <assert.h>\n#include <math.h>\n#include <stdlib.h>\n#include <fstream>\n#include <iostream>\n#include \"config.h\"\n#include \"yololayer.h\"\n\nint calculateP(int ksize) {\n    return ksize / 3;\n}\n\nstd::map<std::string, nvinfer1::Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, nvinfer1::Weights> WeightMap;\n\n    std::ifstream input(file);\n    assert(input.is_open() &&\n           \"Unable to load weight file. please check if the \"\n           \".wts file path is right!!!!!!\");\n\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        nvinfer1::Weights wt{nvinfer1::DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = nvinfer1::DataType::kFLOAT;\n\n        // Each weight is stored as one 32-bit hex word.\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; x++) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n        wt.count = size;\n        WeightMap[name] = wt;\n    }\n    return WeightMap;\n}\n\nstatic nvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                             std::map<std::string, nvinfer1::Weights> weightMap,\n                                             nvinfer1::ITensor& input, std::string lname, float eps) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        
scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights scale{nvinfer1::DataType::kFLOAT, scval, len};\n\n    float* shval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, shval, len};\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, pval, len};\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    nvinfer1::IScaleLayer* output = network->addScale(input, nvinfer1::ScaleMode::kCHANNEL, shift, scale, power);\n    assert(output);\n    return output;\n}\n\nnvinfer1::IElementWiseLayer* convBnSiLU(nvinfer1::INetworkDefinition* network,\n                                        std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input,\n                                        int ch, int k, int s, int p, std::string lname) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    nvinfer1::IElementWiseLayer* ew =\n            network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nnvinfer1::ILayer* 
bottleneck(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int c1, int c2, bool shortcut, float e, std::string lname) {\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c2, 3, 1, 1, lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *conv1->getOutput(0), c2, 3, 1, 1, lname + \".cv2\");\n\n    if (shortcut && c1 == c2) {\n        nvinfer1::IElementWiseLayer* ew =\n                network->addElementWise(input, *conv2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        return ew;\n    }\n    return conv2;\n}\n\nstatic nvinfer1::ILayer* bottleneck_c3(nvinfer1::INetworkDefinition* network,\n                                       std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input,\n                                       int c1, int c2, bool shortcut, float e, std::string lname) {\n    nvinfer1::IElementWiseLayer* cv1 =\n            convBnSiLU(network, weightMap, input, (int)((float)c2 * e), 1, 1, calculateP(1), lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* cv2 =\n            convBnSiLU(network, weightMap, *cv1->getOutput(0), c2, 3, 1, calculateP(3), lname + \".cv2\");\n    if (shortcut && c1 == c2) {\n        auto ew = network->addElementWise(input, *cv2->getOutput(0), nvinfer1::ElementWiseOperation::kSUM);\n        return ew;\n    }\n    return cv2;\n}\n\nnvinfer1::IElementWiseLayer* C2F(nvinfer1::INetworkDefinition* network,\n                                 std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                 int c2, int n, bool shortcut, float e, std::string lname) {\n    int c_ = (float)c2 * e;\n\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, 2 * c_, 1, 1, 0, lname + \".cv1\");\n    nvinfer1::Dims d = conv1->getOutput(0)->getDimensions();\n\n 
   nvinfer1::ISliceLayer* split1 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n                              nvinfer1::Dims3{d.d[0] / 2, d.d[1], d.d[2]}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split2 =\n            network->addSlice(*conv1->getOutput(0), nvinfer1::Dims3{d.d[0] / 2, 0, 0},\n                              nvinfer1::Dims3{d.d[0] / 2, d.d[1], d.d[2]}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ITensor* inputTensor0[] = {split1->getOutput(0), split2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensor0, 2);\n    nvinfer1::ITensor* y1 = split2->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        auto* b = bottleneck(network, weightMap, *y1, c_, c_, shortcut, 1.0, lname + \".m.\" + std::to_string(i));\n        y1 = b->getOutput(0);\n\n        nvinfer1::ITensor* inputTensors[] = {cat->getOutput(0), b->getOutput(0)};\n        cat = network->addConcatenation(inputTensors, 2);\n    }\n\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, 0, lname + \".cv2\");\n\n    return conv2;\n}\n\nnvinfer1::IElementWiseLayer* C2(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int c1,\n                                int c2, int n, bool shortcut, float e, std::string lname) {\n    assert(network != nullptr);\n    int hidden_channels = static_cast<int>(c2 * e);\n\n    // cv1 branch\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, input, 2 * hidden_channels, 1, 1, 0, lname + \".cv1\");\n    nvinfer1::ITensor* cv1_out = conv1->getOutput(0);\n\n    // Split the output of cv1 into two tensors\n    nvinfer1::Dims dims = cv1_out->getDimensions();\n    nvinfer1::ISliceLayer* split1 =\n            network->addSlice(*cv1_out, nvinfer1::Dims3{0, 0, 0}, nvinfer1::Dims3{dims.d[0] / 2, 
dims.d[1], dims.d[2]},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split2 =\n            network->addSlice(*cv1_out, nvinfer1::Dims3{dims.d[0] / 2, 0, 0},\n                              nvinfer1::Dims3{dims.d[0] / 2, dims.d[1], dims.d[2]}, nvinfer1::Dims3{1, 1, 1});\n\n    // Create y1 bottleneck sequence\n    nvinfer1::ITensor* y1 = split1->getOutput(0);\n    for (int i = 0; i < n; ++i) {\n        auto* bottleneck_layer = bottleneck(network, weightMap, *y1, hidden_channels, hidden_channels, shortcut, 1.0,\n                                            lname + \".m.\" + std::to_string(i));\n        y1 = bottleneck_layer->getOutput(0);  // update 'y1' to be the output of the current bottleneck\n    }\n\n    // Concatenate y1 with the second split of cv1\n    nvinfer1::ITensor* concatInputs[2] = {y1, split2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(concatInputs, 2);\n\n    // cv2 to produce the final output\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, 0, lname + \".cv2\");\n\n    return conv2;\n}\n\nnvinfer1::IElementWiseLayer* C3(nvinfer1::INetworkDefinition* network,\n                                std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                int c2, int n, bool shortcut, float e, std::string lname) {\n    int c_ = (float)c2 * e;\n    nvinfer1::IElementWiseLayer* cv1 = convBnSiLU(network, weightMap, input, c_, 1, 1, calculateP(1), lname + \".cv1\");\n    nvinfer1::IElementWiseLayer* cv2 = convBnSiLU(network, weightMap, input, c_, 1, 1, calculateP(1), lname + \".cv2\");\n    nvinfer1::ITensor* y1 = cv1->getOutput(0);\n    for (int i = 0; i < n; i++) {\n        auto b = bottleneck_c3(network, weightMap, *y1, c_, c_, shortcut, 1.0, lname + \".m.\" + std::to_string(i));\n        y1 = b->getOutput(0);\n    }\n    nvinfer1::ITensor* inputTensors[] 
= {y1, cv2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensors, 2);\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, calculateP(1), lname + \".cv3\");\n    return conv3;\n}\n\nnvinfer1::IElementWiseLayer* SPPF(nvinfer1::INetworkDefinition* network,\n                                  std::map<std::string, nvinfer1::Weights> weightMap, nvinfer1::ITensor& input, int c1,\n                                  int c2, int k, std::string lname) {\n    int c_ = c1 / 2;\n    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, input, c_, 1, 1, 0, lname + \".cv1\");\n    nvinfer1::IPoolingLayer* pool1 =\n            network->addPoolingNd(*conv1->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool1->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool1->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::IPoolingLayer* pool2 =\n            network->addPoolingNd(*pool1->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool2->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::IPoolingLayer* pool3 =\n            network->addPoolingNd(*pool2->getOutput(0), nvinfer1::PoolingType::kMAX, nvinfer1::DimsHW{k, k});\n    pool3->setStrideNd(nvinfer1::DimsHW{1, 1});\n    pool3->setPaddingNd(nvinfer1::DimsHW{k / 2, k / 2});\n    nvinfer1::ITensor* inputTensors[] = {conv1->getOutput(0), pool1->getOutput(0), pool2->getOutput(0),\n                                         pool3->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat = network->addConcatenation(inputTensors, 4);\n    nvinfer1::IElementWiseLayer* conv2 =\n            convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, 0, lname + \".cv2\");\n    return conv2;\n}\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n      
                       nvinfer1::ITensor& input, int ch, int grid, int k, int s, int p, std::string lname) {\n\n    nvinfer1::IShuffleLayer* shuffle1 = network->addShuffle(input);\n    shuffle1->setReshapeDimensions(nvinfer1::Dims3{4, 16, grid});\n    shuffle1->setSecondTranspose(nvinfer1::Permutation{1, 0, 2});\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*shuffle1->getOutput(0));\n\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(*softmax->getOutput(0), 1, nvinfer1::DimsHW{1, 1}, weightMap[lname], bias_empty);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n\n    nvinfer1::IShuffleLayer* shuffle2 = network->addShuffle(*conv->getOutput(0));\n    shuffle2->setReshapeDimensions(nvinfer1::Dims2{4, grid});\n\n    return shuffle2;\n}\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> dets, const int* px_arry,\n                                       int px_arry_num, int num_class, bool is_segmentation, bool is_pose,\n                                       bool is_obb) {\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n    const int netinfo_count = 9;  // The first 9 elements hold netinfo (indices 0-8 assigned below).\n    const int total_count = netinfo_count + px_arry_num;  // Total number of elements for netinfo and px_arry combined.\n\n    std::vector<int> combinedInfo(total_count);\n    // Fill in the 9 netinfo elements.\n    combinedInfo[0] = num_class;\n    combinedInfo[1] = kNumberOfPoints;\n    combinedInfo[2] = kConfThreshKeypoints;\n    combinedInfo[3] = kInputW;\n    combinedInfo[4] = kInputH;\n    combinedInfo[5] = kMaxNumOutputBbox;\n    combinedInfo[6] = is_segmentation;\n    combinedInfo[7] = is_pose;\n    
combinedInfo[8] = is_obb;\n\n    // Copy the contents of px_arry into the combinedInfo vector after the\n    // netinfo_count (9) netinfo elements.\n    std::copy(px_arry, px_arry + px_arry_num, combinedInfo.begin() + netinfo_count);\n\n    // Now let's create the PluginField object to hold this combined information.\n    nvinfer1::PluginField pluginField;\n    pluginField.name = \"combinedInfo\";  // This can be any name that the plugin will recognize\n    pluginField.data = combinedInfo.data();\n    pluginField.type = nvinfer1::PluginFieldType::kINT32;\n    pluginField.length = combinedInfo.size();\n\n    // Create the PluginFieldCollection to hold the PluginField object.\n    nvinfer1::PluginFieldCollection pluginFieldCollection;\n    pluginFieldCollection.nbFields = 1;  // We have just one field, but it's a combined array\n    pluginFieldCollection.fields = &pluginField;\n\n    // Create the plugin object using the PluginFieldCollection.\n    nvinfer1::IPluginV2* pluginObject = creator->createPlugin(\"yololayer\", &pluginFieldCollection);\n\n    // We assume that the plugin is to be added onto the network.\n    // Prepare input tensors for the YOLO Layer.\n    std::vector<nvinfer1::ITensor*> inputTensors;\n    for (auto det : dets) {\n        inputTensors.push_back(det->getOutput(0));  // Assuming each IConcatenationLayer has one output tensor.\n    }\n\n    // Add the plugin to the network using the prepared input tensors.\n    nvinfer1::IPluginV2Layer* yoloLayer = network->addPluginV2(inputTensors.data(), inputTensors.size(), *pluginObject);\n\n    return yoloLayer;  // Return the added YOLO layer.\n}\n"
  },
  {
    "path": "yolov8/src/calibrator.cpp",
    "content": "#include \"calibrator.h\"\n#include <fstream>\n#include <iostream>\n#include <iterator>\n#include <opencv2/dnn/dnn.hpp>\n#include \"cuda_utils.h\"\n#include \"utils.h\"\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir,\n                                               const char* calib_table_name, const char* input_blob_name,\n                                               bool read_cache)\n    : batchsize_(batchsize),\n      input_w_(input_w),\n      input_h_(input_h),\n      img_idx_(0),\n      img_dir_(img_dir),\n      calib_table_name_(calib_table_name),\n      input_blob_name_(input_blob_name),\n      read_cache_(read_cache) {\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2() {\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT {\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT {\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + img_files_[i]);\n        if (temp.empty()) {\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(pr_img);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0),\n                                           true, false);\n    
CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT {\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good()) {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT {\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n"
  },
  {
    "path": "yolov8/src/model.cpp",
    "content": "#include <math.h>\n#include <iostream>\n\n#include \"block.h\"\n#include \"calibrator.h\"\n#include \"config.h\"\n#include \"model.h\"\n\nstatic int get_width_5u(int x, float gw, int divisor = 8) {\n    return int(ceil((x * gw) / divisor)) * divisor;\n}\n\nstatic int get_width(int x, float gw, int max_channels, int divisor = 8) {\n    auto channel = int(ceil((x * gw) / divisor)) * divisor;\n    return channel >= max_channels ? max_channels : channel;\n}\n\nstatic int get_depth(int x, float gd) {\n    if (x == 1)\n        return 1;\n    int r = round(x * gd);\n    if (x * gd - int(x * gd) == 0.5 && (int(x * gd) % 2) == 0)\n        --r;\n    return std::max<int>(r, 1);\n}\n\nvoid calculateStrides(nvinfer1::IElementWiseLayer* conv_layers[], int size, int reference_size, int strides[]) {\n    for (int i = 0; i < size; ++i) {\n        nvinfer1::ILayer* layer = conv_layers[i];\n        nvinfer1::Dims dims = layer->getOutput(0)->getDimensions();\n        int feature_map_size = dims.d[1];\n        strides[i] = reference_size / feature_map_size;\n    }\n}\n\nstatic nvinfer1::IElementWiseLayer* Proto(nvinfer1::INetworkDefinition* network,\n                                          std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input,\n                                          std::string lname, float gw, int max_channels) {\n    int mid_channel = get_width(256, gw, max_channels);\n    auto cv1 = convBnSiLU(network, weightMap, input, mid_channel, 3, 1, 1, \"model.22.proto.cv1\");\n    float* convTranpsose_bais = (float*)weightMap[\"model.22.proto.upsample.bias\"].values;\n    int convTranpsose_bais_len = weightMap[\"model.22.proto.upsample.bias\"].count;\n    nvinfer1::Weights bias{nvinfer1::DataType::kFLOAT, convTranpsose_bais, convTranpsose_bais_len};\n    auto convTranpsose = network->addDeconvolutionNd(*cv1->getOutput(0), mid_channel, nvinfer1::DimsHW{2, 2},\n                                                     
weightMap[\"model.22.proto.upsample.weight\"], bias);\n    assert(convTranpsose);\n    convTranpsose->setStrideNd(nvinfer1::DimsHW{2, 2});\n    auto cv2 = convBnSiLU(network, weightMap, *convTranpsose->getOutput(0), mid_channel, 3, 1, 1, \"model.22.proto.cv2\");\n    auto cv3 = convBnSiLU(network, weightMap, *cv2->getOutput(0), 32, 1, 1, 0, \"model.22.proto.cv3\");\n    assert(cv3);\n    return cv3;\n}\n\nstatic nvinfer1::IShuffleLayer* cv4_conv_combined(nvinfer1::INetworkDefinition* network,\n                                                  std::map<std::string, nvinfer1::Weights>& weightMap,\n                                                  nvinfer1::ITensor& input, std::string lname, int grid_shape, float gw,\n                                                  std::string algo_type) {\n    int mid_channle = 0;\n    int output_channel = 0;\n\n    if (algo_type == \"seg\") {\n        if (gw == 0.25 || gw == 0.5) {\n            mid_channle = 32;\n        } else if (gw == 0.75) {\n            mid_channle = 48;\n        } else if (gw == 1.00) {\n            mid_channle = 64;\n        } else if (gw == 1.25) {\n            mid_channle = 80;\n        }\n\n        output_channel = 32;\n\n    } else if (algo_type == \"pose\") {\n        std::string bn_weight_key = lname + \".0.bn.weight\";\n        mid_channle = weightMap[bn_weight_key].count;\n        output_channel = kNumberOfPoints * 3;\n    } else if (algo_type == \"obb\") {\n        std::string bn_weight_key = lname + \".0.bn.weight\";\n        mid_channle = weightMap[bn_weight_key].count;\n        output_channel = 1;\n    }\n\n    auto cv0 = convBnSiLU(network, weightMap, input, mid_channle, 3, 1, 1, lname + \".0\");\n    auto cv1 = convBnSiLU(network, weightMap, *cv0->getOutput(0), mid_channle, 3, 1, 1, lname + \".1\");\n    float* cv2_bais_value = (float*)weightMap[lname + \".2\" + \".bias\"].values;\n    int cv2_bais_len = weightMap[lname + \".2\" + \".bias\"].count;\n    nvinfer1::Weights 
cv2_bais{nvinfer1::DataType::kFLOAT, cv2_bais_value, cv2_bais_len};\n    auto cv2 = network->addConvolutionNd(*cv1->getOutput(0), output_channel, nvinfer1::DimsHW{1, 1},\n                                         weightMap[lname + \".2\" + \".weight\"], cv2_bais);\n    cv2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    nvinfer1::IShuffleLayer* cv2_shuffle = network->addShuffle(*cv2->getOutput(0));\n    cv2_shuffle->setReshapeDimensions(nvinfer1::Dims2{output_channel, grid_shape});\n\n    return cv2_shuffle;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov8Det(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    /*******************************************************************************************************\n  ******************************************  YOLOV8 INPUT\n  ***********************************************\n  *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n  *****************************************  YOLOV8 BACKBONE\n  *********************************************\n  *******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, 1, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, 
*conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, 1, \"model.1\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                                             get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, 1, \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                                             get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 =\n            convBnSiLU(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, 1, \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                             get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 =\n            convBnSiLU(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, 1, \"model.7\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv8 =\n            C2F(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.8\");\n    nvinfer1::IElementWiseLayer* conv9 =\n            SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.9\");\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 HEAD\n  
*********************************************\n  *******************************************************************************************************/\n    float scale[] = {1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample10 = network->addResize(*conv9->getOutput(0));\n    assert(upsample10);\n    upsample10->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample10->setScales(scale, 3);\n\n    nvinfer1::ITensor* inputTensor11[] = {upsample10->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat11 = network->addConcatenation(inputTensor11, 2);\n\n    nvinfer1::IElementWiseLayer* conv12 =\n            C2F(network, weightMap, *cat11->getOutput(0), get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.12\");\n\n    nvinfer1::IResizeLayer* upsample13 = network->addResize(*conv12->getOutput(0));\n    assert(upsample13);\n    upsample13->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample13->setScales(scale, 3);\n\n    nvinfer1::ITensor* inputTensor14[] = {upsample13->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat14 = network->addConcatenation(inputTensor14, 2);\n\n    nvinfer1::IElementWiseLayer* conv15 =\n            C2F(network, weightMap, *cat14->getOutput(0), get_width(256, gw, max_channels),\n                get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.15\");\n    nvinfer1::IElementWiseLayer* conv16 = convBnSiLU(network, weightMap, *conv15->getOutput(0),\n                                                     get_width(256, gw, max_channels), 3, 2, 1, \"model.16\");\n    nvinfer1::ITensor* inputTensor17[] = {conv16->getOutput(0), conv12->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat17 = network->addConcatenation(inputTensor17, 2);\n    nvinfer1::IElementWiseLayer* conv18 =\n            C2F(network, weightMap, *cat17->getOutput(0), get_width(512, gw, max_channels),\n                
get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.18\");\n    nvinfer1::IElementWiseLayer* conv19 = convBnSiLU(network, weightMap, *conv18->getOutput(0),\n                                                     get_width(512, gw, max_channels), 3, 2, 1, \"model.19\");\n    nvinfer1::ITensor* inputTensor20[] = {conv19->getOutput(0), conv9->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat20 = network->addConcatenation(inputTensor20, 2);\n    nvinfer1::IElementWiseLayer* conv21 =\n            C2F(network, weightMap, *cat20->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.21\");\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 OUTPUT\n  *******************************************\n  *******************************************************************************************************/\n    int base_in_channel = (gw == 1.25) ? 80 : 64;\n    int base_out_channel = (gw == 0.25) ? 
std::max(64, std::min(kNumClass, 100)) : get_width(256, gw, max_channels);\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv22_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv15->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_0_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv2_0_2 =\n            network->addConvolutionNd(*conv22_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv2.0.2.weight\"], weightMap[\"model.22.cv2.0.2.bias\"]);\n    conv22_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv22_cv3_0_0 =\n            convBnSiLU(network, weightMap, *conv15->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.0.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_0_1 = convBnSiLU(network, weightMap, *conv22_cv3_0_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.0.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_0_2 =\n            network->addConvolutionNd(*conv22_cv3_0_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv3.0.2.weight\"], weightMap[\"model.22.cv3.0.2.bias\"]);\n    conv22_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor22_0[] = {conv22_cv2_0_2->getOutput(0), conv22_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_0 = network->addConcatenation(inputTensor22_0, 2);\n\n    // output1\n    nvinfer1::IElementWiseLayer* conv22_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv18->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.1.0\");\n    
nvinfer1::IElementWiseLayer* conv22_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_1_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv2_1_2 =\n            network->addConvolutionNd(*conv22_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv2.1.2.weight\"], weightMap[\"model.22.cv2.1.2.bias\"]);\n    conv22_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv22_cv3_1_0 =\n            convBnSiLU(network, weightMap, *conv18->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.1.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_1_1 = convBnSiLU(network, weightMap, *conv22_cv3_1_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.1.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_1_2 =\n            network->addConvolutionNd(*conv22_cv3_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv3.1.2.weight\"], weightMap[\"model.22.cv3.1.2.bias\"]);\n    conv22_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor22_1[] = {conv22_cv2_1_2->getOutput(0), conv22_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_1 = network->addConcatenation(inputTensor22_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv22_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv21->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_2_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv2_2_2 =\n            
network->addConvolution(*conv22_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.22.cv2.2.2.weight\"], weightMap[\"model.22.cv2.2.2.bias\"]);\n    nvinfer1::IElementWiseLayer* conv22_cv3_2_0 =\n            convBnSiLU(network, weightMap, *conv21->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.2.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_2_1 = convBnSiLU(network, weightMap, *conv22_cv3_2_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.2.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_2_2 =\n            network->addConvolution(*conv22_cv3_2_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.22.cv3.2.2.weight\"], weightMap[\"model.22.cv3.2.2.bias\"]);\n    nvinfer1::ITensor* inputTensor22_2[] = {conv22_cv2_2_2->getOutput(0), conv22_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_2 = network->addConcatenation(inputTensor22_2, 2);\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 DETECT\n  *******************************************\n  *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle22_0 = network->addShuffle(*cat22_0->getOutput(0));\n    shuffle22_0->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split22_0_0 = network->addSlice(\n            
*shuffle22_0->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_0_1 = network->addSlice(\n            *shuffle22_0->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_0 =\n            DFL(network, weightMap, *split22_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_0[] = {dfl22_0->getOutput(0), split22_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_0 = network->addConcatenation(inputTensor22_dfl_0, 2);\n\n    nvinfer1::IShuffleLayer* shuffle22_1 = network->addShuffle(*cat22_1->getOutput(0));\n    shuffle22_1->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split22_1_0 = network->addSlice(\n            *shuffle22_1->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_1_1 = network->addSlice(\n            *shuffle22_1->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_1 =\n            DFL(network, weightMap, *split22_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_1[] = {dfl22_1->getOutput(0), split22_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_1 = network->addConcatenation(inputTensor22_dfl_1, 2);\n\n    nvinfer1::IShuffleLayer* shuffle22_2 = 
network->addShuffle(*cat22_2->getOutput(0));\n    shuffle22_2->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split22_2_0 = network->addSlice(\n            *shuffle22_2->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_2_1 = network->addSlice(\n            *shuffle22_2->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_2 =\n            DFL(network, weightMap, *split22_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor22_dfl_2[] = {dfl22_2->getOutput(0), split22_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_2 = network->addConcatenation(inputTensor22_dfl_2, 2);\n\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat22_dfl_0, cat22_dfl_1, cat22_dfl_2},\n                         strides, stridesLength, kNumClass, false, false, false);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov8DetP6(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\n    /*******************************************************************************************************\n  ******************************************  YOLOV8 INPUT\n  ***********************************************\n  *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});\n    assert(data);\n    /*******************************************************************************************************\n  *****************************************  YOLOV8 BACKBONE\n  *********************************************\n  
*******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, 1, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, 1, \"model.1\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                                             get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, 1, \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                                             get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 =\n            convBnSiLU(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, 1, \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                             get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n\n    nvinfer1::IElementWiseLayer* conv7 =\n            convBnSiLU(network, weightMap, *conv6->getOutput(0), get_width(768, gw, max_channels), 3, 2, 1, \"model.7\");\n    nvinfer1::IElementWiseLayer* conv8 = C2F(network, weightMap, *conv7->getOutput(0), get_width(768, gw, max_channels),\n                                             get_width(768, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.8\");\n\n    nvinfer1::IElementWiseLayer* conv9 =\n            
convBnSiLU(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels), 3, 2, 1, \"model.9\");\n    nvinfer1::IElementWiseLayer* conv10 =\n            C2F(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.10\");\n\n    nvinfer1::IElementWiseLayer* conv11 =\n            SPPF(network, weightMap, *conv10->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.11\");\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 HEAD\n  *********************************************\n  *******************************************************************************************************/\n    // Head\n    float scale[] = {1.0, 2.0, 2.0};  // scale used for upsampling\n\n    // P5\n    nvinfer1::IResizeLayer* upsample12 = network->addResize(*conv11->getOutput(0));\n    upsample12->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample12->setScales(scale, 3);\n    nvinfer1::ITensor* concat13_inputs[] = {upsample12->getOutput(0), conv8->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat13 = network->addConcatenation(concat13_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv14 =\n            C2(network, weightMap, *concat13->getOutput(0), get_width(768, gw, max_channels),\n               get_width(768, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.14\");\n\n    // P4\n    nvinfer1::IResizeLayer* upsample15 = network->addResize(*conv14->getOutput(0));\n    upsample15->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample15->setScales(scale, 3);\n    nvinfer1::ITensor* concat16_inputs[] = {upsample15->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat16 = network->addConcatenation(concat16_inputs, 2);\n    
nvinfer1::IElementWiseLayer* conv17 =\n            C2(network, weightMap, *concat16->getOutput(0), get_width(512, gw, max_channels),\n               get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.17\");\n\n    // P3\n    nvinfer1::IResizeLayer* upsample18 = network->addResize(*conv17->getOutput(0));\n    upsample18->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample18->setScales(scale, 3);\n    nvinfer1::ITensor* concat19_inputs[] = {upsample18->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat19 = network->addConcatenation(concat19_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv20 =\n            C2(network, weightMap, *concat19->getOutput(0), get_width(256, gw, max_channels),\n               get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.20\");\n\n    // Additional layers for P4, P5, P6\n    // P4/16-medium\n    nvinfer1::IElementWiseLayer* conv21 = convBnSiLU(network, weightMap, *conv20->getOutput(0),\n                                                     get_width(256, gw, max_channels), 3, 2, 1, \"model.21\");\n    nvinfer1::ITensor* concat22_inputs[] = {conv21->getOutput(0), conv17->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat22 = network->addConcatenation(concat22_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv23 =\n            C2(network, weightMap, *concat22->getOutput(0), get_width(512, gw, max_channels),\n               get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.23\");\n\n    // P5/32-large\n    nvinfer1::IElementWiseLayer* conv24 = convBnSiLU(network, weightMap, *conv23->getOutput(0),\n                                                     get_width(512, gw, max_channels), 3, 2, 1, \"model.24\");\n    nvinfer1::ITensor* concat25_inputs[] = {conv24->getOutput(0), conv14->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat25 = network->addConcatenation(concat25_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv26 =\n   
         C2(network, weightMap, *concat25->getOutput(0), get_width(768, gw, max_channels),\n               get_width(768, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.26\");\n\n    // P6/64-xlarge\n    nvinfer1::IElementWiseLayer* conv27 = convBnSiLU(network, weightMap, *conv26->getOutput(0),\n                                                     get_width(768, gw, max_channels), 3, 2, 1, \"model.27\");\n    nvinfer1::ITensor* concat28_inputs[] = {conv27->getOutput(0), conv11->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat28 = network->addConcatenation(concat28_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv29 =\n            C2(network, weightMap, *concat28->getOutput(0), get_width(1024, gw, max_channels),\n               get_width(1024, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.29\");\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 OUTPUT\n  *******************************************\n  *******************************************************************************************************/\n    int base_in_channel = (gw == 1.25) ? 80 : 64;\n    int base_out_channel = (gw == 0.25) ? 
std::max(64, std::min(kNumClass, 100)) : get_width(256, gw, max_channels);\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv30_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv20->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv30_cv2_0_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv2_0_2 =\n            network->addConvolutionNd(*conv30_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.30.cv2.0.2.weight\"], weightMap[\"model.30.cv2.0.2.bias\"]);\n    conv30_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n\n    conv30_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    nvinfer1::IElementWiseLayer* conv30_cv3_0_0 =\n            convBnSiLU(network, weightMap, *conv20->getOutput(0), base_out_channel, 3, 1, 1, \"model.30.cv3.0.0\");\n\n    nvinfer1::IElementWiseLayer* conv30_cv3_0_1 = convBnSiLU(network, weightMap, *conv30_cv3_0_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.30.cv3.0.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv3_0_2 =\n            network->addConvolutionNd(*conv30_cv3_0_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.30.cv3.0.2.weight\"], weightMap[\"model.30.cv3.0.2.bias\"]);\n    conv30_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor30_0[] = {conv30_cv2_0_2->getOutput(0), conv30_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_0 = network->addConcatenation(inputTensor30_0, 2);\n\n    // output1\n    nvinfer1::IElementWiseLayer* conv30_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv23->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.1.0\");\n    
nvinfer1::IElementWiseLayer* conv30_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv30_cv2_1_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv2_1_2 =\n            network->addConvolutionNd(*conv30_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.30.cv2.1.2.weight\"], weightMap[\"model.30.cv2.1.2.bias\"]);\n    conv30_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv30_cv3_1_0 =\n            convBnSiLU(network, weightMap, *conv23->getOutput(0), base_out_channel, 3, 1, 1, \"model.30.cv3.1.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv3_1_1 = convBnSiLU(network, weightMap, *conv30_cv3_1_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.30.cv3.1.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv3_1_2 =\n            network->addConvolutionNd(*conv30_cv3_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.30.cv3.1.2.weight\"], weightMap[\"model.30.cv3.1.2.bias\"]);\n    conv30_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor30_1[] = {conv30_cv2_1_2->getOutput(0), conv30_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_1 = network->addConcatenation(inputTensor30_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv30_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv26->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv30_cv2_2_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv2_2_2 =\n            
network->addConvolution(*conv30_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.30.cv2.2.2.weight\"], weightMap[\"model.30.cv2.2.2.bias\"]);\n    conv30_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv30_cv3_2_0 =\n            convBnSiLU(network, weightMap, *conv26->getOutput(0), base_out_channel, 3, 1, 1, \"model.30.cv3.2.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv3_2_1 = convBnSiLU(network, weightMap, *conv30_cv3_2_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.30.cv3.2.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv3_2_2 =\n            network->addConvolution(*conv30_cv3_2_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.30.cv3.2.2.weight\"], weightMap[\"model.30.cv3.2.2.bias\"]);\n    conv30_cv3_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv3_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor30_2[] = {conv30_cv2_2_2->getOutput(0), conv30_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_2 = network->addConcatenation(inputTensor30_2, 2);\n\n    // output3\n    nvinfer1::IElementWiseLayer* conv30_cv2_3_0 =\n            convBnSiLU(network, weightMap, *conv29->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.3.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv2_3_1 =\n            convBnSiLU(network, weightMap, *conv30_cv2_3_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.3.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv2_3_2 =\n            network->addConvolution(*conv30_cv2_3_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.30.cv2.3.2.weight\"], weightMap[\"model.30.cv2.3.2.bias\"]);\n    conv30_cv2_3_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    
conv30_cv2_3_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv30_cv3_3_0 =\n            convBnSiLU(network, weightMap, *conv29->getOutput(0), base_out_channel, 3, 1, 1, \"model.30.cv3.3.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv3_3_1 = convBnSiLU(network, weightMap, *conv30_cv3_3_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.30.cv3.3.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv3_3_2 =\n            network->addConvolution(*conv30_cv3_3_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.30.cv3.3.2.weight\"], weightMap[\"model.30.cv3.3.2.bias\"]);\n    conv30_cv3_3_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv3_3_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor30_3[] = {conv30_cv2_3_2->getOutput(0), conv30_cv3_3_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_3 = network->addConcatenation(inputTensor30_3, 2);\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 DETECT\n  *******************************************\n  *******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7, conv9};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    // P3 processing steps (remains unchanged)\n    nvinfer1::IShuffleLayer* shuffle30_0 =\n            network->addShuffle(*cat30_0->getOutput(0));  // Reusing the previous cat30_0 as P3 concatenation layer\n    shuffle30_0->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n  
  nvinfer1::ISliceLayer* split30_0_0 = network->addSlice(\n            *shuffle30_0->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split30_0_1 = network->addSlice(\n            *shuffle30_0->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl30_0 =\n            DFL(network, weightMap, *split30_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.30.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor30_dfl_0[] = {dfl30_0->getOutput(0), split30_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_dfl_0 = network->addConcatenation(inputTensor30_dfl_0, 2);\n\n    // P4 processing steps (remains unchanged)\n    nvinfer1::IShuffleLayer* shuffle30_1 =\n            network->addShuffle(*cat30_1->getOutput(0));  // Reusing the previous cat30_1 as P4 concatenation layer\n    shuffle30_1->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split30_1_0 = network->addSlice(\n            *shuffle30_1->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split30_1_1 = network->addSlice(\n            *shuffle30_1->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl30_1 =\n            DFL(network, weightMap, *split30_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.30.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor30_dfl_1[] = {dfl30_1->getOutput(0), 
split30_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_dfl_1 = network->addConcatenation(inputTensor30_dfl_1, 2);\n\n    // P5 processing steps (remains unchanged)\n    nvinfer1::IShuffleLayer* shuffle30_2 =\n            network->addShuffle(*cat30_2->getOutput(0));  // Reusing the previous cat30_2 as P5 concatenation layer\n    shuffle30_2->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split30_2_0 = network->addSlice(\n            *shuffle30_2->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split30_2_1 = network->addSlice(\n            *shuffle30_2->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl30_2 =\n            DFL(network, weightMap, *split30_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.30.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor30_dfl_2[] = {dfl30_2->getOutput(0), split30_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_dfl_2 = network->addConcatenation(inputTensor30_dfl_2, 2);\n\n    // P6 processing steps\n    nvinfer1::IShuffleLayer* shuffle30_3 = network->addShuffle(*cat30_3->getOutput(0));\n    shuffle30_3->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[3]) * (kInputW / strides[3])});\n    nvinfer1::ISliceLayer* split30_3_0 = network->addSlice(\n            *shuffle30_3->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[3]) * (kInputW / strides[3])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split30_3_1 = network->addSlice(\n            *shuffle30_3->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / 
strides[3]) * (kInputW / strides[3])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl30_3 =\n            DFL(network, weightMap, *split30_3_0->getOutput(0), 4, (kInputH / strides[3]) * (kInputW / strides[3]), 1,\n                1, 0, \"model.30.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor30_dfl_3[] = {dfl30_3->getOutput(0), split30_3_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_dfl_3 = network->addConcatenation(inputTensor30_dfl_3, 2);\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(\n            network, std::vector<nvinfer1::IConcatenationLayer*>{cat30_dfl_0, cat30_dfl_1, cat30_dfl_2, cat30_dfl_3},\n            strides, stridesLength, kNumClass, false, false, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov8DetP2(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                              nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                              int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    /*******************************************************************************************************\n  ******************************************  YOLOV8 INPUT\n  ***********************************************\n  *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n  *****************************************  YOLOV8 BACKBONE\n  *********************************************\n  
*******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, 1, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, 1, \"model.1\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                                             get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, 1, \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                                             get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 =\n            convBnSiLU(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, 1, \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                             get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 =\n            convBnSiLU(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, 1, \"model.7\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv8 =\n            C2F(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.8\");\n    nvinfer1::IElementWiseLayer* conv9 =\n            
SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.9\");\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 HEAD\n  *********************************************\n  *******************************************************************************************************/\n    // Head\n    float scale[] = {1.0, 2.0, 2.0};  // scale used for upsampling\n\n    // P4\n    nvinfer1::IResizeLayer* upsample10 =\n            network->addResize(*conv9->getOutput(0));  // conv9 (SPPF) is the last backbone layer,\n                                                       // i.e. the P5 feature map.\n    upsample10->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample10->setScales(scale, 3);\n    nvinfer1::ITensor* concat11_inputs[] = {upsample10->getOutput(0),\n                                            conv6->getOutput(0)};  // conv6 is the backbone P4 feature map\n    nvinfer1::IConcatenationLayer* concat11 = network->addConcatenation(concat11_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv12 =\n            C2F(network, weightMap, *concat11->getOutput(0), get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.12\");\n\n    // P3\n    nvinfer1::IResizeLayer* upsample13 = network->addResize(*conv12->getOutput(0));\n    upsample13->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample13->setScales(scale, 3);\n    nvinfer1::ITensor* concat14_inputs[] = {upsample13->getOutput(0),\n                                            conv4->getOutput(0)};  // conv4 is the backbone P3 feature map\n    nvinfer1::IConcatenationLayer* concat14 = 
network->addConcatenation(concat14_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv15 =\n            C2F(network, weightMap, *concat14->getOutput(0), get_width(256, gw, max_channels),\n                get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.15\");\n\n    // P2\n    nvinfer1::IResizeLayer* upsample16 = network->addResize(*conv15->getOutput(0));\n    upsample16->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample16->setScales(scale, 3);\n    nvinfer1::ITensor* concat17_inputs[] = {upsample16->getOutput(0),\n                                            conv2->getOutput(0)};  // conv2 is the backbone P2 feature map\n    nvinfer1::IConcatenationLayer* concat17 = network->addConcatenation(concat17_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv18 =\n            C2F(network, weightMap, *concat17->getOutput(0), get_width(128, gw, max_channels),\n                get_width(128, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.18\");\n\n    // Additional layers for P3, P4, P5\n    // Downsample and concatenate for P3\n    nvinfer1::IElementWiseLayer* conv19 = convBnSiLU(network, weightMap, *conv18->getOutput(0),\n                                                     get_width(128, gw, max_channels), 3, 2, 1, \"model.19\");\n    nvinfer1::ITensor* concat20_inputs[] = {\n            conv19->getOutput(0), conv15->getOutput(0)};  // concatenate with higher-resolution feature map from P3\n    nvinfer1::IConcatenationLayer* concat20 = network->addConcatenation(concat20_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv21 =\n            C2F(network, weightMap, *concat20->getOutput(0), get_width(256, gw, max_channels),\n                get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.21\");\n\n    // Downsample and concatenate for P4\n    nvinfer1::IElementWiseLayer* conv22 = convBnSiLU(network, weightMap, *conv21->getOutput(0),\n                                                     get_width(256, gw, 
max_channels), 3, 2, 1, \"model.22\");\n    nvinfer1::ITensor* concat23_inputs[] = {\n            conv22->getOutput(0), conv12->getOutput(0)};  // concatenate with higher-resolution feature map from P4\n    nvinfer1::IConcatenationLayer* concat23 = network->addConcatenation(concat23_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv24 =\n            C2F(network, weightMap, *concat23->getOutput(0), get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.24\");\n\n    // Downsample and concatenate for P5\n    nvinfer1::IElementWiseLayer* conv25 = convBnSiLU(network, weightMap, *conv24->getOutput(0),\n                                                     get_width(512, gw, max_channels), 3, 2, 1, \"model.25\");\n    nvinfer1::ITensor* concat26_inputs[] = {\n            conv25->getOutput(0), conv9->getOutput(0)};  // concatenate with higher-resolution feature map from P5\n    nvinfer1::IConcatenationLayer* concat26 = network->addConcatenation(concat26_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv27 =\n            C2F(network, weightMap, *concat26->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.27\");\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 OUTPUT\n  *******************************************\n  *******************************************************************************************************/\n    int base_in_channel = 64;\n    int base_out_channel = (gw == 0.25) ? 
std::max(64, std::min(kNumClass, 100)) : get_width(128, gw, max_channels);\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv28_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv18->getOutput(0), base_in_channel, 3, 1, 1, \"model.28.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv28_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv28_cv2_0_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.28.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv28_cv2_0_2 =\n            network->addConvolutionNd(*conv28_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.28.cv2.0.2.weight\"], weightMap[\"model.28.cv2.0.2.bias\"]);\n    conv28_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv28_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv28_cv3_0_0 =\n            convBnSiLU(network, weightMap, *conv18->getOutput(0), base_out_channel, 3, 1, 1, \"model.28.cv3.0.0\");\n    nvinfer1::IElementWiseLayer* conv28_cv3_0_1 = convBnSiLU(network, weightMap, *conv28_cv3_0_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.28.cv3.0.1\");\n    nvinfer1::IConvolutionLayer* conv28_cv3_0_2 =\n            network->addConvolutionNd(*conv28_cv3_0_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.28.cv3.0.2.weight\"], weightMap[\"model.28.cv3.0.2.bias\"]);\n    conv28_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv28_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor28_0[] = {conv28_cv2_0_2->getOutput(0), conv28_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat28_0 = network->addConcatenation(inputTensor28_0, 2);\n\n    // output1\n    nvinfer1::IElementWiseLayer* conv28_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv21->getOutput(0), base_in_channel, 3, 1, 1, \"model.28.cv2.1.0\");\n   
 nvinfer1::IElementWiseLayer* conv28_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv28_cv2_1_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.28.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv28_cv2_1_2 =\n            network->addConvolutionNd(*conv28_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.28.cv2.1.2.weight\"], weightMap[\"model.28.cv2.1.2.bias\"]);\n    conv28_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv28_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv28_cv3_1_0 =\n            convBnSiLU(network, weightMap, *conv21->getOutput(0), base_out_channel, 3, 1, 1, \"model.28.cv3.1.0\");\n    nvinfer1::IElementWiseLayer* conv28_cv3_1_1 = convBnSiLU(network, weightMap, *conv28_cv3_1_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.28.cv3.1.1\");\n    nvinfer1::IConvolutionLayer* conv28_cv3_1_2 =\n            network->addConvolutionNd(*conv28_cv3_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.28.cv3.1.2.weight\"], weightMap[\"model.28.cv3.1.2.bias\"]);\n    conv28_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv28_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor28_1[] = {conv28_cv2_1_2->getOutput(0), conv28_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat28_1 = network->addConcatenation(inputTensor28_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv28_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv24->getOutput(0), base_in_channel, 3, 1, 1, \"model.28.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv28_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv28_cv2_2_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.28.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv28_cv2_2_2 =\n            
network->addConvolution(*conv28_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.28.cv2.2.2.weight\"], weightMap[\"model.28.cv2.2.2.bias\"]);\n    conv28_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv28_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv28_cv3_2_0 =\n            convBnSiLU(network, weightMap, *conv24->getOutput(0), base_out_channel, 3, 1, 1, \"model.28.cv3.2.0\");\n    nvinfer1::IElementWiseLayer* conv28_cv3_2_1 = convBnSiLU(network, weightMap, *conv28_cv3_2_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.28.cv3.2.1\");\n    nvinfer1::IConvolutionLayer* conv28_cv3_2_2 =\n            network->addConvolution(*conv28_cv3_2_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.28.cv3.2.2.weight\"], weightMap[\"model.28.cv3.2.2.bias\"]);\n    conv28_cv3_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv28_cv3_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor28_2[] = {conv28_cv2_2_2->getOutput(0), conv28_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat28_2 = network->addConcatenation(inputTensor28_2, 2);\n\n    // output3\n    nvinfer1::IElementWiseLayer* conv28_cv2_3_0 =\n            convBnSiLU(network, weightMap, *conv27->getOutput(0), base_in_channel, 3, 1, 1, \"model.28.cv2.3.0\");\n    nvinfer1::IElementWiseLayer* conv28_cv2_3_1 =\n            convBnSiLU(network, weightMap, *conv28_cv2_3_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.28.cv2.3.1\");\n    nvinfer1::IConvolutionLayer* conv28_cv2_3_2 =\n            network->addConvolution(*conv28_cv2_3_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.28.cv2.3.2.weight\"], weightMap[\"model.28.cv2.3.2.bias\"]);\n    conv28_cv2_3_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    
conv28_cv2_3_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv28_cv3_3_0 =\n            convBnSiLU(network, weightMap, *conv27->getOutput(0), base_out_channel, 3, 1, 1, \"model.28.cv3.3.0\");\n    nvinfer1::IElementWiseLayer* conv28_cv3_3_1 = convBnSiLU(network, weightMap, *conv28_cv3_3_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.28.cv3.3.1\");\n    nvinfer1::IConvolutionLayer* conv28_cv3_3_2 =\n            network->addConvolution(*conv28_cv3_3_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.28.cv3.3.2.weight\"], weightMap[\"model.28.cv3.3.2.bias\"]);\n    conv28_cv3_3_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv28_cv3_3_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor28_3[] = {conv28_cv2_3_2->getOutput(0), conv28_cv3_3_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat28_3 = network->addConcatenation(inputTensor28_3, 2);\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 DETECT\n  *******************************************\n  *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv1, conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    // P2 processing steps (remains unchanged)\n    nvinfer1::IShuffleLayer* shuffle28_0 = network->addShuffle(*cat28_0->getOutput(0));\n    shuffle28_0->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split28_0_0 = network->addSlice(\n           
 *shuffle28_0->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split28_0_1 = network->addSlice(\n            *shuffle28_0->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl28_0 =\n            DFL(network, weightMap, *split28_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.28.dfl.conv.weight\");\n\n    nvinfer1::ITensor* inputTensor28_dfl_0[] = {dfl28_0->getOutput(0), split28_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat28_dfl_0 = network->addConcatenation(inputTensor28_dfl_0, 2);\n\n    // P3 processing steps (remains unchanged)\n    nvinfer1::IShuffleLayer* shuffle28_1 = network->addShuffle(*cat28_1->getOutput(0));\n    shuffle28_1->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split28_1_0 = network->addSlice(\n            *shuffle28_1->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split28_1_1 = network->addSlice(\n            *shuffle28_1->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl28_1 =\n            DFL(network, weightMap, *split28_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.28.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor28_dfl_1[] = {dfl28_1->getOutput(0), split28_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat28_dfl_1 = network->addConcatenation(inputTensor28_dfl_1, 2);\n\n    // P4 processing steps 
(remains unchanged)\n    nvinfer1::IShuffleLayer* shuffle28_2 = network->addShuffle(*cat28_2->getOutput(0));\n    shuffle28_2->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split28_2_0 = network->addSlice(\n            *shuffle28_2->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split28_2_1 = network->addSlice(\n            *shuffle28_2->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl28_2 =\n            DFL(network, weightMap, *split28_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.28.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor28_dfl_2[] = {dfl28_2->getOutput(0), split28_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat28_dfl_2 = network->addConcatenation(inputTensor28_dfl_2, 2);\n\n    // P5 processing steps\n    nvinfer1::IShuffleLayer* shuffle28_3 = network->addShuffle(*cat28_3->getOutput(0));\n    shuffle28_3->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[3]) * (kInputW / strides[3])});\n    nvinfer1::ISliceLayer* split28_3_0 = network->addSlice(\n            *shuffle28_3->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[3]) * (kInputW / strides[3])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split28_3_1 = network->addSlice(\n            *shuffle28_3->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[3]) * (kInputW / strides[3])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl28_3 =\n            DFL(network, weightMap, *split28_3_0->getOutput(0), 4, (kInputH / strides[3]) * (kInputW / strides[3]), 1,\n        
        1, 0, \"model.28.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor28_dfl_3[] = {dfl28_3->getOutput(0), split28_3_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat28_dfl_3 = network->addConcatenation(inputTensor28_dfl_3, 2);\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(\n            network, std::vector<nvinfer1::IConcatenationLayer*>{cat28_dfl_0, cat28_dfl_1, cat28_dfl_2, cat28_dfl_3},\n            strides, stridesLength, kNumClass, false, false, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov8Cls(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = 
builder->createNetworkV2(0U);\n    int max_channels = 1280;\n    // ****************************************** YOLOV8 INPUT\n    // **********************************************\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kClsInputH, kClsInputW});\n    assert(data);\n\n    // ***************************************** YOLOV8 BACKBONE\n    // ********************************************\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, 1, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, 1, \"model.1\");\n    // C2 Block (11233)\n    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                                             get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, 1, \"model.3\");\n    // C2 Block Sequence (22466)\n    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                                             get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 =\n            convBnSiLU(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, 1, \"model.5\");\n    // C2 Block Sequence (22466)\n    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                             get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 =\n            convBnSiLU(network, weightMap, 
*conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, 1, \"model.7\");\n    // C2 Block (11233)\n    nvinfer1::IElementWiseLayer* conv8 =\n            C2F(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.8\");\n\n    // ********************************************* YOLOV8 HEAD\n    // *********************************************\n\n    auto conv_class = convBnSiLU(network, weightMap, *conv8->getOutput(0), 1280, 1, 1, 1, \"model.9.conv\");\n    // Adjusted code\n    nvinfer1::Dims dims = conv_class->getOutput(0)->getDimensions();  // Obtain the dimensions of the\n                                                                      // output of conv_class\n    assert(dims.nbDims == 3);  // Make sure there are exactly 3 dimensions (channels, height, width)\n\n    nvinfer1::IPoolingLayer* pool2 = network->addPoolingNd(*conv_class->getOutput(0), nvinfer1::PoolingType::kAVERAGE,\n                                                           nvinfer1::DimsHW{dims.d[1], dims.d[2]});\n    assert(pool2);\n\n    // Fully connected layer declaration\n    nvinfer1::IFullyConnectedLayer* yolo = network->addFullyConnected(\n            *pool2->getOutput(0), kClsNumClass, weightMap[\"model.9.linear.weight\"], weightMap[\"model.9.linear.bias\"]);\n    assert(yolo);\n\n    // Set the name for the output tensor and mark it as network output\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    // Set the maximum batch size and workspace size\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n    // Configuration according to the precision mode being used\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kClsInputW, kClsInputH, kInputQuantizationFolder,\n                                                  \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    // Begin building the engine; this may take a while\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    // Cleanup the network definition and allocated weights\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov8Seg(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    /*******************************************************************************************************\n  ******************************************  YOLOV8 INPUT\n  ***********************************************\n  *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n  *****************************************  YOLOV8 BACKBONE\n  
*********************************************\n  *******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, 1, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, 1, \"model.1\");\n    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                                             get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, 1, \"model.3\");\n    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                                             get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 =\n            convBnSiLU(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, 1, \"model.5\");\n    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                             get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 =\n            convBnSiLU(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, 1, \"model.7\");\n    nvinfer1::IElementWiseLayer* conv8 =\n            C2F(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.8\");\n    nvinfer1::IElementWiseLayer* conv9 =\n            SPPF(network, 
weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.9\");\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 HEAD\n  *********************************************\n  *******************************************************************************************************/\n    float scale[] = {1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample10 = network->addResize(*conv9->getOutput(0));\n    assert(upsample10);\n    upsample10->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample10->setScales(scale, 3);\n\n    nvinfer1::ITensor* inputTensor11[] = {upsample10->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat11 = network->addConcatenation(inputTensor11, 2);\n    nvinfer1::IElementWiseLayer* conv12 =\n            C2F(network, weightMap, *cat11->getOutput(0), get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.12\");\n\n    nvinfer1::IResizeLayer* upsample13 = network->addResize(*conv12->getOutput(0));\n    assert(upsample13);\n    upsample13->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample13->setScales(scale, 3);\n\n    nvinfer1::ITensor* inputTensor14[] = {upsample13->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat14 = network->addConcatenation(inputTensor14, 2);\n    nvinfer1::IElementWiseLayer* conv15 =\n            C2F(network, weightMap, *cat14->getOutput(0), get_width(256, gw, max_channels),\n                get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.15\");\n    nvinfer1::IElementWiseLayer* conv16 = convBnSiLU(network, weightMap, *conv15->getOutput(0),\n                                                     get_width(256, gw, max_channels), 3, 2, 1, \"model.16\");\n    nvinfer1::ITensor* 
inputTensor17[] = {conv16->getOutput(0), conv12->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat17 = network->addConcatenation(inputTensor17, 2);\n    nvinfer1::IElementWiseLayer* conv18 =\n            C2F(network, weightMap, *cat17->getOutput(0), get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.18\");\n    nvinfer1::IElementWiseLayer* conv19 = convBnSiLU(network, weightMap, *conv18->getOutput(0),\n                                                     get_width(512, gw, max_channels), 3, 2, 1, \"model.19\");\n    nvinfer1::ITensor* inputTensor20[] = {conv19->getOutput(0), conv9->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat20 = network->addConcatenation(inputTensor20, 2);\n    nvinfer1::IElementWiseLayer* conv21 =\n            C2F(network, weightMap, *cat20->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.21\");\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 OUTPUT\n  *******************************************\n  *******************************************************************************************************/\n    int base_in_channel = (gw == 1.25) ? 80 : 64;\n    int base_out_channel = (gw == 0.25) ? 
std::max(64, std::min(kNumClass, 100)) : get_width(256, gw, max_channels);\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv22_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv15->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_0_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv2_0_2 =\n            network->addConvolutionNd(*conv22_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv2.0.2.weight\"], weightMap[\"model.22.cv2.0.2.bias\"]);\n    conv22_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv22_cv3_0_0 =\n            convBnSiLU(network, weightMap, *conv15->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.0.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_0_1 = convBnSiLU(network, weightMap, *conv22_cv3_0_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.0.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_0_2 =\n            network->addConvolutionNd(*conv22_cv3_0_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv3.0.2.weight\"], weightMap[\"model.22.cv3.0.2.bias\"]);\n    conv22_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor22_0[] = {conv22_cv2_0_2->getOutput(0), conv22_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_0 = network->addConcatenation(inputTensor22_0, 2);\n\n    // output1\n    nvinfer1::IElementWiseLayer* conv22_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv18->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.1.0\");\n    
nvinfer1::IElementWiseLayer* conv22_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_1_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv2_1_2 =\n            network->addConvolutionNd(*conv22_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv2.1.2.weight\"], weightMap[\"model.22.cv2.1.2.bias\"]);\n    conv22_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv22_cv3_1_0 =\n            convBnSiLU(network, weightMap, *conv18->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.1.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_1_1 = convBnSiLU(network, weightMap, *conv22_cv3_1_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.1.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_1_2 =\n            network->addConvolutionNd(*conv22_cv3_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv3.1.2.weight\"], weightMap[\"model.22.cv3.1.2.bias\"]);\n    conv22_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor22_1[] = {conv22_cv2_1_2->getOutput(0), conv22_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_1 = network->addConcatenation(inputTensor22_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv22_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv21->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_2_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv2_2_2 =\n            
network->addConvolution(*conv22_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.22.cv2.2.2.weight\"], weightMap[\"model.22.cv2.2.2.bias\"]);\n    nvinfer1::IElementWiseLayer* conv22_cv3_2_0 =\n            convBnSiLU(network, weightMap, *conv21->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.2.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_2_1 = convBnSiLU(network, weightMap, *conv22_cv3_2_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.2.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_2_2 =\n            network->addConvolution(*conv22_cv3_2_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.22.cv3.2.2.weight\"], weightMap[\"model.22.cv3.2.2.bias\"]);\n    nvinfer1::ITensor* inputTensor22_2[] = {conv22_cv2_2_2->getOutput(0), conv22_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_2 = network->addConcatenation(inputTensor22_2, 2);\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 DETECT\n  *******************************************\n  *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle22_0 = network->addShuffle(*cat22_0->getOutput(0));\n    shuffle22_0->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split22_0_0 = network->addSlice(\n            
*shuffle22_0->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_0_1 = network->addSlice(\n            *shuffle22_0->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_0 =\n            DFL(network, weightMap, *split22_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n\n    nvinfer1::IShuffleLayer* shuffle22_1 = network->addShuffle(*cat22_1->getOutput(0));\n    shuffle22_1->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split22_1_0 = network->addSlice(\n            *shuffle22_1->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_1_1 = network->addSlice(\n            *shuffle22_1->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_1 =\n            DFL(network, weightMap, *split22_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n\n    nvinfer1::IShuffleLayer* shuffle22_2 = network->addShuffle(*cat22_2->getOutput(0));\n    shuffle22_2->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split22_2_0 = network->addSlice(\n            *shuffle22_2->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_2_1 = 
network->addSlice(\n            *shuffle22_2->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_2 =\n            DFL(network, weightMap, *split22_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n\n    // det0\n    auto proto_coef_0 = cv4_conv_combined(network, weightMap, *conv15->getOutput(0), \"model.22.cv4.0\",\n                                          (kInputH / strides[0]) * (kInputW / strides[0]), gw, \"seg\");\n    nvinfer1::ITensor* inputTensor22_dfl_0[] = {dfl22_0->getOutput(0), split22_0_1->getOutput(0),\n                                                proto_coef_0->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_0 = network->addConcatenation(inputTensor22_dfl_0, 3);\n\n    // det1\n    auto proto_coef_1 = cv4_conv_combined(network, weightMap, *conv18->getOutput(0), \"model.22.cv4.1\",\n                                          (kInputH / strides[1]) * (kInputW / strides[1]), gw, \"seg\");\n    nvinfer1::ITensor* inputTensor22_dfl_1[] = {dfl22_1->getOutput(0), split22_1_1->getOutput(0),\n                                                proto_coef_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_1 = network->addConcatenation(inputTensor22_dfl_1, 3);\n\n    // det2\n    auto proto_coef_2 = cv4_conv_combined(network, weightMap, *conv21->getOutput(0), \"model.22.cv4.2\",\n                                          (kInputH / strides[2]) * (kInputW / strides[2]), gw, \"seg\");\n    nvinfer1::ITensor* inputTensor22_dfl_2[] = {dfl22_2->getOutput(0), split22_2_1->getOutput(0),\n                                                proto_coef_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_2 = network->addConcatenation(inputTensor22_dfl_2, 3);\n\n    nvinfer1::IPluginV2Layer* yolo =\n            
addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat22_dfl_0, cat22_dfl_1, cat22_dfl_2},\n                         strides, stridesLength, kNumClass, true, false, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    auto proto = Proto(network, weightMap, *conv15->getOutput(0), \"model.22.proto\", gw, max_channels);\n    proto->getOutput(0)->setName(\"proto\");\n    network->markOutput(*proto->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov8Pose(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                             nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                             int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    
/*******************************************************************************************************\n  ******************************************  YOLOV8 INPUT\n  ***********************************************\n  *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n  *****************************************  YOLOV8 BACKBONE\n  *********************************************\n  *******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, 1, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, 1, \"model.1\");\n    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                                             get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, 1, \"model.3\");\n    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                                             get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 =\n            convBnSiLU(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, 1, \"model.5\");\n    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, 
*conv5->getOutput(0), get_width(512, gw, max_channels),\n                                             get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 =\n            convBnSiLU(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, 1, \"model.7\");\n    nvinfer1::IElementWiseLayer* conv8 =\n            C2F(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.8\");\n    nvinfer1::IElementWiseLayer* conv9 =\n            SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.9\");\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 HEAD\n  *********************************************\n  *******************************************************************************************************/\n    float scale[] = {1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample10 = network->addResize(*conv9->getOutput(0));\n    assert(upsample10);\n    upsample10->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample10->setScales(scale, 3);\n\n    nvinfer1::ITensor* inputTensor11[] = {upsample10->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat11 = network->addConcatenation(inputTensor11, 2);\n    nvinfer1::IElementWiseLayer* conv12 =\n            C2F(network, weightMap, *cat11->getOutput(0), get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.12\");\n\n    nvinfer1::IResizeLayer* upsample13 = network->addResize(*conv12->getOutput(0));\n    assert(upsample13);\n    upsample13->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample13->setScales(scale, 3);\n\n    
nvinfer1::ITensor* inputTensor14[] = {upsample13->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat14 = network->addConcatenation(inputTensor14, 2);\n    nvinfer1::IElementWiseLayer* conv15 =\n            C2F(network, weightMap, *cat14->getOutput(0), get_width(256, gw, max_channels),\n                get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.15\");\n    nvinfer1::IElementWiseLayer* conv16 = convBnSiLU(network, weightMap, *conv15->getOutput(0),\n                                                     get_width(256, gw, max_channels), 3, 2, 1, \"model.16\");\n    nvinfer1::ITensor* inputTensor17[] = {conv16->getOutput(0), conv12->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat17 = network->addConcatenation(inputTensor17, 2);\n    nvinfer1::IElementWiseLayer* conv18 =\n            C2F(network, weightMap, *cat17->getOutput(0), get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.18\");\n    nvinfer1::IElementWiseLayer* conv19 = convBnSiLU(network, weightMap, *conv18->getOutput(0),\n                                                     get_width(512, gw, max_channels), 3, 2, 1, \"model.19\");\n    nvinfer1::ITensor* inputTensor20[] = {conv19->getOutput(0), conv9->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat20 = network->addConcatenation(inputTensor20, 2);\n    nvinfer1::IElementWiseLayer* conv21 =\n            C2F(network, weightMap, *cat20->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.21\");\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 OUTPUT\n  *******************************************\n  *******************************************************************************************************/\n    int base_in_channel 
= (gw == 1.25) ? 80 : 64;\n    int base_out_channel = (gw == 0.25) ? std::max(64, std::min(kPoseNumClass, 100)) : get_width(256, gw, max_channels);\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv22_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv15->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_0_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv2_0_2 =\n            network->addConvolutionNd(*conv22_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv2.0.2.weight\"], weightMap[\"model.22.cv2.0.2.bias\"]);\n    conv22_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv22_cv3_0_0 =\n            convBnSiLU(network, weightMap, *conv15->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.0.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_0_1 = convBnSiLU(network, weightMap, *conv22_cv3_0_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.0.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_0_2 =\n            network->addConvolutionNd(*conv22_cv3_0_1->getOutput(0), kPoseNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv3.0.2.weight\"], weightMap[\"model.22.cv3.0.2.bias\"]);\n    conv22_cv3_0_2->setStride(nvinfer1::DimsHW{1, 1});\n    conv22_cv3_0_2->setPadding(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor22_0[] = {conv22_cv2_0_2->getOutput(0), conv22_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_0 = network->addConcatenation(inputTensor22_0, 2);\n\n    // output1\n    nvinfer1::IElementWiseLayer* conv22_cv2_1_0 =\n            convBnSiLU(network, weightMap, 
*conv18->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.1.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_1_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv2_1_2 =\n            network->addConvolutionNd(*conv22_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv2.1.2.weight\"], weightMap[\"model.22.cv2.1.2.bias\"]);\n    conv22_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv22_cv3_1_0 =\n            convBnSiLU(network, weightMap, *conv18->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.1.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_1_1 = convBnSiLU(network, weightMap, *conv22_cv3_1_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.1.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_1_2 =\n            network->addConvolutionNd(*conv22_cv3_1_1->getOutput(0), kPoseNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv3.1.2.weight\"], weightMap[\"model.22.cv3.1.2.bias\"]);\n    conv22_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor22_1[] = {conv22_cv2_1_2->getOutput(0), conv22_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_1 = network->addConcatenation(inputTensor22_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv22_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv21->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_2_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.2.1\");\n    
nvinfer1::IConvolutionLayer* conv22_cv2_2_2 =\n            network->addConvolution(*conv22_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.22.cv2.2.2.weight\"], weightMap[\"model.22.cv2.2.2.bias\"]);\n    nvinfer1::IElementWiseLayer* conv22_cv3_2_0 =\n            convBnSiLU(network, weightMap, *conv21->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.2.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_2_1 = convBnSiLU(network, weightMap, *conv22_cv3_2_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.2.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_2_2 =\n            network->addConvolution(*conv22_cv3_2_1->getOutput(0), kPoseNumClass, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.22.cv3.2.2.weight\"], weightMap[\"model.22.cv3.2.2.bias\"]);\n    nvinfer1::ITensor* inputTensor22_2[] = {conv22_cv2_2_2->getOutput(0), conv22_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_2 = network->addConcatenation(inputTensor22_2, 2);\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 DETECT\n  *******************************************\n  *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    /**************************************************************************************P3****************************************************************************************************************************************/\n    
nvinfer1::IShuffleLayer* shuffle22_0 = network->addShuffle(*cat22_0->getOutput(0));\n    shuffle22_0->setReshapeDimensions(\n            nvinfer1::Dims2{64 + kPoseNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split22_0_0 = network->addSlice(\n            *shuffle22_0->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_0_1 = network->addSlice(\n            *shuffle22_0->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kPoseNumClass, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_0 =\n            DFL(network, weightMap, *split22_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n\n    // det0\n    auto shuffle_conv15 = cv4_conv_combined(network, weightMap, *conv15->getOutput(0), \"model.22.cv4.0\",\n                                            (kInputH / strides[0]) * (kInputW / strides[0]), gw, \"pose\");\n\n    nvinfer1::ITensor* inputTensor22_dfl_0[] = {dfl22_0->getOutput(0), split22_0_1->getOutput(0),\n                                                shuffle_conv15->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_0 = network->addConcatenation(inputTensor22_dfl_0, 3);\n\n    /********************************************************************************************P4**********************************************************************************************************************************/\n    nvinfer1::IShuffleLayer* shuffle22_1 = network->addShuffle(*cat22_1->getOutput(0));\n    shuffle22_1->setReshapeDimensions(\n            nvinfer1::Dims2{64 + kPoseNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split22_1_0 = network->addSlice(\n            *shuffle22_1->getOutput(0), 
nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_1_1 = network->addSlice(\n            *shuffle22_1->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kPoseNumClass, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_1 =\n            DFL(network, weightMap, *split22_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n\n    // det1\n    auto shuffle_conv18 = cv4_conv_combined(network, weightMap, *conv18->getOutput(0), \"model.22.cv4.1\",\n                                            (kInputH / strides[1]) * (kInputW / strides[1]), gw, \"pose\");\n\n    nvinfer1::ITensor* inputTensor22_dfl_1[] = {dfl22_1->getOutput(0), split22_1_1->getOutput(0),\n                                                shuffle_conv18->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_1 = network->addConcatenation(inputTensor22_dfl_1, 3);\n\n    /********************************************************************************************P5**********************************************************************************************************************************/\n    nvinfer1::IShuffleLayer* shuffle22_2 = network->addShuffle(*cat22_2->getOutput(0));\n    shuffle22_2->setReshapeDimensions(\n            nvinfer1::Dims2{64 + kPoseNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split22_2_0 = network->addSlice(\n            *shuffle22_2->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_2_1 = network->addSlice(\n            *shuffle22_2->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kPoseNumClass, (kInputH / strides[2]) * 
(kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_2 =\n            DFL(network, weightMap, *split22_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n\n    // det2\n    auto shuffle_conv21 = cv4_conv_combined(network, weightMap, *conv21->getOutput(0), \"model.22.cv4.2\",\n                                            (kInputH / strides[2]) * (kInputW / strides[2]), gw, \"pose\");\n    nvinfer1::ITensor* inputTensor22_dfl_2[] = {dfl22_2->getOutput(0), split22_2_1->getOutput(0),\n                                                shuffle_conv21->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_2 = network->addConcatenation(inputTensor22_dfl_2, 3);\n\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat22_dfl_0, cat22_dfl_1, cat22_dfl_2},\n                         strides, stridesLength, kPoseNumClass, false, true, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov8PoseP6(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                               nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                               int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\n    /*******************************************************************************************************\n  ******************************************  YOLOV8 INPUT\n  ***********************************************\n  *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});\n    assert(data);\n    /*******************************************************************************************************\n  *****************************************  YOLOV8 BACKBONE\n  *********************************************\n  
*******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, 1, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, 1, \"model.1\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                                             get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, 1, \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                                             get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 =\n            convBnSiLU(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, 1, \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                             get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n\n    nvinfer1::IElementWiseLayer* conv7 =\n            convBnSiLU(network, weightMap, *conv6->getOutput(0), get_width(768, gw, max_channels), 3, 2, 1, \"model.7\");\n    nvinfer1::IElementWiseLayer* conv8 = C2F(network, weightMap, *conv7->getOutput(0), get_width(768, gw, max_channels),\n                                             get_width(768, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.8\");\n\n    nvinfer1::IElementWiseLayer* conv9 =\n            
convBnSiLU(network, weightMap, *conv8->getOutput(0), get_width(1024, gw, max_channels), 3, 2, 1, \"model.9\");\n    nvinfer1::IElementWiseLayer* conv10 =\n            C2F(network, weightMap, *conv9->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.10\");\n\n    nvinfer1::IElementWiseLayer* conv11 =\n            SPPF(network, weightMap, *conv10->getOutput(0), get_width(1024, gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.11\");\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 HEAD\n  *********************************************\n  *******************************************************************************************************/\n    // Head\n    float scale[] = {1.0, 2.0, 2.0};  // scale used for upsampling\n\n    // P5\n    nvinfer1::IResizeLayer* upsample12 = network->addResize(*conv11->getOutput(0));\n    upsample12->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample12->setScales(scale, 3);\n    nvinfer1::ITensor* concat13_inputs[] = {upsample12->getOutput(0), conv8->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat13 = network->addConcatenation(concat13_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv14 =\n            C2(network, weightMap, *concat13->getOutput(0), get_width(768, gw, max_channels),\n               get_width(768, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.14\");\n\n    // P4\n    nvinfer1::IResizeLayer* upsample15 = network->addResize(*conv14->getOutput(0));\n    upsample15->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample15->setScales(scale, 3);\n    nvinfer1::ITensor* concat16_inputs[] = {upsample15->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat16 = network->addConcatenation(concat16_inputs, 2);\n    
nvinfer1::IElementWiseLayer* conv17 =\n            C2(network, weightMap, *concat16->getOutput(0), get_width(512, gw, max_channels),\n               get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.17\");\n\n    // P3\n    nvinfer1::IResizeLayer* upsample18 = network->addResize(*conv17->getOutput(0));\n    upsample18->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample18->setScales(scale, 3);\n    nvinfer1::ITensor* concat19_inputs[] = {upsample18->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat19 = network->addConcatenation(concat19_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv20 =\n            C2(network, weightMap, *concat19->getOutput(0), get_width(256, gw, max_channels),\n               get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.20\");\n\n    // Additional layers for P4, P5, P6\n    // P4/16-medium\n    nvinfer1::IElementWiseLayer* conv21 = convBnSiLU(network, weightMap, *conv20->getOutput(0),\n                                                     get_width(256, gw, max_channels), 3, 2, 1, \"model.21\");\n    nvinfer1::ITensor* concat22_inputs[] = {conv21->getOutput(0), conv17->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat22 = network->addConcatenation(concat22_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv23 =\n            C2(network, weightMap, *concat22->getOutput(0), get_width(512, gw, max_channels),\n               get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.23\");\n\n    // P5/32-large\n    nvinfer1::IElementWiseLayer* conv24 = convBnSiLU(network, weightMap, *conv23->getOutput(0),\n                                                     get_width(512, gw, max_channels), 3, 2, 1, \"model.24\");\n    nvinfer1::ITensor* concat25_inputs[] = {conv24->getOutput(0), conv14->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat25 = network->addConcatenation(concat25_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv26 =\n   
         C2(network, weightMap, *concat25->getOutput(0), get_width(768, gw, max_channels),\n               get_width(768, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.26\");\n\n    // P6/64-xlarge\n    nvinfer1::IElementWiseLayer* conv27 = convBnSiLU(network, weightMap, *conv26->getOutput(0),\n                                                     get_width(768, gw, max_channels), 3, 2, 1, \"model.27\");\n    nvinfer1::ITensor* concat28_inputs[] = {conv27->getOutput(0), conv11->getOutput(0)};\n    nvinfer1::IConcatenationLayer* concat28 = network->addConcatenation(concat28_inputs, 2);\n    nvinfer1::IElementWiseLayer* conv29 =\n            C2(network, weightMap, *concat28->getOutput(0), get_width(1024, gw, max_channels),\n               get_width(1024, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.29\");\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 OUTPUT\n  *******************************************\n  *******************************************************************************************************/\n    int base_in_channel = (gw == 1.25) ? 80 : 64;\n    int base_out_channel = (gw == 0.25) ? 
std::max(64, std::min(kPoseNumClass, 100)) : get_width(256, gw, max_channels);\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv30_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv20->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv30_cv2_0_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv2_0_2 =\n            network->addConvolutionNd(*conv30_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.30.cv2.0.2.weight\"], weightMap[\"model.30.cv2.0.2.bias\"]);\n    conv30_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n\n    conv30_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    nvinfer1::IElementWiseLayer* conv30_cv3_0_0 =\n            convBnSiLU(network, weightMap, *conv20->getOutput(0), base_out_channel, 3, 1, 1, \"model.30.cv3.0.0\");\n\n    nvinfer1::IElementWiseLayer* conv30_cv3_0_1 = convBnSiLU(network, weightMap, *conv30_cv3_0_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.30.cv3.0.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv3_0_2 =\n            network->addConvolutionNd(*conv30_cv3_0_1->getOutput(0), kPoseNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.30.cv3.0.2.weight\"], weightMap[\"model.30.cv3.0.2.bias\"]);\n    conv30_cv3_0_2->setStride(nvinfer1::DimsHW{1, 1});\n    conv30_cv3_0_2->setPadding(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor30_0[] = {conv30_cv2_0_2->getOutput(0), conv30_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_0 = network->addConcatenation(inputTensor30_0, 2);\n\n    // output1\n    nvinfer1::IElementWiseLayer* conv30_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv23->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.1.0\");\n  
  nvinfer1::IElementWiseLayer* conv30_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv30_cv2_1_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv2_1_2 =\n            network->addConvolutionNd(*conv30_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.30.cv2.1.2.weight\"], weightMap[\"model.30.cv2.1.2.bias\"]);\n    conv30_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv30_cv3_1_0 =\n            convBnSiLU(network, weightMap, *conv23->getOutput(0), base_out_channel, 3, 1, 1, \"model.30.cv3.1.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv3_1_1 = convBnSiLU(network, weightMap, *conv30_cv3_1_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.30.cv3.1.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv3_1_2 =\n            network->addConvolutionNd(*conv30_cv3_1_1->getOutput(0), kPoseNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.30.cv3.1.2.weight\"], weightMap[\"model.30.cv3.1.2.bias\"]);\n    conv30_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor30_1[] = {conv30_cv2_1_2->getOutput(0), conv30_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_1 = network->addConcatenation(inputTensor30_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv30_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv26->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv30_cv2_2_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv2_2_2 =\n            
network->addConvolution(*conv30_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.30.cv2.2.2.weight\"], weightMap[\"model.30.cv2.2.2.bias\"]);\n    conv30_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv30_cv3_2_0 =\n            convBnSiLU(network, weightMap, *conv26->getOutput(0), base_out_channel, 3, 1, 1, \"model.30.cv3.2.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv3_2_1 = convBnSiLU(network, weightMap, *conv30_cv3_2_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.30.cv3.2.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv3_2_2 =\n            network->addConvolution(*conv30_cv3_2_1->getOutput(0), kPoseNumClass, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.30.cv3.2.2.weight\"], weightMap[\"model.30.cv3.2.2.bias\"]);\n    conv30_cv3_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv3_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor30_2[] = {conv30_cv2_2_2->getOutput(0), conv30_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_2 = network->addConcatenation(inputTensor30_2, 2);\n\n    // output3\n    nvinfer1::IElementWiseLayer* conv30_cv2_3_0 =\n            convBnSiLU(network, weightMap, *conv29->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.3.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv2_3_1 =\n            convBnSiLU(network, weightMap, *conv30_cv2_3_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.30.cv2.3.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv2_3_2 =\n            network->addConvolution(*conv30_cv2_3_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.30.cv2.3.2.weight\"], weightMap[\"model.30.cv2.3.2.bias\"]);\n    conv30_cv2_3_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    
conv30_cv2_3_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv30_cv3_3_0 =\n            convBnSiLU(network, weightMap, *conv29->getOutput(0), base_out_channel, 3, 1, 1, \"model.30.cv3.3.0\");\n    nvinfer1::IElementWiseLayer* conv30_cv3_3_1 = convBnSiLU(network, weightMap, *conv30_cv3_3_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.30.cv3.3.1\");\n    nvinfer1::IConvolutionLayer* conv30_cv3_3_2 =\n            network->addConvolution(*conv30_cv3_3_1->getOutput(0), kPoseNumClass, nvinfer1::DimsHW{1, 1},\n                                    weightMap[\"model.30.cv3.3.2.weight\"], weightMap[\"model.30.cv3.3.2.bias\"]);\n    conv30_cv3_3_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv30_cv3_3_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor30_3[] = {conv30_cv2_3_2->getOutput(0), conv30_cv3_3_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_3 = network->addConcatenation(inputTensor30_3, 2);\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV8 DETECT\n  *******************************************\n  *******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7, conv9};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    // P3 processing steps (remains unchanged)\n    nvinfer1::IShuffleLayer* shuffle30_0 =\n            network->addShuffle(*cat30_0->getOutput(0));  // Reusing the previous cat30_0 as P3 concatenation layer\n    shuffle30_0->setReshapeDimensions(\n            nvinfer1::Dims2{64 + kPoseNumClass, (kInputH / strides[0]) * 
(kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split30_0_0 = network->addSlice(\n            *shuffle30_0->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split30_0_1 = network->addSlice(\n            *shuffle30_0->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kPoseNumClass, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl30_0 =\n            DFL(network, weightMap, *split30_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.30.dfl.conv.weight\");\n\n    // det0\n    auto shuffle_conv20 = cv4_conv_combined(network, weightMap, *conv20->getOutput(0), \"model.30.cv4.0\",\n                                            (kInputH / strides[0]) * (kInputW / strides[0]), gw, \"pose\");\n    nvinfer1::ITensor* inputTensor30_dfl_0[] = {dfl30_0->getOutput(0), split30_0_1->getOutput(0),\n                                                shuffle_conv20->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_dfl_0 = network->addConcatenation(inputTensor30_dfl_0, 3);\n\n    // P4 processing steps (remains unchanged)\n    nvinfer1::IShuffleLayer* shuffle30_1 =\n            network->addShuffle(*cat30_1->getOutput(0));  // Reusing the previous cat30_1 as P4 concatenation layer\n    shuffle30_1->setReshapeDimensions(\n            nvinfer1::Dims2{64 + kPoseNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split30_1_0 = network->addSlice(\n            *shuffle30_1->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split30_1_1 = network->addSlice(\n            *shuffle30_1->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kPoseNumClass, (kInputH / 
strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl30_1 =\n            DFL(network, weightMap, *split30_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.30.dfl.conv.weight\");\n\n    // det1\n    auto shuffle_conv23 = cv4_conv_combined(network, weightMap, *conv23->getOutput(0), \"model.30.cv4.1\",\n                                            (kInputH / strides[1]) * (kInputW / strides[1]), gw, \"pose\");\n    nvinfer1::ITensor* inputTensor30_dfl_1[] = {dfl30_1->getOutput(0), split30_1_1->getOutput(0),\n                                                shuffle_conv23->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_dfl_1 = network->addConcatenation(inputTensor30_dfl_1, 3);\n\n    // P5 processing steps (remains unchanged)\n    nvinfer1::IShuffleLayer* shuffle30_2 =\n            network->addShuffle(*cat30_2->getOutput(0));  // Reusing the previous cat30_2 as P5 concatenation layer\n    shuffle30_2->setReshapeDimensions(\n            nvinfer1::Dims2{64 + kPoseNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split30_2_0 = network->addSlice(\n            *shuffle30_2->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split30_2_1 = network->addSlice(\n            *shuffle30_2->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kPoseNumClass, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl30_2 =\n            DFL(network, weightMap, *split30_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.30.dfl.conv.weight\");\n\n    // det2\n    auto shuffle_conv26 = cv4_conv_combined(network, weightMap, *conv26->getOutput(0), \"model.30.cv4.2\",\n                                            
(kInputH / strides[2]) * (kInputW / strides[2]), gw, \"pose\");\n    nvinfer1::ITensor* inputTensor30_dfl_2[] = {dfl30_2->getOutput(0), split30_2_1->getOutput(0),\n                                                shuffle_conv26->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_dfl_2 = network->addConcatenation(inputTensor30_dfl_2, 3);\n\n    // P6 processing steps\n    nvinfer1::IShuffleLayer* shuffle30_3 = network->addShuffle(*cat30_3->getOutput(0));\n    shuffle30_3->setReshapeDimensions(\n            nvinfer1::Dims2{64 + kPoseNumClass, (kInputH / strides[3]) * (kInputW / strides[3])});\n    nvinfer1::ISliceLayer* split30_3_0 = network->addSlice(\n            *shuffle30_3->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[3]) * (kInputW / strides[3])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split30_3_1 = network->addSlice(\n            *shuffle30_3->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kPoseNumClass, (kInputH / strides[3]) * (kInputW / strides[3])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl30_3 =\n            DFL(network, weightMap, *split30_3_0->getOutput(0), 4, (kInputH / strides[3]) * (kInputW / strides[3]), 1,\n                1, 0, \"model.30.dfl.conv.weight\");\n\n    // det3\n    auto shuffle_conv29 = cv4_conv_combined(network, weightMap, *conv29->getOutput(0), \"model.30.cv4.3\",\n                                            (kInputH / strides[3]) * (kInputW / strides[3]), gw, \"pose\");\n    nvinfer1::ITensor* inputTensor30_dfl_3[] = {dfl30_3->getOutput(0), split30_3_1->getOutput(0),\n                                                shuffle_conv29->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat30_dfl_3 = network->addConcatenation(inputTensor30_dfl_3, 3);\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(\n            network, std::vector<nvinfer1::IConcatenationLayer*>{cat30_dfl_0, cat30_dfl_1, cat30_dfl_2, cat30_dfl_3},\n            strides, 
stridesLength, kPoseNumClass, false, true, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov8_5uDet(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                               nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                               int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    /*******************************************************************************************************\n  ******************************************  YOLOV5U INPUT\n  ***********************************************\n  *******************************************************************************************************/\n    
nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n  *****************************************  YOLOV5U BACKBONE\n  *********************************************\n  *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width_5u(64, gw), 6, 2, calculateP(6), \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width_5u(128, gw), 3, 2, calculateP(3), \"model.1\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv2 = C3(network, weightMap, *conv1->getOutput(0), get_width_5u(128, gw),\n                                            get_width_5u(128, gw), get_depth(3, gd), true, 0.5, \"model.2\");\n\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width_5u(256, gw), 3, 2, calculateP(3), \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 = C3(network, weightMap, *conv3->getOutput(0), get_width_5u(256, gw),\n                                            get_width_5u(256, gw), get_depth(6, gd), true, 0.5, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 =\n            convBnSiLU(network, weightMap, *conv4->getOutput(0), get_width_5u(512, gw), 3, 2, calculateP(3), \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 = C3(network, weightMap, *conv5->getOutput(0), get_width_5u(512, gw),\n                                            get_width_5u(512, gw), get_depth(6, gd), true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0), get_width_5u(1024, gw), 3,\n                                                    2, 
calculateP(3), \"model.7\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv8 = C3(network, weightMap, *conv7->getOutput(0), get_width_5u(1024, gw),\n                                            get_width_5u(1024, gw), get_depth(3, gd), true, 0.5, \"model.8\");\n    nvinfer1::IElementWiseLayer* conv9 = SPPF(network, weightMap, *conv8->getOutput(0), get_width_5u(1024, gw),\n                                              get_width_5u(1024, gw), 5, \"model.9\");\n    /*******************************************************************************************************\n  *********************************************  YOLOV5U HEAD\n  *********************************************\n  *******************************************************************************************************/\n\n    //    auto conv10 = convBlock(network, weightMap, *spp9->getOutput(0),\n    //    get_width_5u(512, gw), 1, 1, 1, \"model.10\");\n\n    //*********************************************  cat backbone P4\n    //********************************************\n    nvinfer1::IElementWiseLayer* conv10 = convBnSiLU(network, weightMap, *conv9->getOutput(0), get_width_5u(512, gw), 1,\n                                                     1, calculateP(1), \"model.10\");\n    nvinfer1::IResizeLayer* upsample11 = network->addResize(*conv10->getOutput(0));\n    assert(upsample11);\n    upsample11->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample11->setOutputDimensions(conv6->getOutput(0)->getDimensions());\n    nvinfer1::ITensor* inputTensor12[] = {upsample11->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat12 = network->addConcatenation(inputTensor12, 2);\n    nvinfer1::IElementWiseLayer* conv13 = C3(network, weightMap, *cat12->getOutput(0), get_width_5u(512, gw),\n                                             get_width_5u(512, gw), get_depth(3, gd), false, 0.5, \"model.13\");\n    //*********************************************  cat backbone P4\n    
//********************************************\n\n    //*********************************************  cat backbone P3\n    //********************************************\n    nvinfer1::IElementWiseLayer* conv14 = convBnSiLU(network, weightMap, *conv13->getOutput(0), get_width_5u(256, gw),\n                                                     1, 1, calculateP(1), \"model.14\");\n    nvinfer1::IResizeLayer* upsample15 = network->addResize(*conv14->getOutput(0));\n    assert(upsample15);\n    upsample15->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample15->setOutputDimensions(conv4->getOutput(0)->getDimensions());\n    nvinfer1::ITensor* inputTensor16[] = {upsample15->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat16 = network->addConcatenation(inputTensor16, 2);\n    nvinfer1::IElementWiseLayer* conv17 = C3(network, weightMap, *cat16->getOutput(0), get_width_5u(256, gw),\n                                             get_width_5u(256, gw), get_depth(3, gd), false, 0.5, \"model.17\");\n    //*********************************************  cat backbone P3\n    //********************************************\n\n    //*********************************************  cat head P4\n    //********************************************\n    nvinfer1::IElementWiseLayer* conv18 = convBnSiLU(network, weightMap, *conv17->getOutput(0), get_width_5u(256, gw),\n                                                     3, 2, calculateP(3), \"model.18\");\n    nvinfer1::ITensor* inputTensor19[] = {conv18->getOutput(0), conv14->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat19 = network->addConcatenation(inputTensor19, 2);\n    nvinfer1::IElementWiseLayer* conv20 = C3(network, weightMap, *cat19->getOutput(0), get_width_5u(512, gw),\n                                             get_width_5u(512, gw), get_depth(3, gd), false, 0.5, \"model.20\");\n    //*********************************************  cat head P4\n    
//********************************************\n\n    //*********************************************  cat head P5\n    //********************************************\n    nvinfer1::IElementWiseLayer* conv21 = convBnSiLU(network, weightMap, *conv20->getOutput(0), get_width_5u(512, gw),\n                                                     3, 2, calculateP(3), \"model.21\");\n    nvinfer1::ITensor* inputTensor22[] = {conv21->getOutput(0), conv10->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22 = network->addConcatenation(inputTensor22, 2);\n    nvinfer1::IElementWiseLayer* conv23 = C3(network, weightMap, *cat22->getOutput(0), get_width_5u(1024, gw),\n                                             get_width_5u(1024, gw), get_depth(3, gd), false, 0.5, \"model.23\");\n    //*********************************************  cat head P5\n    //********************************************\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV5U OUTPUT\n  *******************************************\n  *******************************************************************************************************/\n    int base_in_channel = (gw == 1.25) ? 80 : 64;\n    int base_out_channel = (gw == 0.25) ? 
std::max(64, std::min(kNumClass, 100)) : get_width_5u(256, gw);\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv24_cv2_0_0 = convBnSiLU(network, weightMap, *conv17->getOutput(0), base_in_channel,\n                                                             3, 1, calculateP(3), \"model.24.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv24_cv2_0_1 = convBnSiLU(network, weightMap, *conv24_cv2_0_0->getOutput(0),\n                                                             base_in_channel, 3, 1, calculateP(3), \"model.24.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv24_cv2_0_2 =\n            network->addConvolutionNd(*conv24_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.24.cv2.0.2.weight\"], weightMap[\"model.24.cv2.0.2.bias\"]);\n    conv24_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv24_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv24_cv3_0_0 = convBnSiLU(network, weightMap, *conv17->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.24.cv3.0.0\");\n    nvinfer1::IElementWiseLayer* conv24_cv3_0_1 = convBnSiLU(network, weightMap, *conv24_cv3_0_0->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.24.cv3.0.1\");\n    nvinfer1::IConvolutionLayer* conv24_cv3_0_2 =\n            network->addConvolutionNd(*conv24_cv3_0_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.24.cv3.0.2.weight\"], weightMap[\"model.24.cv3.0.2.bias\"]);\n    conv24_cv3_0_2->setStride(nvinfer1::DimsHW{1, 1});\n    conv24_cv3_0_2->setPadding(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor24_0[] = {conv24_cv2_0_2->getOutput(0), conv24_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat24_0 = network->addConcatenation(inputTensor24_0, 2);\n\n    // 
output1\n    nvinfer1::IElementWiseLayer* conv24_cv2_1_0 = convBnSiLU(network, weightMap, *conv20->getOutput(0), base_in_channel,\n                                                             3, 1, calculateP(3), \"model.24.cv2.1.0\");\n    nvinfer1::IElementWiseLayer* conv24_cv2_1_1 = convBnSiLU(network, weightMap, *conv24_cv2_1_0->getOutput(0),\n                                                             base_in_channel, 3, 1, calculateP(3), \"model.24.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv24_cv2_1_2 =\n            network->addConvolutionNd(*conv24_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.24.cv2.1.2.weight\"], weightMap[\"model.24.cv2.1.2.bias\"]);\n    conv24_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv24_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv24_cv3_1_0 = convBnSiLU(network, weightMap, *conv20->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.24.cv3.1.0\");\n    nvinfer1::IElementWiseLayer* conv24_cv3_1_1 = convBnSiLU(network, weightMap, *conv24_cv3_1_0->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.24.cv3.1.1\");\n    nvinfer1::IConvolutionLayer* conv24_cv3_1_2 =\n            network->addConvolutionNd(*conv24_cv3_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.24.cv3.1.2.weight\"], weightMap[\"model.24.cv3.1.2.bias\"]);\n    conv24_cv3_1_2->setStride(nvinfer1::DimsHW{1, 1});\n    conv24_cv3_1_2->setPadding(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor24_1[] = {conv24_cv2_1_2->getOutput(0), conv24_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat24_1 = network->addConcatenation(inputTensor24_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv24_cv2_2_0 = convBnSiLU(network, 
weightMap, *conv23->getOutput(0), base_in_channel,\n                                                             3, 1, calculateP(3), \"model.24.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv24_cv2_2_1 = convBnSiLU(network, weightMap, *conv24_cv2_2_0->getOutput(0),\n                                                             base_in_channel, 3, 1, calculateP(3), \"model.24.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv24_cv2_2_2 =\n            network->addConvolutionNd(*conv24_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.24.cv2.2.2.weight\"], weightMap[\"model.24.cv2.2.2.bias\"]);\n    conv24_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv24_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv24_cv3_2_0 = convBnSiLU(network, weightMap, *conv23->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.24.cv3.2.0\");\n    nvinfer1::IElementWiseLayer* conv24_cv3_2_1 = convBnSiLU(network, weightMap, *conv24_cv3_2_0->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.24.cv3.2.1\");\n    nvinfer1::IConvolutionLayer* conv24_cv3_2_2 =\n            network->addConvolutionNd(*conv24_cv3_2_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.24.cv3.2.2.weight\"], weightMap[\"model.24.cv3.2.2.bias\"]);\n    conv24_cv3_2_2->setStride(nvinfer1::DimsHW{1, 1});\n    conv24_cv3_2_2->setPadding(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor24_2[] = {conv24_cv2_2_2->getOutput(0), conv24_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat24_2 = network->addConcatenation(inputTensor24_2, 2);\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV5U 
DETECT\n  *******************************************\n  *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    // det0\n    nvinfer1::IShuffleLayer* shuffle24_0 = network->addShuffle(*cat24_0->getOutput(0));\n    shuffle24_0->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split24_0_0 = network->addSlice(\n            *shuffle24_0->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split24_0_1 = network->addSlice(\n            *shuffle24_0->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl24_0 =\n            DFL(network, weightMap, *split24_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.24.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor24_dfl_0[] = {dfl24_0->getOutput(0), split24_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat24_dfl_0 = network->addConcatenation(inputTensor24_dfl_0, 2);\n\n    // det1\n    nvinfer1::IShuffleLayer* shuffle24_1 = network->addShuffle(*cat24_1->getOutput(0));\n    shuffle24_1->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split24_1_0 = network->addSlice(\n            *shuffle24_1->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[1]) * (kInputW / 
strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split24_1_1 = network->addSlice(\n            *shuffle24_1->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl24_1 =\n            DFL(network, weightMap, *split24_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.24.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor24_dfl_1[] = {dfl24_1->getOutput(0), split24_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat24_dfl_1 = network->addConcatenation(inputTensor24_dfl_1, 2);\n\n    // det2\n    nvinfer1::IShuffleLayer* shuffle24_2 = network->addShuffle(*cat24_2->getOutput(0));\n    shuffle24_2->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split24_2_0 = network->addSlice(\n            *shuffle24_2->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split24_2_1 = network->addSlice(\n            *shuffle24_2->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl24_2 =\n            DFL(network, weightMap, *split24_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.24.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor24_dfl_2[] = {dfl24_2->getOutput(0), split24_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat24_dfl_2 = network->addConcatenation(inputTensor24_dfl_2, 2);\n\n    nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat24_dfl_0, cat24_dfl_1, cat24_dfl_2},\n                         strides, 
stridesLength, kNumClass, false, false, false);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov8_5uDetP6(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                                 nvinfer1::DataType dt, const std::string& wts_path, float& gd,\n                                                 float& gw, int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    /*******************************************************************************************************\n  ******************************************  YOLOV5U-P6 INPUT\n  ***********************************************\n  *******************************************************************************************************/\n  
  nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n  *****************************************  YOLOV5U-P6 BACKBONE\n  *********************************************\n  *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width_5u(64, gw), 6, 2, calculateP(6), \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width_5u(128, gw), 3, 2, calculateP(3), \"model.1\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv2 = C3(network, weightMap, *conv1->getOutput(0), get_width_5u(128, gw),\n                                            get_width_5u(128, gw), get_depth(3, gd), true, 0.5, \"model.2\");\n\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width_5u(256, gw), 3, 2, calculateP(3), \"model.3\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv4 = C3(network, weightMap, *conv3->getOutput(0), get_width_5u(256, gw),\n                                            get_width_5u(256, gw), get_depth(6, gd), true, 0.5, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 =\n            convBnSiLU(network, weightMap, *conv4->getOutput(0), get_width_5u(512, gw), 3, 2, calculateP(3), \"model.5\");\n    // 22466\n    nvinfer1::IElementWiseLayer* conv6 = C3(network, weightMap, *conv5->getOutput(0), get_width_5u(512, gw),\n                                            get_width_5u(512, gw), get_depth(6, gd), true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 =\n            convBnSiLU(network, weightMap, *conv6->getOutput(0), get_width_5u(768, gw), 3, 2, calculateP(3), \"model.7\");\n    // 11233\n 
   nvinfer1::IElementWiseLayer* conv8 = C3(network, weightMap, *conv7->getOutput(0), get_width_5u(768, gw),\n                                            get_width_5u(768, gw), get_depth(3, gd), true, 0.5, \"model.8\");\n\n    nvinfer1::IElementWiseLayer* conv9 = convBnSiLU(network, weightMap, *conv8->getOutput(0), get_width_5u(1024, gw), 3,\n                                                    2, calculateP(3), \"model.9\");\n    // 11233\n    nvinfer1::IElementWiseLayer* conv10 = C3(network, weightMap, *conv9->getOutput(0), get_width_5u(1024, gw),\n                                             get_width_5u(1024, gw), get_depth(3, gd), true, 0.5, \"model.10\");\n\n    nvinfer1::IElementWiseLayer* conv11 = SPPF(network, weightMap, *conv10->getOutput(0), get_width_5u(1024, gw),\n                                               get_width_5u(1024, gw), 5, \"model.11\");\n    /*******************************************************************************************************\n  *********************************************  YOLOV5U-P6 HEAD\n  *********************************************\n  *******************************************************************************************************/\n\n    //*********************************************  cat backbone P5\n    //********************************************\n    nvinfer1::IElementWiseLayer* conv12 = convBnSiLU(network, weightMap, *conv11->getOutput(0), get_width_5u(768, gw),\n                                                     1, 1, calculateP(1), \"model.12\");\n    nvinfer1::IResizeLayer* upsample13 = network->addResize(*conv12->getOutput(0));\n    assert(upsample13);\n    upsample13->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample13->setOutputDimensions(conv8->getOutput(0)->getDimensions());\n    nvinfer1::ITensor* inputTensor14[] = {upsample13->getOutput(0), conv8->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat14 = network->addConcatenation(inputTensor14, 2);\n    
nvinfer1::IElementWiseLayer* conv15 = C3(network, weightMap, *cat14->getOutput(0), get_width_5u(768, gw),\n                                             get_width_5u(768, gw), get_depth(3, gd), false, 0.5, \"model.15\");\n    //*********************************************  cat backbone P5\n    //********************************************\n\n    //*********************************************  cat backbone P4\n    //********************************************\n    nvinfer1::IElementWiseLayer* conv16 = convBnSiLU(network, weightMap, *conv15->getOutput(0), get_width_5u(512, gw),\n                                                     1, 1, calculateP(1), \"model.16\");\n    nvinfer1::IResizeLayer* upsample17 = network->addResize(*conv16->getOutput(0));\n    assert(upsample17);\n    upsample17->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample17->setOutputDimensions(conv6->getOutput(0)->getDimensions());\n    nvinfer1::ITensor* inputTensor18[] = {upsample17->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat18 = network->addConcatenation(inputTensor18, 2);\n    nvinfer1::IElementWiseLayer* conv19 = C3(network, weightMap, *cat18->getOutput(0), get_width_5u(512, gw),\n                                             get_width_5u(512, gw), get_depth(3, gd), false, 0.5, \"model.19\");\n    //*********************************************  cat backbone P4\n    //********************************************\n\n    //*********************************************  cat backbone P3\n    //********************************************\n    nvinfer1::IElementWiseLayer* conv20 = convBnSiLU(network, weightMap, *conv19->getOutput(0), get_width_5u(256, gw),\n                                                     1, 1, calculateP(1), \"model.20\");\n    nvinfer1::IResizeLayer* upsample21 = network->addResize(*conv20->getOutput(0));\n    assert(upsample21);\n    upsample21->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    
upsample21->setOutputDimensions(conv4->getOutput(0)->getDimensions());\n    nvinfer1::ITensor* inputTensor22[] = {upsample21->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22 = network->addConcatenation(inputTensor22, 2);\n    nvinfer1::IElementWiseLayer* conv23 = C3(network, weightMap, *cat22->getOutput(0), get_width_5u(256, gw),\n                                             get_width_5u(256, gw), get_depth(3, gd), false, 0.5, \"model.23\");\n    //*********************************************  cat backbone P3\n    //********************************************\n\n    //*********************************************  cat head P4\n    //********************************************\n    nvinfer1::IElementWiseLayer* conv24 = convBnSiLU(network, weightMap, *conv23->getOutput(0), get_width_5u(256, gw),\n                                                     3, 2, calculateP(3), \"model.24\");\n    nvinfer1::ITensor* inputTensor25[] = {conv24->getOutput(0), conv20->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat25 = network->addConcatenation(inputTensor25, 2);\n    nvinfer1::IElementWiseLayer* conv26 = C3(network, weightMap, *cat25->getOutput(0), get_width_5u(512, gw),\n                                             get_width_5u(512, gw), get_depth(3, gd), false, 0.5, \"model.26\");\n    //*********************************************  cat head P4\n    //********************************************\n\n    //*********************************************  cat head P5\n    //********************************************\n    nvinfer1::IElementWiseLayer* conv27 = convBnSiLU(network, weightMap, *conv26->getOutput(0), get_width_5u(512, gw),\n                                                     3, 2, calculateP(3), \"model.27\");\n    nvinfer1::ITensor* inputTensor28[] = {conv27->getOutput(0), conv16->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat28 = network->addConcatenation(inputTensor28, 2);\n    nvinfer1::IElementWiseLayer* conv29 = 
C3(network, weightMap, *cat28->getOutput(0), get_width_5u(768, gw),\n                                             get_width_5u(768, gw), get_depth(3, gd), false, 0.5, \"model.29\");\n    //*********************************************  cat head P5\n    //********************************************\n\n    //*********************************************  cat head P6\n    //********************************************\n    nvinfer1::IElementWiseLayer* conv30 = convBnSiLU(network, weightMap, *conv29->getOutput(0), get_width_5u(768, gw),\n                                                     3, 2, calculateP(3), \"model.30\");\n    nvinfer1::ITensor* inputTensor31[] = {conv30->getOutput(0), conv12->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat31 = network->addConcatenation(inputTensor31, 2);\n    nvinfer1::IElementWiseLayer* conv32 = C3(network, weightMap, *cat31->getOutput(0), get_width_5u(768, gw),\n                                             get_width_5u(1024, gw), get_depth(3, gd), false, 0.5, \"model.32\");\n    //*********************************************  cat head P6\n    //********************************************\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV5U-P6 OUTPUT\n  *******************************************\n  *******************************************************************************************************/\n    int base_in_channel = (gw == 1.25) ? 80 : 64;\n    int base_out_channel = (gw == 0.25) ? 
std::max(64, std::min(kNumClass, 100)) : get_width_5u(256, gw);\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv33_cv2_0_0 = convBnSiLU(network, weightMap, *conv23->getOutput(0), base_in_channel,\n                                                             3, 1, calculateP(3), \"model.33.cv2.0.0\");\n    nvinfer1::IElementWiseLayer* conv33_cv2_0_1 = convBnSiLU(network, weightMap, *conv33_cv2_0_0->getOutput(0),\n                                                             base_in_channel, 3, 1, calculateP(3), \"model.33.cv2.0.1\");\n    nvinfer1::IConvolutionLayer* conv33_cv2_0_2 =\n            network->addConvolutionNd(*conv33_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.33.cv2.0.2.weight\"], weightMap[\"model.33.cv2.0.2.bias\"]);\n    conv33_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv33_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv33_cv3_0_0 = convBnSiLU(network, weightMap, *conv23->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.33.cv3.0.0\");\n    nvinfer1::IElementWiseLayer* conv33_cv3_0_1 = convBnSiLU(network, weightMap, *conv33_cv3_0_0->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.33.cv3.0.1\");\n    nvinfer1::IConvolutionLayer* conv33_cv3_0_2 =\n            network->addConvolutionNd(*conv33_cv3_0_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.33.cv3.0.2.weight\"], weightMap[\"model.33.cv3.0.2.bias\"]);\n    conv33_cv3_0_2->setStride(nvinfer1::DimsHW{1, 1});\n    conv33_cv3_0_2->setPadding(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor33_0[] = {conv33_cv2_0_2->getOutput(0), conv33_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat33_0 = network->addConcatenation(inputTensor33_0, 2);\n\n    // 
output1\n    nvinfer1::IElementWiseLayer* conv33_cv2_1_0 = convBnSiLU(network, weightMap, *conv26->getOutput(0), base_in_channel,\n                                                             3, 1, calculateP(3), \"model.33.cv2.1.0\");\n    nvinfer1::IElementWiseLayer* conv33_cv2_1_1 = convBnSiLU(network, weightMap, *conv33_cv2_1_0->getOutput(0),\n                                                             base_in_channel, 3, 1, calculateP(3), \"model.33.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv33_cv2_1_2 =\n            network->addConvolutionNd(*conv33_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.33.cv2.1.2.weight\"], weightMap[\"model.33.cv2.1.2.bias\"]);\n    conv33_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv33_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv33_cv3_1_0 = convBnSiLU(network, weightMap, *conv26->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.33.cv3.1.0\");\n    nvinfer1::IElementWiseLayer* conv33_cv3_1_1 = convBnSiLU(network, weightMap, *conv33_cv3_1_0->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.33.cv3.1.1\");\n    nvinfer1::IConvolutionLayer* conv33_cv3_1_2 =\n            network->addConvolutionNd(*conv33_cv3_1_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.33.cv3.1.2.weight\"], weightMap[\"model.33.cv3.1.2.bias\"]);\n    conv33_cv3_1_2->setStride(nvinfer1::DimsHW{1, 1});\n    conv33_cv3_1_2->setPadding(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor33_1[] = {conv33_cv2_1_2->getOutput(0), conv33_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat33_1 = network->addConcatenation(inputTensor33_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv33_cv2_2_0 = convBnSiLU(network, 
weightMap, *conv29->getOutput(0), base_in_channel,\n                                                             3, 1, calculateP(3), \"model.33.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv33_cv2_2_1 = convBnSiLU(network, weightMap, *conv33_cv2_2_0->getOutput(0),\n                                                             base_in_channel, 3, 1, calculateP(3), \"model.33.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv33_cv2_2_2 =\n            network->addConvolutionNd(*conv33_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.33.cv2.2.2.weight\"], weightMap[\"model.33.cv2.2.2.bias\"]);\n    conv33_cv2_2_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv33_cv2_2_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv33_cv3_2_0 = convBnSiLU(network, weightMap, *conv29->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.33.cv3.2.0\");\n    nvinfer1::IElementWiseLayer* conv33_cv3_2_1 = convBnSiLU(network, weightMap, *conv33_cv3_2_0->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.33.cv3.2.1\");\n    nvinfer1::IConvolutionLayer* conv33_cv3_2_2 =\n            network->addConvolutionNd(*conv33_cv3_2_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.33.cv3.2.2.weight\"], weightMap[\"model.33.cv3.2.2.bias\"]);\n    conv33_cv3_2_2->setStride(nvinfer1::DimsHW{1, 1});\n    conv33_cv3_2_2->setPadding(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor33_2[] = {conv33_cv2_2_2->getOutput(0), conv33_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat33_2 = network->addConcatenation(inputTensor33_2, 2);\n\n    // output3\n    nvinfer1::IElementWiseLayer* conv33_cv2_3_0 = convBnSiLU(network, weightMap, *conv32->getOutput(0), base_in_channel,\n                              
                               3, 1, calculateP(3), \"model.33.cv2.3.0\");\n    nvinfer1::IElementWiseLayer* conv33_cv2_3_1 = convBnSiLU(network, weightMap, *conv33_cv2_3_0->getOutput(0),\n                                                             base_in_channel, 3, 1, calculateP(3), \"model.33.cv2.3.1\");\n    nvinfer1::IConvolutionLayer* conv33_cv2_3_2 =\n            network->addConvolutionNd(*conv33_cv2_3_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.33.cv2.3.2.weight\"], weightMap[\"model.33.cv2.3.2.bias\"]);\n    conv33_cv2_3_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv33_cv2_3_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv33_cv3_3_0 = convBnSiLU(network, weightMap, *conv32->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.33.cv3.3.0\");\n    nvinfer1::IElementWiseLayer* conv33_cv3_3_1 = convBnSiLU(network, weightMap, *conv33_cv3_3_0->getOutput(0),\n                                                             base_out_channel, 3, 1, calculateP(3), \"model.33.cv3.3.1\");\n    nvinfer1::IConvolutionLayer* conv33_cv3_3_2 =\n            network->addConvolutionNd(*conv33_cv3_3_1->getOutput(0), kNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.33.cv3.3.2.weight\"], weightMap[\"model.33.cv3.3.2.bias\"]);\n    conv33_cv3_3_2->setStride(nvinfer1::DimsHW{1, 1});\n    conv33_cv3_3_2->setPadding(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor33_3[] = {conv33_cv2_3_2->getOutput(0), conv33_cv3_3_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat33_3 = network->addConcatenation(inputTensor33_3, 2);\n\n    /*******************************************************************************************************\n  *********************************************  YOLOV5U-P6 DETECT\n  *******************************************\n  
*******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7, conv9};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    // det0\n    nvinfer1::IShuffleLayer* shuffle33_0 = network->addShuffle(*cat33_0->getOutput(0));\n    shuffle33_0->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n    nvinfer1::ISliceLayer* split33_0_0 = network->addSlice(\n            *shuffle33_0->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split33_0_1 = network->addSlice(\n            *shuffle33_0->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl33_0 =\n            DFL(network, weightMap, *split33_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.33.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor33_dfl_0[] = {dfl33_0->getOutput(0), split33_0_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat33_dfl_0 = network->addConcatenation(inputTensor33_dfl_0, 2);\n\n    // det1\n    nvinfer1::IShuffleLayer* shuffle33_1 = network->addShuffle(*cat33_1->getOutput(0));\n    shuffle33_1->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split33_1_0 = network->addSlice(\n            *shuffle33_1->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    
nvinfer1::ISliceLayer* split33_1_1 = network->addSlice(\n            *shuffle33_1->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl33_1 =\n            DFL(network, weightMap, *split33_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.33.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor33_dfl_1[] = {dfl33_1->getOutput(0), split33_1_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat33_dfl_1 = network->addConcatenation(inputTensor33_dfl_1, 2);\n\n    // det2\n    nvinfer1::IShuffleLayer* shuffle33_2 = network->addShuffle(*cat33_2->getOutput(0));\n    shuffle33_2->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split33_2_0 = network->addSlice(\n            *shuffle33_2->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split33_2_1 = network->addSlice(\n            *shuffle33_2->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl33_2 =\n            DFL(network, weightMap, *split33_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.33.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor33_dfl_2[] = {dfl33_2->getOutput(0), split33_2_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat33_dfl_2 = network->addConcatenation(inputTensor33_dfl_2, 2);\n\n    // det3\n    nvinfer1::IShuffleLayer* shuffle33_3 = network->addShuffle(*cat33_3->getOutput(0));\n    shuffle33_3->setReshapeDimensions(nvinfer1::Dims2{64 + kNumClass, (kInputH / strides[3]) * (kInputW / strides[3])});\n    
nvinfer1::ISliceLayer* split33_3_0 = network->addSlice(\n            *shuffle33_3->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[3]) * (kInputW / strides[3])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split33_3_1 = network->addSlice(\n            *shuffle33_3->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kNumClass, (kInputH / strides[3]) * (kInputW / strides[3])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl33_3 =\n            DFL(network, weightMap, *split33_3_0->getOutput(0), 4, (kInputH / strides[3]) * (kInputW / strides[3]), 1,\n                1, 0, \"model.33.dfl.conv.weight\");\n    nvinfer1::ITensor* inputTensor33_dfl_3[] = {dfl33_3->getOutput(0), split33_3_1->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat33_dfl_3 = network->addConcatenation(inputTensor33_dfl_3, 2);\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(\n            network, std::vector<nvinfer1::IConcatenationLayer*>{cat33_dfl_0, cat33_dfl_1, cat33_dfl_2, cat33_dfl_3},\n            strides, stridesLength, kNumClass, false, false, false);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n\nnvinfer1::IHostMemory* buildEngineYolov8Obb(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,\n                                            nvinfer1::DataType dt, const std::string& wts_path, float& gd, float& gw,\n                                            int& max_channels) {\n    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);\n    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    /*******************************************************************************************************\n    ******************************************  YOLOV8 INPUT  **********************************************\n    *******************************************************************************************************/\n    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});\n    assert(data);\n\n    /*******************************************************************************************************\n    *****************************************  YOLOV8 BACKBONE  ********************************************\n    
*******************************************************************************************************/\n    nvinfer1::IElementWiseLayer* conv0 =\n            convBnSiLU(network, weightMap, *data, get_width(64, gw, max_channels), 3, 2, 1, \"model.0\");\n    nvinfer1::IElementWiseLayer* conv1 =\n            convBnSiLU(network, weightMap, *conv0->getOutput(0), get_width(128, gw, max_channels), 3, 2, 1, \"model.1\");\n    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), get_width(128, gw, max_channels),\n                                             get_width(128, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.2\");\n    nvinfer1::IElementWiseLayer* conv3 =\n            convBnSiLU(network, weightMap, *conv2->getOutput(0), get_width(256, gw, max_channels), 3, 2, 1, \"model.3\");\n    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), get_width(256, gw, max_channels),\n                                             get_width(256, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.4\");\n    nvinfer1::IElementWiseLayer* conv5 =\n            convBnSiLU(network, weightMap, *conv4->getOutput(0), get_width(512, gw, max_channels), 3, 2, 1, \"model.5\");\n    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, *conv5->getOutput(0), get_width(512, gw, max_channels),\n                                             get_width(512, gw, max_channels), get_depth(6, gd), true, 0.5, \"model.6\");\n    nvinfer1::IElementWiseLayer* conv7 =\n            convBnSiLU(network, weightMap, *conv6->getOutput(0), get_width(1024, gw, max_channels), 3, 2, 1, \"model.7\");\n    nvinfer1::IElementWiseLayer* conv8 =\n            C2F(network, weightMap, *conv7->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), true, 0.5, \"model.8\");\n    nvinfer1::IElementWiseLayer* conv9 =\n            SPPF(network, weightMap, *conv8->getOutput(0), get_width(1024, 
gw, max_channels),\n                 get_width(1024, gw, max_channels), 5, \"model.9\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV8 HEAD  ********************************************\n    *******************************************************************************************************/\n    float scale[] = {1.0, 2.0, 2.0};\n    nvinfer1::IResizeLayer* upsample10 = network->addResize(*conv9->getOutput(0));\n    assert(upsample10);\n    upsample10->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample10->setScales(scale, 3);\n\n    nvinfer1::ITensor* inputTensor11[] = {upsample10->getOutput(0), conv6->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat11 = network->addConcatenation(inputTensor11, 2);\n    nvinfer1::IElementWiseLayer* conv12 =\n            C2F(network, weightMap, *cat11->getOutput(0), get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.12\");\n\n    nvinfer1::IResizeLayer* upsample13 = network->addResize(*conv12->getOutput(0));\n    assert(upsample13);\n    upsample13->setResizeMode(nvinfer1::ResizeMode::kNEAREST);\n    upsample13->setScales(scale, 3);\n\n    nvinfer1::ITensor* inputTensor14[] = {upsample13->getOutput(0), conv4->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat14 = network->addConcatenation(inputTensor14, 2);\n    nvinfer1::IElementWiseLayer* conv15 =\n            C2F(network, weightMap, *cat14->getOutput(0), get_width(256, gw, max_channels),\n                get_width(256, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.15\");\n    nvinfer1::IElementWiseLayer* conv16 = convBnSiLU(network, weightMap, *conv15->getOutput(0),\n                                                     get_width(256, gw, max_channels), 3, 2, 1, \"model.16\");\n    nvinfer1::ITensor* inputTensor17[] = {conv16->getOutput(0), 
conv12->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat17 = network->addConcatenation(inputTensor17, 2);\n    nvinfer1::IElementWiseLayer* conv18 =\n            C2F(network, weightMap, *cat17->getOutput(0), get_width(512, gw, max_channels),\n                get_width(512, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.18\");\n    nvinfer1::IElementWiseLayer* conv19 = convBnSiLU(network, weightMap, *conv18->getOutput(0),\n                                                     get_width(512, gw, max_channels), 3, 2, 1, \"model.19\");\n    nvinfer1::ITensor* inputTensor20[] = {conv19->getOutput(0), conv9->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat20 = network->addConcatenation(inputTensor20, 2);\n    nvinfer1::IElementWiseLayer* conv21 =\n            C2F(network, weightMap, *cat20->getOutput(0), get_width(1024, gw, max_channels),\n                get_width(1024, gw, max_channels), get_depth(3, gd), false, 0.5, \"model.21\");\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV8 OUTPUT  ******************************************\n    *******************************************************************************************************/\n    int base_in_channel = (gw == 1.25) ? 80 : 64;\n    int base_out_channel = (gw == 0.25) ? 
std::max(64, std::min(kObbNumClass, 100)) : get_width(256, gw, max_channels);\n\n    // output0\n    nvinfer1::IElementWiseLayer* conv22_cv2_0_0 =\n            convBnSiLU(network, weightMap, *conv15->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.0.0\");\n\n    nvinfer1::IElementWiseLayer* conv22_cv2_0_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_0_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.0.1\");\n\n    nvinfer1::IConvolutionLayer* conv22_cv2_0_2 =\n            network->addConvolutionNd(*conv22_cv2_0_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv2.0.2.weight\"], weightMap[\"model.22.cv2.0.2.bias\"]);\n    conv22_cv2_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv2_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n\n    nvinfer1::IElementWiseLayer* conv22_cv3_0_0 =\n            convBnSiLU(network, weightMap, *conv15->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.0.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_0_1 = convBnSiLU(network, weightMap, *conv22_cv3_0_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.0.1\");\n\n    nvinfer1::IConvolutionLayer* conv22_cv3_0_2 =\n            network->addConvolutionNd(*conv22_cv3_0_1->getOutput(0), kObbNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv3.0.2.weight\"], weightMap[\"model.22.cv3.0.2.bias\"]);\n    conv22_cv3_0_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv3_0_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor22_0[] = {conv22_cv2_0_2->getOutput(0), conv22_cv3_0_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_0 = network->addConcatenation(inputTensor22_0, 2);\n\n    // output1\n    nvinfer1::IElementWiseLayer* conv22_cv2_1_0 =\n            convBnSiLU(network, weightMap, *conv18->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.1.0\");\n  
  nvinfer1::IElementWiseLayer* conv22_cv2_1_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_1_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.1.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv2_1_2 =\n            network->addConvolutionNd(*conv22_cv2_1_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv2.1.2.weight\"], weightMap[\"model.22.cv2.1.2.bias\"]);\n    conv22_cv2_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv2_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::IElementWiseLayer* conv22_cv3_1_0 =\n            convBnSiLU(network, weightMap, *conv18->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.1.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_1_1 = convBnSiLU(network, weightMap, *conv22_cv3_1_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.1.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_1_2 =\n            network->addConvolutionNd(*conv22_cv3_1_1->getOutput(0), kObbNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv3.1.2.weight\"], weightMap[\"model.22.cv3.1.2.bias\"]);\n    conv22_cv3_1_2->setStrideNd(nvinfer1::DimsHW{1, 1});\n    conv22_cv3_1_2->setPaddingNd(nvinfer1::DimsHW{0, 0});\n    nvinfer1::ITensor* inputTensor22_1[] = {conv22_cv2_1_2->getOutput(0), conv22_cv3_1_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_1 = network->addConcatenation(inputTensor22_1, 2);\n\n    // output2\n    nvinfer1::IElementWiseLayer* conv22_cv2_2_0 =\n            convBnSiLU(network, weightMap, *conv21->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.2.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv2_2_1 =\n            convBnSiLU(network, weightMap, *conv22_cv2_2_0->getOutput(0), base_in_channel, 3, 1, 1, \"model.22.cv2.2.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv2_2_2 =\n            
network->addConvolutionNd(*conv22_cv2_2_1->getOutput(0), 64, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv2.2.2.weight\"], weightMap[\"model.22.cv2.2.2.bias\"]);\n    nvinfer1::IElementWiseLayer* conv22_cv3_2_0 =\n            convBnSiLU(network, weightMap, *conv21->getOutput(0), base_out_channel, 3, 1, 1, \"model.22.cv3.2.0\");\n    nvinfer1::IElementWiseLayer* conv22_cv3_2_1 = convBnSiLU(network, weightMap, *conv22_cv3_2_0->getOutput(0),\n                                                             base_out_channel, 3, 1, 1, \"model.22.cv3.2.1\");\n    nvinfer1::IConvolutionLayer* conv22_cv3_2_2 =\n            network->addConvolutionNd(*conv22_cv3_2_1->getOutput(0), kObbNumClass, nvinfer1::DimsHW{1, 1},\n                                      weightMap[\"model.22.cv3.2.2.weight\"], weightMap[\"model.22.cv3.2.2.bias\"]);\n    nvinfer1::ITensor* inputTensor22_2[] = {conv22_cv2_2_2->getOutput(0), conv22_cv3_2_2->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_2 = network->addConcatenation(inputTensor22_2, 2);\n\n    /*******************************************************************************************************\n    *********************************************  YOLOV8 DETECT  ******************************************\n    *******************************************************************************************************/\n\n    nvinfer1::IElementWiseLayer* conv_layers[] = {conv3, conv5, conv7};\n    int strides[sizeof(conv_layers) / sizeof(conv_layers[0])];\n    calculateStrides(conv_layers, sizeof(conv_layers) / sizeof(conv_layers[0]), kInputH, strides);\n    int stridesLength = sizeof(strides) / sizeof(int);\n\n    nvinfer1::IShuffleLayer* shuffle22_0 = network->addShuffle(*cat22_0->getOutput(0));\n    shuffle22_0->setReshapeDimensions(\n            nvinfer1::Dims2{64 + kObbNumClass, (kInputH / strides[0]) * (kInputW / strides[0])});\n\n    nvinfer1::ISliceLayer* split22_0_0 = network->addSlice(\n            
*shuffle22_0->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_0_1 = network->addSlice(\n            *shuffle22_0->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kObbNumClass, (kInputH / strides[0]) * (kInputW / strides[0])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_0 =\n            DFL(network, weightMap, *split22_0_0->getOutput(0), 4, (kInputH / strides[0]) * (kInputW / strides[0]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n\n    nvinfer1::IShuffleLayer* shuffle22_1 = network->addShuffle(*cat22_1->getOutput(0));\n    shuffle22_1->setReshapeDimensions(\n            nvinfer1::Dims2{64 + kObbNumClass, (kInputH / strides[1]) * (kInputW / strides[1])});\n    nvinfer1::ISliceLayer* split22_1_0 = network->addSlice(\n            *shuffle22_1->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::ISliceLayer* split22_1_1 = network->addSlice(\n            *shuffle22_1->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kObbNumClass, (kInputH / strides[1]) * (kInputW / strides[1])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_1 =\n            DFL(network, weightMap, *split22_1_0->getOutput(0), 4, (kInputH / strides[1]) * (kInputW / strides[1]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n\n    nvinfer1::IShuffleLayer* shuffle22_2 = network->addShuffle(*cat22_2->getOutput(0));\n    shuffle22_2->setReshapeDimensions(\n            nvinfer1::Dims2{64 + kObbNumClass, (kInputH / strides[2]) * (kInputW / strides[2])});\n    nvinfer1::ISliceLayer* split22_2_0 = network->addSlice(\n            *shuffle22_2->getOutput(0), nvinfer1::Dims2{0, 0},\n            nvinfer1::Dims2{64, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n   
 nvinfer1::ISliceLayer* split22_2_1 = network->addSlice(\n            *shuffle22_2->getOutput(0), nvinfer1::Dims2{64, 0},\n            nvinfer1::Dims2{kObbNumClass, (kInputH / strides[2]) * (kInputW / strides[2])}, nvinfer1::Dims2{1, 1});\n    nvinfer1::IShuffleLayer* dfl22_2 =\n            DFL(network, weightMap, *split22_2_0->getOutput(0), 4, (kInputH / strides[2]) * (kInputW / strides[2]), 1,\n                1, 0, \"model.22.dfl.conv.weight\");\n\n    // det0\n    auto shuffle_conv15 = cv4_conv_combined(network, weightMap, *conv15->getOutput(0), \"model.22.cv4.0\",\n                                            (kInputH / strides[0]) * (kInputW / strides[0]), gw, \"obb\");\n    nvinfer1::ITensor* inputTensor22_dfl_0[] = {dfl22_0->getOutput(0), split22_0_1->getOutput(0),\n                                                shuffle_conv15->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_0 = network->addConcatenation(inputTensor22_dfl_0, 3);\n\n    // det1\n    auto shuffle_conv18 = cv4_conv_combined(network, weightMap, *conv18->getOutput(0), \"model.22.cv4.1\",\n                                            (kInputH / strides[1]) * (kInputW / strides[1]), gw, \"obb\");\n    nvinfer1::ITensor* inputTensor22_dfl_1[] = {dfl22_1->getOutput(0), split22_1_1->getOutput(0),\n                                                shuffle_conv18->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_1 = network->addConcatenation(inputTensor22_dfl_1, 3);\n\n    // det2\n    auto shuffle_conv21 = cv4_conv_combined(network, weightMap, *conv21->getOutput(0), \"model.22.cv4.2\",\n                                            (kInputH / strides[2]) * (kInputW / strides[2]), gw, \"obb\");\n    nvinfer1::ITensor* inputTensor22_dfl_2[] = {dfl22_2->getOutput(0), split22_2_1->getOutput(0),\n                                                shuffle_conv21->getOutput(0)};\n    nvinfer1::IConcatenationLayer* cat22_dfl_2 = network->addConcatenation(inputTensor22_dfl_2, 3);\n\n    
nvinfer1::IPluginV2Layer* yolo =\n            addYoLoLayer(network, std::vector<nvinfer1::IConcatenationLayer*>{cat22_dfl_0, cat22_dfl_1, cat22_dfl_2},\n                         strides, stridesLength, kObbNumClass, false, false, true);\n\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform supports int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator = new Int8EntropyCalibrator2(1, kInputW, kInputH, kInputQuantizationFolder, \"int8calib.table\",\n                                                  kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    nvinfer1::IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n    return serialized_model;\n}\n"
  },
  {
    "path": "yolov8/src/postprocess.cpp",
    "content": "#include \"postprocess.h\"\n#include <algorithm>\n#include <iostream>  // Include this header for printing\n#include \"utils.h\"\n\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n\n    if (r_h > r_w) {\n        l = bbox[0];\n        r = bbox[2];\n        t = bbox[1] - (kInputH - r_w * img.rows) / 2;\n        b = bbox[3] - (kInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - (kInputW - r_h * img.cols) / 2;\n        r = bbox[2] - (kInputW - r_h * img.cols) / 2;\n        t = bbox[1];\n        b = bbox[3];\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\ncv::Rect get_rect_adapt_landmark(cv::Mat& img, float bbox[4], float lmk[kNumberOfPoints * 3]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] / r_w;\n        r = bbox[2] / r_w;\n        t = (bbox[1] - (kInputH - r_w * img.rows) / 2) / r_w;\n        b = (bbox[3] - (kInputH - r_w * img.rows) / 2) / r_w;\n        for (int i = 0; i < kNumberOfPoints * 3; i += 3) {\n            lmk[i] /= r_w;\n            lmk[i + 1] = (lmk[i + 1] - (kInputH - r_w * img.rows) / 2) / r_w;\n            // lmk[i + 2]\n        }\n    } else {\n        l = (bbox[0] - (kInputW - r_h * img.cols) / 2) / r_h;\n        r = (bbox[2] - (kInputW - r_h * img.cols) / 2) / r_h;\n        t = bbox[1] / r_h;\n        b = bbox[3] / r_h;\n        for (int i = 0; i < kNumberOfPoints 
* 3; i += 3) {\n            lmk[i] = (lmk[i] - (kInputW - r_h * img.cols) / 2) / r_h;\n            lmk[i + 1] /= r_h;\n            // lmk[i + 2]\n        }\n    }\n    l = std::max(0.0f, l);\n    t = std::max(0.0f, t);\n    int width = std::max(0, std::min(int(round(r - l)), img.cols - int(round(l))));\n    int height = std::max(0, std::min(int(round(b - t)), img.rows - int(round(t))));\n\n    return cv::Rect(int(round(l)), int(round(t)), width, height);\n}\n\nstatic float iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n            (std::max)(lbox[0], rbox[0]),\n            (std::min)(lbox[2], rbox[2]),\n            (std::max)(lbox[1], rbox[1]),\n            (std::min)(lbox[3], rbox[3]),\n    };\n\n    if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS = (interBox[1] - interBox[0]) * (interBox[3] - interBox[2]);\n    float unionBoxS = (lbox[2] - lbox[0]) * (lbox[3] - lbox[1]) + (rbox[2] - rbox[0]) * (rbox[3] - rbox[1]) - interBoxS;\n    return interBoxS / unionBoxS;\n}\n\nstatic bool cmp(const Detection& a, const Detection& b) {\n    if (a.conf == b.conf) {\n        return a.bbox[0] < b.bbox[0];\n    }\n    return a.conf > b.conf;\n}\n\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n        if (output[1 + det_size * i + 4] <= conf_thresh || isnan(output[1 + det_size * i + 4]))\n            continue;\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        if (m.count(det.class_id) == 0)\n            m.emplace(det.class_id, std::vector<Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m 
= 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\nvoid batch_nms(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        nms(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n    }\n}\n\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count) {\n    Detection det;\n    for (int i = 0; i < count; i++) {\n        int basic_pos = 1 + i * bbox_element;\n        int keep_flag = decode_ptr_host[basic_pos + 6];\n        if (keep_flag == 1) {\n            det.bbox[0] = decode_ptr_host[basic_pos + 0];\n            det.bbox[1] = decode_ptr_host[basic_pos + 1];\n            det.bbox[2] = decode_ptr_host[basic_pos + 2];\n            det.bbox[3] = decode_ptr_host[basic_pos + 3];\n            det.conf = decode_ptr_host[basic_pos + 4];\n            det.class_id = decode_ptr_host[basic_pos + 5];\n            res.push_back(det);\n        }\n    }\n}\n\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch) {\n    res_batch.resize(batch_size);\n    int count = static_cast<int>(*decode_ptr_host);\n    count = std::min(count, kMaxNumOutputBbox);\n    for (int i = 0; i < batch_size; i++) {\n        auto& img = const_cast<cv::Mat&>(img_batch[i]);\n        process_decode_ptr_host(res_batch[i], &decode_ptr_host[i * count], bbox_element, img, count);\n    
}\n}\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect(img, res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n    }\n}\n\nvoid draw_bbox_keypoints_line(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    const std::vector<std::pair<int, int>> skeleton_pairs = {\n            {0, 1}, {0, 2},  {0, 5}, {0, 6},  {1, 2},   {1, 3},   {2, 4},   {5, 6},   {5, 7},  {5, 11},\n            {6, 8}, {6, 12}, {7, 9}, {8, 10}, {11, 12}, {11, 13}, {12, 14}, {13, 15}, {14, 16}};\n\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect_adapt_landmark(img, res[j].bbox, res[j].keypoints);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n\n            for (int k = 0; k < kNumberOfPoints * 3; k += 3) {\n                if (res[j].keypoints[k + 2] > 0.5) {\n                    cv::circle(img, cv::Point((int)res[j].keypoints[k], (int)res[j].keypoints[k + 1]), 3,\n                               cv::Scalar(0, 0x27, 0xC1), -1);\n                }\n            }\n\n            for (const auto& bone : skeleton_pairs) {\n                int kp1_idx = bone.first * 3;\n                int kp2_idx = bone.second * 3;\n                
if (res[j].keypoints[kp1_idx + 2] > 0.5 && res[j].keypoints[kp2_idx + 2] > 0.5) {\n                    cv::Point p1((int)res[j].keypoints[kp1_idx], (int)res[j].keypoints[kp1_idx + 1]);\n                    cv::Point p2((int)res[j].keypoints[kp2_idx], (int)res[j].keypoints[kp2_idx + 1]);\n                    cv::line(img, p1, p2, cv::Scalar(0, 0x27, 0xC1), 2);\n                }\n            }\n        }\n    }\n}\n\ncv::Mat scale_mask(cv::Mat mask, cv::Mat img) {\n    int x, y, w, h;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = kInputW;\n        h = r_w * img.rows;\n        x = 0;\n        y = (kInputH - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = kInputH;\n        x = (kInputW - w) / 2;\n        y = 0;\n    }\n    cv::Rect r(x, y, w, h);\n    cv::Mat res;\n    cv::resize(mask(r), res, img.size());\n    return res;\n}\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks,\n                    std::unordered_map<int, std::string>& labels_map) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A, 0x92CC17,\n                                           0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < dets.size(); i++) {\n        cv::Mat img_mask = scale_mask(masks[i], img);\n        auto color = colors[(int)dets[i].class_id % colors.size()];\n        auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n\n        cv::Rect r = get_rect(img, dets[i].bbox);\n        for (int x = r.x; x < r.x + r.width; x++) {\n            for (int y = r.y; y < r.y + r.height; y++) {\n                float val = img_mask.at<float>(y, x);\n                if (val <= 0.5)\n                    continue;\n                
img.at<cv::Vec3b>(y, x)[0] = img.at<cv::Vec3b>(y, x)[0] / 2 + bgr[0] / 2;\n                img.at<cv::Vec3b>(y, x)[1] = img.at<cv::Vec3b>(y, x)[1] / 2 + bgr[1] / 2;\n                img.at<cv::Vec3b>(y, x)[2] = img.at<cv::Vec3b>(y, x)[2] / 2 + bgr[2] / 2;\n            }\n        }\n\n        cv::rectangle(img, r, bgr, 2);\n\n        // Get the size of the text\n        cv::Size textSize =\n                cv::getTextSize(labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                                cv::FONT_HERSHEY_PLAIN, 1.2, 2, NULL);\n        // Set the top left corner of the rectangle\n        cv::Point topLeft(r.x, r.y - textSize.height);\n\n        // Set the bottom right corner of the rectangle\n        cv::Point bottomRight(r.x + textSize.width, r.y + textSize.height);\n\n        // Set the thickness of the rectangle lines\n        int lineThickness = 2;\n\n        // Draw the rectangle on the image\n        cv::rectangle(img, topLeft, bottomRight, bgr, -1);\n\n        cv::putText(img, labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                    cv::Point(r.x, r.y + 4), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar::all(0xFF), 2);\n    }\n}\n\nvoid process_decode_ptr_host_obb(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element,\n                                 cv::Mat& img, int count) {\n    Detection det;\n    for (int i = 0; i < count; i++) {\n        int basic_pos = 1 + i * bbox_element;\n        int keep_flag = decode_ptr_host[basic_pos + 6];\n        if (keep_flag == 1) {\n            det.bbox[0] = decode_ptr_host[basic_pos + 0];\n            det.bbox[1] = decode_ptr_host[basic_pos + 1];\n            det.bbox[2] = decode_ptr_host[basic_pos + 2];\n            det.bbox[3] = decode_ptr_host[basic_pos + 3];\n            det.conf = decode_ptr_host[basic_pos + 4];\n            det.class_id = decode_ptr_host[basic_pos + 5];\n            det.angle = 
decode_ptr_host[basic_pos + 7];\n            res.push_back(det);\n        }\n    }\n}\n\nvoid batch_process_obb(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                       int bbox_element, const std::vector<cv::Mat>& img_batch) {\n    res_batch.resize(batch_size);\n    int count = static_cast<int>(*decode_ptr_host);\n    count = std::min(count, kMaxNumOutputBbox);\n    for (int i = 0; i < batch_size; i++) {\n        auto& img = const_cast<cv::Mat&>(img_batch[i]);\n        process_decode_ptr_host_obb(res_batch[i], &decode_ptr_host[i * count], bbox_element, img, count);\n    }\n}\n\nstd::tuple<float, float, float> convariance_matrix(Detection res) {\n    float w = res.bbox[2];\n    float h = res.bbox[3];\n\n    float a = w * w / 12.0;\n    float b = h * h / 12.0;\n    float c = res.angle;\n\n    float cos_r = std::cos(c);\n    float sin_r = std::sin(c);\n\n    float cos_r2 = cos_r * cos_r;\n    float sin_r2 = sin_r * sin_r;\n\n    float a_val = a * cos_r2 + b * sin_r2;\n    float b_val = a * sin_r2 + b * cos_r2;\n    float c_val = (a - b) * cos_r * sin_r;\n\n    return std::make_tuple(a_val, b_val, c_val);\n}\n\nstatic float probiou(const Detection& res1, const Detection& res2, float eps = 1e-7) {\n    // Calculate the prob iou between oriented bounding boxes, https://arxiv.org/pdf/2106.06072v1.pdf.\n    float a1, b1, c1, a2, b2, c2;\n    // Initialize the tuples directly; copying the still-uninitialized floats into them would read indeterminate values.\n    std::tuple<float, float, float> matrix1 = convariance_matrix(res1);\n    std::tuple<float, float, float> matrix2 = convariance_matrix(res2);\n    a1 = std::get<0>(matrix1);\n    b1 = std::get<1>(matrix1);\n    c1 = std::get<2>(matrix1);\n    a2 = std::get<0>(matrix2);\n    b2 = std::get<1>(matrix2);\n    c2 = std::get<2>(matrix2);\n\n    float x1 = res1.bbox[0], y1 = res1.bbox[1];\n    float x2 = res2.bbox[0], y2 = res2.bbox[1];\n\n    float t1 = ((a1 + a2) * std::pow(y1 - y2, 2) + (b1 + b2) * std::pow(x1 - x2, 2)) 
/\n               ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2) + eps);\n    float t2 = ((c1 + c2) * (x2 - x1) * (y1 - y2)) / ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2) + eps);\n    float t3 = std::log(\n            ((a1 + a2) * (b1 + b2) - std::pow(c1 + c2, 2)) /\n                    (4 * std::sqrt(std::max(a1 * b1 - c1 * c1, 0.0f)) * std::sqrt(std::max(a2 * b2 - c2 * c2, 0.0f)) +\n                     eps) +\n            eps);\n\n    float bd = 0.25f * t1 + 0.5f * t2 + 0.5f * t3;\n    bd = std::max(std::min(bd, 100.0f), eps);\n    float hd = std::sqrt(1.0 - std::exp(-bd) + eps);\n\n    return 1 - hd;\n}\n\nvoid nms_obb(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n\n    for (int i = 0; i < output[0]; i++) {\n\n        if (output[1 + det_size * i + 4] <= conf_thresh)\n            continue;\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        if (m.count(det.class_id) == 0)\n            m.emplace(det.class_id, std::vector<Detection>());\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (probiou(item, dets[n]) >= nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\nvoid batch_nms_obb(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n                   float conf_thresh, float nms_thresh) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        nms_obb(res_batch[i], 
&output[i * output_size], conf_thresh, nms_thresh);\n    }\n}\n\nstatic std::vector<cv::Point> get_corner(cv::Mat& img, const Detection& box) {\n    float cos_value, sin_value;\n\n    // Calculate center point and width/height\n    float x1 = box.bbox[0];\n    float y1 = box.bbox[1];\n    float w = box.bbox[2];\n    float h = box.bbox[3];\n    float angle = box.angle * 180.0f / CV_PI;  // Convert radians to degrees\n\n    // Print original angle\n    std::cout << \"Original angle: \" << angle << std::endl;\n\n    // Swap width and height if height is greater than or equal to width\n    if (h >= w) {\n        std::swap(w, h);\n        angle = fmod(angle + 90.0f, 180.0f);  // Adjust angle to be within [0, 180)\n    }\n\n    // Ensure the angle is between 0 and 180 degrees\n    if (angle < 0) {\n        angle += 360.0f;  // Convert to positive value\n    }\n    if (angle > 180.0f) {\n        angle -= 180.0f;  // Subtract 180 from angles greater than 180\n    }\n\n    // Print adjusted angle\n    std::cout << \"Adjusted angle: \" << angle << std::endl;\n\n    // Convert to normal angle value\n    float normal_angle = fmod(angle, 180.0f);\n    if (normal_angle < 0) {\n        normal_angle += 180.0f;  // Ensure it's a positive value\n    }\n\n    // Print normal angle value\n    std::cout << \"Normal angle: \" << normal_angle << std::endl;\n\n    cos_value = std::cos(angle * CV_PI / 180.0f);  // Convert to radians\n    sin_value = std::sin(angle * CV_PI / 180.0f);\n\n    // Calculate each corner point\n    float l = x1 - w / 2;  // Left boundary\n    float r = x1 + w / 2;  // Right boundary\n    float t = y1 - h / 2;  // Top boundary\n    float b = y1 + h / 2;  // Bottom boundary\n\n    // Use get_rect function to scale the coordinates\n    float bbox[4] = {l, t, r, b};\n    cv::Rect rect = get_rect(img, bbox);\n\n    float x_ = (rect.x + rect.x + rect.width) / 2;   // Center x\n    float y_ = (rect.y + rect.y + rect.height) / 2;  // Center y\n    float width = 
rect.width;                        // Width\n    float height = rect.height;                      // Height\n\n    // Calculate each corner point\n    std::vector<cv::Point> corner_points(4);\n    float vec1x = width / 2 * cos_value;\n    float vec1y = width / 2 * sin_value;\n    float vec2x = -height / 2 * sin_value;\n    float vec2y = height / 2 * cos_value;\n\n    corner_points[0] = cv::Point(int(round(x_ + vec1x + vec2x)), int(round(y_ + vec1y + vec2y)));  // Top-left corner\n    corner_points[1] = cv::Point(int(round(x_ + vec1x - vec2x)), int(round(y_ + vec1y - vec2y)));  // Top-right corner\n    corner_points[2] =\n            cv::Point(int(round(x_ - vec1x - vec2x)), int(round(y_ - vec1y - vec2y)));  // Bottom-right corner\n    corner_points[3] = cv::Point(int(round(x_ - vec1x + vec2x)), int(round(y_ - vec1y + vec2y)));  // Bottom-left corner\n\n    // Clamp each corner point so that it stays inside the image boundaries\n    for (auto& point : corner_points) {\n        point.x = std::max(0, std::min(point.x, img.cols - 1));\n        point.y = std::max(0, std::min(point.y, img.rows - 1));\n    }\n\n    return corner_points;\n}\n\nvoid draw_bbox_obb(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A, 0x92CC17,\n                                           0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        auto& img = img_batch[i];\n        for (auto& obj : res) {\n            auto color = colors[(int)obj.class_id % colors.size()];\n            auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n            auto corner_points = get_corner(img, obj);\n            
cv::polylines(img, std::vector<std::vector<cv::Point>>{corner_points}, true, bgr, 1);\n\n            auto text = (std::to_string((int)(obj.class_id)) + \":\" + to_string_with_precision(obj.conf));\n            cv::Size textsize = cv::getTextSize(text, 0, 0.3, 1, nullptr);\n\n            int width = textsize.width;\n            int height = textsize.height;\n            bool outside = (corner_points[0].y - height >= 3) ? true : false;\n            cv::Point p1(corner_points[0].x, corner_points[0].y), p2;\n            p2.x = corner_points[0].x + width;\n            if (outside) {\n                p2.y = corner_points[0].y - height - 3;\n            } else {\n                p2.y = corner_points[0].y + height + 3;\n            }\n            cv::rectangle(img, p1, p2, bgr, -1, cv::LINE_AA);\n            cv::putText(\n                    img, text,\n                    cv::Point(corner_points[0].x, (outside ? corner_points[0].y - 2 : corner_points[0].y + height + 2)),\n                    0, 0.3, cv::Scalar::all(255), 1, cv::LINE_AA);\n        }\n    }\n}\n"
  },
  {
    "path": "yolov8/src/postprocess.cu",
    "content": "//\n// Created by lindsay on 23-7-17.\n//\n#include \"postprocess.h\"\n#include \"types.h\"\n\nstatic __global__ void decode_kernel_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray,\n                                         int max_objects) {\n    float count = predict[0];\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    if (position >= count)\n        return;\n\n    float* pitem = predict + 1 + position * (sizeof(Detection) / sizeof(float));\n    int index = atomicAdd(parray, 1);\n    if (index >= max_objects)\n        return;\n\n    float confidence = pitem[4];\n\n    if (confidence < confidence_threshold)\n        return;\n    //[center_x center_y w h conf class_id  mask[32] keypoints[51] angle]\n    float cx = pitem[0];\n    float cy = pitem[1];\n    float width = pitem[2];\n    float height = pitem[3];\n    float label = pitem[5];\n    float angle = pitem[89];\n\n    float* pout_item = parray + 1 + index * bbox_element;\n    *pout_item++ = cx;\n    *pout_item++ = cy;\n    *pout_item++ = width;\n    *pout_item++ = height;\n    *pout_item++ = confidence;\n    *pout_item++ = label;\n    *pout_item++ = 1;  // 1 = keep, 0 = ignore\n    *pout_item++ = angle;\n}\n\nstatic __global__ void decode_kernel(float* predict, int num_bboxes, float confidence_threshold, float* parray,\n                                     int max_objects) {\n    float count = predict[0];\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    if (position >= count)\n        return;\n\n    float* pitem = predict + 1 + position * (sizeof(Detection) / sizeof(float));\n    int index = atomicAdd(parray, 1);\n    if (index >= max_objects)\n        return;\n\n    float confidence = pitem[4];\n    if (confidence < confidence_threshold)\n        return;\n\n    float left = pitem[0];\n    float top = pitem[1];\n    float right = pitem[2];\n    float bottom = pitem[3];\n    float label = pitem[5];\n\n    float* pout_item = parray + 1 
+ index * bbox_element;\n    *pout_item++ = left;\n    *pout_item++ = top;\n    *pout_item++ = right;\n    *pout_item++ = bottom;\n    *pout_item++ = confidence;\n    *pout_item++ = label;\n    *pout_item++ = 1;  // 1 = keep, 0 = ignore\n}\n\nstatic __device__ float box_iou(float aleft, float atop, float aright, float abottom, float bleft, float btop,\n                                float bright, float bbottom) {\n    float cleft = max(aleft, bleft);\n    float ctop = max(atop, btop);\n    float cright = min(aright, bright);\n    float cbottom = min(abottom, bbottom);\n    float c_area = max(cright - cleft, 0.0f) * max(cbottom - ctop, 0.0f);\n    if (c_area == 0.0f)\n        return 0.0f;\n\n    float a_area = max(0.0f, aright - aleft) * max(0.0f, abottom - atop);\n    float b_area = max(0.0f, bright - bleft) * max(0.0f, bbottom - btop);\n    return c_area / (a_area + b_area - c_area);\n}\n\nstatic __global__ void nms_kernel(float* bboxes, int max_objects, float threshold) {\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    int count = min(static_cast<int>(bboxes[0]), max_objects);\n    if (position >= count)\n        return;\n\n    float* pcurrent = bboxes + 1 + position * bbox_element;\n    for (int i = 0; i < count; ++i) {\n        float* pitem = bboxes + 1 + i * bbox_element;\n        if (i == position || pcurrent[5] != pitem[5])\n            continue;\n        if (pitem[4] >= pcurrent[4]) {\n            if (pitem[4] == pcurrent[4] && i < position)\n                continue;\n            float iou =\n                    box_iou(pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3], pitem[0], pitem[1], pitem[2], pitem[3]);\n            if (iou > threshold) {\n                pcurrent[6] = 0;\n                return;\n            }\n        }\n    }\n}\n\nstatic __device__ void convariance_matrix(float w, float h, float r, float& a, float& b, float& c) {\n    float a_val = w * w / 12.0f;\n    float b_val = h * h / 12.0f;\n    float cos_r = cosf(r);\n 
   float sin_r = sinf(r);\n\n    a = a_val * cos_r * cos_r + b_val * sin_r * sin_r;\n    b = a_val * sin_r * sin_r + b_val * cos_r * cos_r;\n    c = (a_val - b_val) * sin_r * cos_r;\n}\n\nstatic __device__ float box_probiou(float cx1, float cy1, float w1, float h1, float r1, float cx2, float cy2, float w2,\n                                    float h2, float r2, float eps = 1e-7) {\n\n    // Calculate the prob iou between oriented bounding boxes, https://arxiv.org/pdf/2106.06072v1.pdf.\n    float a1, b1, c1, a2, b2, c2;\n    convariance_matrix(w1, h1, r1, a1, b1, c1);\n    convariance_matrix(w2, h2, r2, a2, b2, c2);\n\n    float t1 = ((a1 + a2) * powf(cy1 - cy2, 2) + (b1 + b2) * powf(cx1 - cx2, 2)) /\n               ((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2) + eps);\n    float t2 = ((c1 + c2) * (cx2 - cx1) * (cy1 - cy2)) / ((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2) + eps);\n    float t3 = logf(((a1 + a2) * (b1 + b2) - powf(c1 + c2, 2)) /\n                            (4 * sqrtf(fmaxf(a1 * b1 - c1 * c1, 0.0f)) * sqrtf(fmaxf(a2 * b2 - c2 * c2, 0.0f)) + eps) +\n                    eps);\n    float bd = 0.25f * t1 + 0.5f * t2 + 0.5f * t3;\n    bd = fmaxf(fminf(bd, 100.0f), eps);\n    float hd = sqrtf(1.0f - expf(-bd) + eps);\n    return 1 - hd;\n}\n\nstatic __global__ void nms_kernel_obb(float* bboxes, int max_objects, float threshold) {\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    int count = min(static_cast<int>(bboxes[0]), max_objects);\n    if (position >= count)\n        return;\n\n    float* pcurrent = bboxes + 1 + position * bbox_element;\n    for (int i = 0; i < count; ++i) {\n        float* pitem = bboxes + 1 + i * bbox_element;\n        if (i == position || pcurrent[5] != pitem[5])\n            continue;\n        if (pitem[4] >= pcurrent[4]) {\n            if (pitem[4] == pcurrent[4] && i < position)\n                continue;\n            float iou = box_probiou(pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3], pcurrent[7], pitem[0], 
pitem[1],\n                                    pitem[2], pitem[3], pitem[7]);\n            if (iou > threshold) {\n                pcurrent[6] = 0;\n                return;\n            }\n        }\n    }\n}\n\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 cudaStream_t stream) {\n    int block = 256;\n    int grid = ceil(num_bboxes / (float)block);\n    decode_kernel<<<grid, block, 0, stream>>>((float*)predict, num_bboxes, confidence_threshold, parray, max_objects);\n}\n\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream) {\n    int block = max_objects < 256 ? max_objects : 256;\n    int grid = ceil(max_objects / (float)block);\n    nms_kernel<<<grid, block, 0, stream>>>(parray, max_objects, nms_threshold);\n}\n\nvoid cuda_decode_obb(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                     cudaStream_t stream) {\n    int block = 256;\n    int grid = ceil(num_bboxes / (float)block);\n    decode_kernel_obb<<<grid, block, 0, stream>>>((float*)predict, num_bboxes, confidence_threshold, parray,\n                                                  max_objects);\n}\n\nvoid cuda_nms_obb(float* parray, float nms_threshold, int max_objects, cudaStream_t stream) {\n    int block = max_objects < 256 ? max_objects : 256;\n    int grid = ceil(max_objects / (float)block);\n    nms_kernel_obb<<<grid, block, 0, stream>>>(parray, max_objects, nms_threshold);\n}\n"
  },
  {
    "path": "yolov8/src/preprocess.cu",
    "content": "#include \"cuda_utils.h\"\n#include \"preprocess.h\"\n\nstatic uint8_t* img_buffer_host = nullptr;\nstatic uint8_t* img_buffer_device = nullptr;\n\n__global__ void warpaffine_kernel(uint8_t* src, int src_line_size, int src_width, int src_height, float* dst,\n                                  int dst_width, int dst_height, uint8_t const_value_st, AffineMatrix d2s, int edge) {\n    int position = blockDim.x * blockIdx.x + threadIdx.x;\n    if (position >= edge)\n        return;\n\n    float m_x1 = d2s.value[0];\n    float m_y1 = d2s.value[1];\n    float m_z1 = d2s.value[2];\n    float m_x2 = d2s.value[3];\n    float m_y2 = d2s.value[4];\n    float m_z2 = d2s.value[5];\n\n    int dx = position % dst_width;\n    int dy = position / dst_width;\n    float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;\n    float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;\n    float c0, c1, c2;\n\n    if (src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height) {\n        // out of range\n        c0 = const_value_st;\n        c1 = const_value_st;\n        c2 = const_value_st;\n    } else {\n        int y_low = floorf(src_y);\n        int x_low = floorf(src_x);\n        int y_high = y_low + 1;\n        int x_high = x_low + 1;\n\n        uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};\n        float ly = src_y - y_low;\n        float lx = src_x - x_low;\n        float hy = 1 - ly;\n        float hx = 1 - lx;\n        float w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n        uint8_t* v1 = const_value;\n        uint8_t* v2 = const_value;\n        uint8_t* v3 = const_value;\n        uint8_t* v4 = const_value;\n\n        if (y_low >= 0) {\n            if (x_low >= 0)\n                v1 = src + y_low * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v2 = src + y_low * src_line_size + x_high * 3;\n        }\n\n        if (y_high < src_height) {\n            if (x_low >= 0)\n                
v3 = src + y_high * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v4 = src + y_high * src_line_size + x_high * 3;\n        }\n\n        c0 = w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0];\n        c1 = w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1];\n        c2 = w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2];\n    }\n\n    // bgr to rgb\n    float t = c2;\n    c2 = c0;\n    c0 = t;\n\n    // normalization\n    c0 = c0 / 255.0f;\n    c1 = c1 / 255.0f;\n    c2 = c2 / 255.0f;\n\n    // rgbrgbrgb to rrrgggbbb\n    int area = dst_width * dst_height;\n    float* pdst_c0 = dst + dy * dst_width + dx;\n    float* pdst_c1 = pdst_c0 + area;\n    float* pdst_c2 = pdst_c1 + area;\n    *pdst_c0 = c0;\n    *pdst_c1 = c1;\n    *pdst_c2 = c2;\n}\n\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height, float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream) {\n    int img_size = src_width * src_height * 3;\n    // copy data to pinned memory\n    memcpy(img_buffer_host, src, img_size);\n    // copy data to device memory\n    CUDA_CHECK(cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size, cudaMemcpyHostToDevice, stream));\n\n    AffineMatrix s2d, d2s;\n    float scale = std::min(dst_height / (float)src_height, dst_width / (float)src_width);\n\n    s2d.value[0] = scale;\n    s2d.value[1] = 0;\n    s2d.value[2] = -scale * src_width * 0.5 + dst_width * 0.5;\n    s2d.value[3] = 0;\n    s2d.value[4] = scale;\n    s2d.value[5] = -scale * src_height * 0.5 + dst_height * 0.5;\n    cv::Mat m2x3_s2d(2, 3, CV_32F, s2d.value);\n    cv::Mat m2x3_d2s(2, 3, CV_32F, d2s.value);\n    cv::invertAffineTransform(m2x3_s2d, m2x3_d2s);\n\n    memcpy(d2s.value, m2x3_d2s.ptr<float>(0), sizeof(d2s.value));\n\n    int jobs = dst_height * dst_width;\n    int threads = 256;\n    int blocks = ceil(jobs / (float)threads);\n    warpaffine_kernel<<<blocks, threads, 0, stream>>>(img_buffer_device, src_width * 3, 
src_width, src_height, dst,\n                                                      dst_width, dst_height, 128, d2s, jobs);\n}\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch, float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream) {\n    int dst_size = dst_width * dst_height * 3;\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        cuda_preprocess(img_batch[i].ptr(), img_batch[i].cols, img_batch[i].rows, &dst[dst_size * i], dst_width,\n                        dst_height, stream);\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n    }\n}\n\nvoid cuda_preprocess_init(int max_image_size) {\n    // prepare input data in pinned memory\n    CUDA_CHECK(cudaMallocHost((void**)&img_buffer_host, max_image_size * 3));\n    // prepare input data in device memory\n    CUDA_CHECK(cudaMalloc((void**)&img_buffer_device, max_image_size * 3));\n}\n\nvoid cuda_preprocess_destroy() {\n    CUDA_CHECK(cudaFree(img_buffer_device));\n    CUDA_CHECK(cudaFreeHost(img_buffer_host));\n}\n"
  },
  {
    "path": "yolov8/yolov8_5u_det.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, int& is_p, std::string& sub_type, float& gd,\n                      float& gw, int& max_channels) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    if (is_p == 6) {\n        serialized_engine =\n                buildEngineYolov8_5uDetP6(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\n    } else {\n        serialized_engine = buildEngineYolov8_5uDet(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\n    }\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = 
createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host, float** decode_ptr_host, float** decode_ptr_device,\n                    std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and\n    // output tensors. Note that indices are guaranteed to be less than\n    // IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"Do not yet support GPU post processing for multiple batches\" << std::endl;\n            exit(0);\n        }\n        // Allocate memory for decode_ptr_host and copy to device\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           float* decode_ptr_host, float* decode_ptr_device, int model_bboxes, std::string 
cuda_post_process) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueue(batchsize, buffers, stream, nullptr);\n    if (cuda_post_process == \"c\") {\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox,\n                 stream);  // cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, int& is_p, std::string& img_dir,\n                std::string& sub_type, std::string& cuda_post_process, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5 || argc == 7)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        
sub_type = std::string(argv[4]);\n\n        if (sub_type[0] == 'n') {\n            gd = 0.33;\n            gw = 0.25;\n            max_channels = 1024;\n        } else if (sub_type[0] == 's') {\n            gd = 0.33;\n            gw = 0.50;\n            max_channels = 1024;\n        } else if (sub_type[0] == 'm') {\n            gd = 0.67;\n            gw = 0.75;\n            max_channels = 576;\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n        } else if (sub_type[0] == 'x') {\n            gd = 1.33;\n            gw = 1.25;\n            max_channels = 640;\n        } else {\n            return false;\n        }\n        if (sub_type.size() == 2 && sub_type[1] == '6') {\n            is_p = 6;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n    std::string wts_name = \"\";\n    std::string engine_name = \"\";\n    std::string img_dir;\n    std::string sub_type = \"\";\n    std::string cuda_post_process = \"\";\n    int model_bboxes;\n    int is_p = 0;\n    float gd = 0.0f, gw = 0.0f;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, is_p, img_dir, sub_type, cuda_post_process, gd, gw,\n                    max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolov8_5u_det -s [.wts] [.engine] \"\n                     \"[n/s/m/l/x/n6/s6/m6/l6/x6]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr << \"./yolov8_5u_det -d [.engine] ../samples  [c/g]// deserialize \"\n                     \"plan file and run inference\"\n                  << std::endl;\n       
 return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, is_p, sub_type, gd, gw, max_channels);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host, &decode_ptr_host,\n                   &decode_ptr_device, cuda_post_process);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize, decode_ptr_host,\n              
decode_ptr_device, model_bboxes, cuda_post_process);\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n        } else if (cuda_post_process == \"g\") {\n            // Process gpu decode and nms results\n            batch_process(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\n        }\n        // Draw bounding boxes\n        draw_bbox(img_batch, res_batch);\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    // std::cout << \"\\nOutput:\\n\\n\";\n    // for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    // std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov8/yolov8_5u_det_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\nPOSE_NUM = 17 * 3\nDET_NUM = 6\nSEG_NUM = 32\nOBB_NUM = 1\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov8 project.\n    param:\n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 
255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n            )\n\n\nclass YoLov8TRT(object):\n    \"\"\"\n    description: A YOLOv8 class that wraps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = 
cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n        self.det_output_length = host_outputs[0].shape[0]\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context 
stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid = self.post_process(\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\n                batch_origin_w[i]\n            )\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            input_image_path: str, image path\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        
\"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate widht and height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image with long side while maintaining ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = 
x[:, 0]\n            y[:, 2] = x[:, 2]\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1]\n            y[:, 3] = x[:, 3]\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A numpy array like [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a numpy array, each element is the score corresponding to a box\n            result_classid: final class ids, a numpy array, each element is the class id corresponding to a box\n        \"\"\"\n        num_values_per_detection = DET_NUM + SEG_NUM + POSE_NUM + OBB_NUM\n        # Get the num of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        # pred = np.reshape(output[1:], (-1, 38))[:num, :]\n        pred = np.reshape(output[1:], (-1, num_values_per_detection))[:num, :]\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        
param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = (np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None)\n                      * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None))\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (x1, y1, x2, y2, 
conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Get the boxes that score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # clip the coordinates\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, -1] == boxes[:, -1]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov8_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path 
in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov8_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"./build/libmyplugins.so\"\n    engine_file_path = \"yolov5xu.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\n                  \"traffic light\",\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\n                  \"frisbee\",\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\n                  \"surfboard\",\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", 
\"couch\",\n                  \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\n                  \"cell phone\",\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\n                  \"teddy bear\",\n                  \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov8TRT instance\n    yolov8_wrapper = YoLov8TRT(engine_file_path)\n    try:\n        print('batch size is', yolov8_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolov8_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov8_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov8_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov8_wrapper.destroy()\n"
  },
  {
    "path": "yolov8/yolov8_cls.cpp",
    "content": "#include \"calibrator.h\"\n#include \"config.h\"\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"utils.h\"\n\n#include <chrono>\n#include <cmath>\n#include <iostream>\n#include <numeric>\n#include <opencv2/opencv.hpp>\n\nusing namespace nvinfer1;\n\nstatic Logger gLogger;\nconst static int kOutputSize = kClsNumClass;\n\nvoid batch_preprocess(std::vector<cv::Mat>& imgs, float* output, int dst_width = 224, int dst_height = 224) {\n    for (size_t b = 0; b < imgs.size(); b++) {\n        int h = imgs[b].rows;\n        int w = imgs[b].cols;\n        int m = std::min(h, w);\n        int top = (h - m) / 2;\n        int left = (w - m) / 2;\n        cv::Mat img = imgs[b](cv::Rect(left, top, m, m));\n        cv::resize(img, img, cv::Size(dst_width, dst_height), 0, 0, cv::INTER_LINEAR);\n        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);\n        img.convertTo(img, CV_32F, 1 / 255.0);\n\n        std::vector<cv::Mat> channels(3);\n        cv::split(img, channels);\n\n        // CHW format\n        for (int c = 0; c < 3; ++c) {\n            int i = 0;\n            for (int row = 0; row < dst_height; ++row) {\n                for (int col = 0; col < dst_width; ++col) {\n                    output[b * 3 * dst_height * dst_width + c * dst_height * dst_width + i] =\n                            channels[c].at<float>(row, col);\n                    ++i;\n                }\n            }\n        }\n    }\n}\n\nstd::vector<float> softmax(float* prob, int n) {\n    std::vector<float> res;\n    float sum = 0.0f;\n    float t;\n    for (int i = 0; i < n; i++) {\n        t = expf(prob[i]);\n        res.push_back(t);\n        sum += t;\n    }\n    for (int i = 0; i < n; i++) {\n        res[i] /= sum;\n    }\n    return res;\n}\n\nstd::vector<int> topk(const std::vector<float>& vec, int k) {\n    std::vector<int> topk_index;\n    std::vector<size_t> vec_index(vec.size());\n    std::iota(vec_index.begin(), vec_index.end(), 0);\n\n    
std::sort(vec_index.begin(), vec_index.end(),\n              [&vec](size_t index_1, size_t index_2) { return vec[index_1] > vec[index_2]; });\n\n    int k_num = std::min<int>(vec.size(), k);\n\n    for (int i = 0; i < k_num; ++i) {\n        topk_index.push_back(vec_index[i]);\n    }\n\n    return topk_index;\n}\n\nstd::vector<std::string> read_classes(std::string file_name) {\n    std::vector<std::string> classes;\n    std::ifstream ifs(file_name, std::ios::in);\n    if (!ifs.is_open()) {\n        std::cerr << file_name << \" is not found, pls refer to README and download it.\" << std::endl;\n        assert(0);\n    }\n    std::string s;\n    while (std::getline(ifs, s)) {\n        classes.push_back(s);\n    }\n    ifs.close();\n    return classes;\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, float& gd, float& gw,\n                std::string& img_dir) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        auto net = std::string(argv[4]);\n        if (net[0] == 'n') {\n            gd = 0.33;\n            gw = 0.25;\n        } else if (net[0] == 's') {\n            gd = 0.33;\n            gw = 0.50;\n        } else if (net[0] == 'm') {\n            gd = 0.67;\n            gw = 0.75;\n        } else if (net[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n        } else if (net[0] == 'x') {\n            gd = 1.0;\n            gw = 1.25;\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 4) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nvoid prepare_buffers(ICudaEngine* engine, float** gpu_input_buffer, float** gpu_output_buffer, float** cpu_input_buffer,\n                     float** output_buffer_host) {\n    
assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)gpu_input_buffer, kBatchSize * 3 * kClsInputH * kClsInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)gpu_output_buffer, kBatchSize * kOutputSize * sizeof(float)));\n\n    *cpu_input_buffer = new float[kBatchSize * 3 * kClsInputH * kClsInputW];\n    *output_buffer_host = new float[kBatchSize * kOutputSize];\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* input, float* output,\n           int batchSize) {\n    CUDA_CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 * kClsInputH * kClsInputW * sizeof(float),\n                               cudaMemcpyHostToDevice, stream));\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                               stream));\n    cudaStreamSynchronize(stream);\n}\n\nvoid serialize_engine(unsigned int max_batchsize, float& gd, float& gw, std::string& wts_name,\n                      std::string& engine_name) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    // Create model to populate the network, then set the outputs and create an engine\n    IHostMemory* serialized_engine = nullptr;\n    //engine = buildEngineYolov8Cls(max_batchsize, builder, config, DataType::kFLOAT, gd, gw, wts_name);\n    serialized_engine = buildEngineYolov8Cls(builder, config, DataType::kFLOAT, 
wts_name, gd, gw);\n    assert(serialized_engine);\n    // Save engine to file\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cerr << \"Could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    // Close everything down\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n\n    std::string wts_name = \"\";\n    std::string engine_name = \"\";\n    float gd = 0.0f, gw = 0.0f;\n    std::string img_dir;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, gd, gw, img_dir)) {\n        std::cerr << \"arguments not right!\" << std::endl;\n        std::cerr << \"./yolov8_cls -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to plan file\"\n                  << std::endl;\n        std::cerr << \"./yolov8_cls -d [.engine] ../samples  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API 
directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(kBatchSize, gd, gw, wts_name, engine_name);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* cpu_input_buffer = nullptr;\n    float* output_buffer_host = nullptr;\n    prepare_buffers(engine, &device_buffers[0], &device_buffers[1], &cpu_input_buffer, &output_buffer_host);\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // Read imagenet labels\n    auto classes = read_classes(\"imagenet_classes.txt\");\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n\n        // Preprocess\n        batch_preprocess(img_batch, cpu_input_buffer);\n\n        // Run inference\n        auto start = std::chrono::system_clock::now();\n        infer(*context, stream, (void**)device_buffers, cpu_input_buffer, output_buffer_host, kBatchSize);\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << 
std::endl;\n\n        // Postprocess and get top-k result\n        for (size_t b = 0; b < img_name_batch.size(); b++) {\n            float* p = &output_buffer_host[b * kOutputSize];\n            auto res = softmax(p, kOutputSize);\n            auto topk_idx = topk(res, 3);\n            std::cout << img_name_batch[b] << std::endl;\n            for (auto idx : topk_idx) {\n                std::cout << \"  \" << classes[idx] << \" \" << res[idx] << std::endl;\n            }\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    delete[] cpu_input_buffer;\n    delete[] output_buffer_host;\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n    return 0;\n}\n"
  },
  {
    "path": "yolov8/yolov8_cls_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport os\nimport shutil\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport torch\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\nwith open(\"imagenet_classes.txt\") as f:\n    classes = [line.strip() for line in f.readlines()]\n\n\nclass YoLov8TRT(object):\n    \"\"\"\n    description: A YOLOv8 class that wraps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a context on this device\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n        self.mean = (0.485, 0.456, 0.406)\n        self.std = (0.229, 0.224, 0.225)\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(\n                binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = 
cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n    def infer(self, raw_image_generator):\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_input_image = np.empty(\n            shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            batch_image_raw.append(image_raw)\n            input_image = self.preprocess_cls_image(image_raw)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], 
batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size,\n                              bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # The flat host buffer holds the class scores for the whole batch\n        output = host_outputs[0]\n        # Do postprocess once for the batch, then label each image that was actually read\n        classes_ls, predicted_conf_ls, category_id_ls = self.postprocess_cls(\n            output)\n        for i in range(len(batch_image_raw)):\n            cv2.putText(batch_image_raw[i], str(\n                classes_ls[i]), (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 1, cv2.LINE_AA)\n        print(classes_ls, predicted_conf_ls)\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Prepare data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_cls_image(self, raw_bgr_image, dst_width=224, dst_height=224):\n\n        \"\"\"\n            description: Convert BGR image to RGB,\n                         crop the center square frame,\n  
                       resize it to target size, normalize to [0,1],\n                         transform to NCHW format.\n            param:\n                raw_bgr_image: numpy array, raw BGR image\n                dst_width: int, target image width\n                dst_height: int, target image height\n            return:\n                batch_data: the processed float32 image in CHW layout with\n                            leading singleton batch dimensions\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        # Crop the center square frame\n        m = min(h, w)\n        top = (h - m) // 2\n        left = (w - m) // 2\n        image = raw_bgr_image[top:top + m, left:left + m]\n\n        # Resize the square crop to the target size\n        image = cv2.resize(image, (dst_width, dst_height), interpolation=cv2.INTER_LINEAR)\n\n        # Convert BGR to RGB\n        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)\n\n        # Normalize to [0,1]\n        image = image.astype(np.float32) / 255.0\n\n        # HWC to CHW format\n        image = image.transpose(2, 0, 1)\n\n        # CHW to NCHW format (add batch dimension)\n        image = np.expand_dims(image, axis=0)\n\n        # Convert the image to row-major order, also known as \"C order\"\n        image = np.ascontiguousarray(image)\n\n        batch_data = np.expand_dims(image, axis=0)\n\n        return batch_data\n\n    def postprocess_cls(self, output_data):\n        classes_ls = []\n        predicted_conf_ls = []\n        category_id_ls = []\n        output_data = output_data.reshape(self.batch_size, -1)\n        output_data = torch.Tensor(output_data)\n        p = torch.nn.functional.softmax(output_data, dim=1)\n        score, index = torch.topk(p, 3)\n        for ind in range(index.shape[0]):\n            input_category_id = index[ind][0].item()  # 716\n            category_id_ls.append(input_category_id)\n            
predicted_confidence = score[ind][0].item()\n            predicted_conf_ls.append(predicted_confidence)\n            classes_ls.append(classes[input_category_id])\n        return classes_ls, predicted_conf_ls, category_id_ls\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov8_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(\n            self.yolov8_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(\n            self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov8_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(\n            self.yolov8_wrapper.get_raw_image_zeros())\n        print(\n            'warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    engine_file_path = \"./yolov8x-cls-fp32.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov8TRT instance\n    yolov8_wrapper = YoLov8TRT(engine_file_path)\n    try:\n        print('batch size is', yolov8_wrapper.batch_size)\n\n        image_dir = \"samples/\"\n        image_path_batches = get_img_path_batches(\n            yolov8_wrapper.batch_size, 
image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov8_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov8_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov8_wrapper.destroy()\n"
  },
  {
    "path": "yolov8/yolov8_det.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, int& is_p, std::string& sub_type, float& gd,\n                      float& gw, int& max_channels) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    if (is_p == 6) {\n        serialized_engine = buildEngineYolov8DetP6(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\n    } else if (is_p == 2) {\n        serialized_engine = buildEngineYolov8DetP2(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\n    } else {\n        serialized_engine = buildEngineYolov8Det(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\n    }\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = 
new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host, float** decode_ptr_host, float** decode_ptr_device,\n                    std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"Do not yet support GPU post processing for multiple batches\" << std::endl;\n            exit(0);\n        }\n        // Allocate memory for decode_ptr_host and copy to device\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* 
output, int batchsize,\n           float* decode_ptr_host, float* decode_ptr_device, int model_bboxes, std::string cuda_post_process) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueue(batchsize, buffers, stream, nullptr);\n    if (cuda_post_process == \"c\") {\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, int& is_p, std::string& img_dir,\n                std::string& sub_type, std::string& cuda_post_process, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5 || argc 
== 7)) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        auto sub_type = std::string(argv[4]);\n\n        if (sub_type[0] == 'n') {\n            gd = 0.33;\n            gw = 0.25;\n            max_channels = 1024;\n        } else if (sub_type[0] == 's') {\n            gd = 0.33;\n            gw = 0.50;\n            max_channels = 1024;\n        } else if (sub_type[0] == 'm') {\n            gd = 0.67;\n            gw = 0.75;\n            max_channels = 576;\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.25;\n            max_channels = 640;\n        } else {\n            return false;\n        }\n        if (sub_type.size() == 2 && sub_type[1] == '6') {\n            is_p = 6;\n        } else if (sub_type.size() == 2 && sub_type[1] == '2') {\n            is_p = 2;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n    std::string wts_name = \"\";\n    std::string engine_name = \"\";\n    std::string img_dir;\n    std::string sub_type = \"\";\n    std::string cuda_post_process = \"\";\n    int model_bboxes;\n    int is_p = 0;\n    float gd = 0.0f, gw = 0.0f;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, is_p, img_dir, sub_type, cuda_post_process, gd, gw,\n                    max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolov8 -s [.wts] [.engine] [n/s/m/l/x/n2/s2/m2/l2/x2/n6/s6/m6/l6/x6]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr 
<< \"./yolov8 -d [.engine] ../samples  [c/g]// deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, is_p, sub_type, gd, gw, max_channels);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host, &decode_ptr_host,\n                   &decode_ptr_device, cuda_post_process);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        
infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize, decode_ptr_host,\n              decode_ptr_device, model_bboxes, cuda_post_process);\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n        } else if (cuda_post_process == \"g\") {\n            //Process gpu decode and nms results\n            batch_process(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\n        }\n        // Draw bounding boxes\n        draw_bbox(img_batch, res_batch);\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov8/yolov8_det_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\nPOSE_NUM = 17 * 3\nDET_NUM = 6\nSEG_NUM = 32\nOBB_NUM = 1\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov8 project.\n    param:\n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 
255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n            )\n\n\nclass YoLov8TRT(object):\n    \"\"\"\n    description: A YOLOv8 class that wraps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = 
cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n        self.det_output_length = host_outputs[0].shape[0]\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context 
stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid = self.post_process(\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\n                batch_origin_w[i]\n            )\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            input_image_path: str, image path\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        
\"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate widht and height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image with long side while maintaining ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = 
x[:, 0]\n            y[:, 2] = x[:, 2]\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1]\n            y[:, 3] = x[:, 3]\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A numpy array like [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a numpy array, each element is the score corresponding to a box\n            result_classid: final class ids, a numpy array, each element is the class id corresponding to a box\n        \"\"\"\n        num_values_per_detection = DET_NUM + SEG_NUM + POSE_NUM + OBB_NUM\n        # Get the num of boxes detected\n        num = int(output[0])\n        # Reshape to a two dimensional ndarray\n        # pred = np.reshape(output[1:], (-1, 38))[:num, :]\n        pred = np.reshape(output[1:], (-1, num_values_per_detection))[:num, :]\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        
param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = (np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None)\n                      * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None))\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (x1, y1, x2, y2, 
conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Get the boxes that score >= CONF_THRESH\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # clip the coordinates\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, -1] == boxes[:, -1]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov8_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path 
in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov8_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"./build/libmyplugins.so\"\n    engine_file_path = \"yolov8n.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\n                  \"traffic light\",\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\n                  \"frisbee\",\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\n                  \"surfboard\",\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", 
\"couch\",\n                  \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\n                  \"cell phone\",\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\n                  \"teddy bear\",\n                  \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov8TRT instance\n    yolov8_wrapper = YoLov8TRT(engine_file_path)\n    try:\n        print('batch size is', yolov8_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolov8_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov8_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov8_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov8_wrapper.destroy()\n"
  },
  {
    "path": "yolov8/yolov8_obb.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, int& is_p, std::string& sub_type, float& gd,\n                      float& gw, int& max_channels) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    if (is_p == 6) {\n        std::cout << \"p6 is not supported right now\" << std::endl;\n    } else if (is_p == 2) {\n        std::cout << \"p2 is not supported right now\" << std::endl;\n    } else {\n        serialized_engine = buildEngineYolov8Obb(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\n    }\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    
file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host, float** decode_ptr_host, float** decode_ptr_device,\n                    std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"Do not yet support GPU post processing for multiple batches\" << std::endl;\n            exit(0);\n        }\n        // Allocate memory for decode_ptr_host and copy to device\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           float* decode_ptr_host, float* decode_ptr_device, int 
model_bboxes, std::string cuda_post_process) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueue(batchsize, buffers, stream, nullptr);\n    if (cuda_post_process == \"c\") {\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode_obb((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms_obb(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, int& is_p, std::string& img_dir,\n                std::string& sub_type, std::string& cuda_post_process, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5 || argc == 7)) {\n        wts = std::string(argv[2]);\n        engine = 
std::string(argv[3]);\n        sub_type = std::string(argv[4]);\n\n        if (sub_type[0] == 'n') {\n            gd = 0.33;\n            gw = 0.25;\n            max_channels = 1024;\n        } else if (sub_type[0] == 's') {\n            gd = 0.33;\n            gw = 0.50;\n            max_channels = 1024;\n        } else if (sub_type[0] == 'm') {\n            gd = 0.67;\n            gw = 0.75;\n            max_channels = 576;\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.25;\n            max_channels = 640;\n        } else {\n            return false;\n        }\n        if (sub_type.size() == 2 && sub_type[1] == '6') {\n            is_p = 6;\n        } else if (sub_type.size() == 2 && sub_type[1] == '2') {\n            is_p = 2;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n    std::string wts_name = \"\";\n    std::string engine_name = \"\";\n    std::string img_dir;\n    std::string sub_type = \"\";\n    std::string cuda_post_process = \"\";\n    int model_bboxes;\n    int is_p = 0;\n    float gd = 0.0f, gw = 0.0f;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, is_p, img_dir, sub_type, cuda_post_process, gd, gw,\n                    max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolov8 -s [.wts] [.engine] [n/s/m/l/x/n2/s2/m2/l2/x2/n6/s6/m6/l6/x6]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr << \"./yolov8 -d [.engine] ../samples  [c/g]// deserialize plan 
file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, is_p, sub_type, gd, gw, max_channels);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host, &decode_ptr_host,\n                   &decode_ptr_device, cuda_post_process);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        infer(*context, stream, (void**)device_buffers, output_buffer_host, 
kBatchSize, decode_ptr_host,\n              decode_ptr_device, model_bboxes, cuda_post_process);\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms_obb(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n        } else if (cuda_post_process == \"g\") {\n            //Process gpu decode and nms results\n            batch_process_obb(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\n        }\n        // Draw bounding boxes\n        draw_bbox_obb(img_batch, res_batch);\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov8/yolov8_obb_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport sys\nimport threading\nimport time\nimport cv2\nimport math\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\nPOSE_NUM = 17 * 3\nDET_NUM = 6\nSEG_NUM = 32\nOBB_NUM = 1\n\nINPUT_W = 640\nINPUT_H = 640\n\n\nclass Detection:\n    def __init__(self, bbox, score, class_id, angle):\n        self.bbox = bbox\n        self.score = score\n        self.class_id = class_id\n        self.angle = angle\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef get_corner(img, box: Detection):\n    \"\"\"\n    description: Get the four corner points of the rotated bounding box\n    param:\n        img:    an opencv image object (numpy array)\n        box:    a Detection object containing bbox [cx,cy,w,h] and angle (radians)\n    return:\n        corners: four corner points of the rotated bounding box as numpy array [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]\n    \"\"\"\n    # Extract box parameters\n    cx, cy, w, h = box.bbox\n    angle = box.angle * 180.0 / math.pi  # Convert radians to degrees\n\n    # Swap width and height if height >= width\n    if h >= w:\n        w, h = h, w\n        angle = (angle + 90.0) % 180.0  # Adjust angle\n\n    # Ensure angle is between 0 and 180 degrees\n    if angle < 0:\n        angle += 360.0\n    if angle > 180.0:\n        angle -= 180.0\n\n    # Convert to normalized angle (0-180)\n    normal_angle = angle % 180.0\n    if normal_angle < 0:\n        normal_angle += 180.0\n\n    
# Convert back to radians for calculation\n    angle_rad = angle * math.pi / 180.0\n    cos_val = math.cos(angle_rad)\n    sin_val = math.sin(angle_rad)\n\n    # Calculate boundaries\n    l_x = cx - w / 2\n    r_x = cx + w / 2\n    t_y = cy - h / 2\n    b_y = cy + h / 2\n\n    # Scale coordinates using get_rect_obb (matching C++ version)\n    bbox = [l_x, t_y, r_x, b_y]\n    rect = get_rect_obb(img, bbox)\n\n    # Calculate center and dimensions of scaled box\n    x_ = (rect[0] + rect[0] + rect[2]) / 2  # rect.x + rect.width/2\n    y_ = (rect[1] + rect[1] + rect[3]) / 2  # rect.y + rect.height/2\n    width = rect[2]\n    height = rect[3]\n\n    # Calculate vectors\n    vec1x = width / 2 * cos_val\n    vec1y = width / 2 * sin_val\n    vec2x = -height / 2 * sin_val\n    vec2y = height / 2 * cos_val\n\n    # Calculate four corners\n    corners = np.array([\n        [int(round(x_ + vec1x + vec2x)), int(round(y_ + vec1y + vec2y))],  # Top-left\n        [int(round(x_ + vec1x - vec2x)), int(round(y_ + vec1y - vec2y))],  # Top-right\n        [int(round(x_ - vec1x - vec2x)), int(round(y_ - vec1y - vec2y))],  # Bottom-right\n        [int(round(x_ - vec1x + vec2x)), int(round(y_ - vec1y + vec2y))]  # Bottom-left\n    ], dtype=np.int32)\n\n    # Clip to image boundaries\n    h, w = img.shape[:2]\n    corners[:, 0] = np.clip(corners[:, 0], 0, w - 1)\n    corners[:, 1] = np.clip(corners[:, 1], 0, h - 1)\n\n    return corners\n\n\ndef get_rect_obb(img, bbox):\n    \"\"\"\n    Scale coordinates according to image resize ratio (matching C++ version)\n    param:\n        img: OpenCV image (numpy array)\n        bbox: [left, top, right, bottom]\n    return:\n        [x, y, width, height]\n    \"\"\"\n    l_x, t_y, r_x, b_y = bbox\n    r_w = INPUT_W / img.shape[1]  # INPUT_W should be your model input width\n    r_h = INPUT_H / img.shape[0]  # INPUT_H should be your model input height\n\n    if r_h > r_w:\n        l_x = l_x\n        r_x = r_x\n        t_y = t_y - (INPUT_H - r_w * 
img.shape[0]) / 2\n        b_y = b_y - (INPUT_H - r_w * img.shape[0]) / 2\n        l_x = l_x / r_w\n        r_x = r_x / r_w\n        t_y = t_y / r_w\n        b_y = b_y / r_w\n    else:\n        l_x = l_x - (INPUT_W - r_h * img.shape[1]) / 2\n        r_x = r_x - (INPUT_W - r_h * img.shape[1]) / 2\n        t_y = t_y\n        b_y = b_y\n        l_x = l_x / r_h\n        r_x = r_x / r_h\n        t_y = t_y / r_h\n        b_y = b_y / r_h\n\n    l_x = max(0.0, l_x)\n    t_y = max(0.0, t_y)\n    width = max(0, min(int(round(r_x - l_x)), img.shape[1] - int(round(l_x))))\n    height = max(0, min(int(round(b_y - t_y)), img.shape[0] - int(round(t_y))))\n\n    return [int(round(l_x)), int(round(t_y)), width, height]\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one rotated bounding box on image img\n    param:\n        x:      a box in [cx, cy, w, h, angle] format\n        img:    an opencv image object\n        color:  color to draw rectangle\n        label:  str\n        line_thickness: int\n    \"\"\"\n    tl = line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n\n    # Get four corner points\n    corners = get_corner(img, x)\n    corners = corners.astype(int)\n\n    # Draw the rotated rectangle\n    cv2.polylines(img, [corners], isClosed=True, color=color, thickness=tl, lineType=cv2.LINE_AA)\n\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        # Use first corner point for label placement\n        p1 = tuple(corners[0])\n        w, h = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n\n        outside = p1[1] - h >= 3\n        p2 = (p1[0] + w, p1[1] - h - 3 if outside else p1[1] + h + 3)\n\n        cv2.rectangle(img, p1, p2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (p1[0], p1[1] - 2 if outside else p1[1] + h + 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            
thickness=tf,\n            lineType=cv2.LINE_AA\n        )\n\n\nclass YoLov8TRT(object):\n    \"\"\"\n    description: A YOLOv8 class that wraps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a CUDA context on this device\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        
self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n        self.det_output_length = host_outputs[0].shape[0]\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating 
it.\n        self.ctx.pop()\n        # Use the first output buffer; it holds the detections for the whole batch\n        output = host_outputs[0]\n        # Do postprocess, only for the images actually provided (the last batch may be short)\n        for i in range(len(batch_image_raw)):\n            keep = self.post_process(\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\n                batch_origin_w[i]\n            )\n            # Draw rectangles and labels on the original image\n            for j in range(len(keep)):\n                box = keep[j]  # type: Detection\n                np.random.seed(int(keep[j].class_id))\n                color = [np.random.randint(0, 255) for _ in range(3)]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(keep[j].class_id)], keep[j].score\n                    ),\n                    color=color,\n                    line_thickness=1\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read the images at the given paths, one at a time\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Generate blank images for warm-up\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            raw_bgr_image: numpy array, BGR image as read by OpenCV\n        return:\n            image:  
the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate width, height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image along its long side while maintaining the aspect ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = 
np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0]\n            y[:, 2] = x[:, 2]\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1]\n            y[:, 3] = x[:, 3]\n            y /= r_h\n\n        return y\n\n    def covariance_matrix(self, res: Detection):\n        \"\"\"\n        description: Generating covariance matrix from obbs.\n        param:\n            box (np.ndarray): A numpy array representing rotated bounding box, with xywhr format.\n\n        return:\n            tuple: (a, b, c) values of covariance matrix\n        \"\"\"\n        w = res.bbox[2]\n        h = res.bbox[3]\n        angle = res.angle\n\n        a = w * w / 12.0\n        b = h * h / 12.0\n        c = angle\n\n        cos_r = math.cos(c)\n        sin_r = math.sin(c)\n\n        cos_r2 = cos_r * cos_r\n        sin_r2 = sin_r * sin_r\n\n        a_val = a * cos_r2 + b * sin_r2\n        b_val = a * sin_r2 + b * cos_r2\n        c_val = (a - b) * cos_r * sin_r\n\n        return a_val, b_val, c_val\n\n    def probiou(self, box1: Detection, box2: Detection, eps=1e-7):\n        \"\"\"\n        description: Calculate the prob IoU between oriented bounding boxes.\n        param:\n            box1 (np.ndarray): First box in xywhr format\n            box2 (np.ndarray): Second box in xywhr format\n            eps (float): Small value to avoid division by zero\n        return:\n            float: 1 - hd where hd is the Bhattacharyya distance\n        \"\"\"\n        a1, b1, c1 = self.covariance_matrix(box1)\n        a2, b2, c2 = self.covariance_matrix(box2)\n\n        x1, y1 = box1.bbox[0], box1.bbox[1]\n        x2, y2 
= box2.bbox[0], box2.bbox[1]\n\n        t1 = ((a1 + a2) * (y1 - y2) ** 2 + (b1 + b2) * (x1 - x2) ** 2) / \\\n             ((a1 + a2) * (b1 + b2) - (c1 + c2) ** 2 + eps)\n        t1 *= 0.25\n\n        t2 = ((c1 + c2) * (x2 - x1) * (y1 - y2)) / \\\n             ((a1 + a2) * (b1 + b2) - (c1 + c2) ** 2 + eps)\n        t2 *= 0.5\n\n        t3 = ((a1 + a2) * (b1 + b2) - (c1 + c2) ** 2) / \\\n             (4 * math.sqrt(max(a1 * b1 - c1 * c1, 0.0)) *\n              math.sqrt(max(a2 * b2 - c2 * c2, 0.0)) + eps)\n        t3 = math.log(t3 + eps) * 0.5\n\n        bd = max(min(t1 + t2 + t3, 100.0), eps)\n        hd = math.sqrt(1.0 - math.exp(-bd) + eps)\n\n        return 1 - hd\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A flat numpy array [num_boxes, det_0, det_1, ...]; each detection holds\n                        DET_NUM + SEG_NUM + POSE_NUM + OBB_NUM values: cx, cy, w, h, conf, cls_id,\n                        then the mask and keypoint slots (unused here), and the angle last\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            keep: a list of Detection objects (bbox [cx, cy, w, h], score, class_id, angle)\n                  that survive the class-wise NMS\n        \"\"\"\n        num_values_per_detection = DET_NUM + SEG_NUM + POSE_NUM + OBB_NUM\n        # Get the number of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, num_values_per_detection))[:num, :]\n\n        # Filter by confidence threshold\n        mask = pred[:, 4] >= CONF_THRESH\n        pred = pred[mask]\n\n        if len(pred) == 0:\n            return []\n\n        m_map = {}\n        for i in range(len(pred)):\n            class_id = int(pred[i][5])\n            if class_id not in m_map:\n                m_map[class_id] = []\n            
m_map[class_id].append(Detection(pred[i][:4], pred[i][4], class_id, pred[i][89]))\n\n        res = []\n        for it in m_map:\n            dets = m_map[it]\n            dets = sorted(dets, key=lambda x: x.score, reverse=True)\n            for m in range(len(dets)):\n                if dets[m].score == 0.0:\n                    continue\n                item = dets[m]\n                res.append(item)\n                for n in range(m + 1, len(dets)):\n                    if dets[n].score == 0.0:\n                        continue\n                    if self.probiou(item, dets[n]) > IOU_THRESHOLD:\n                        dets[n].score = 0.0\n\n        keep = []\n        for i in range(len(res)):\n            if res[i].score > CONF_THRESH:\n                keep.append(res[i])\n\n        return keep\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov8_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov8_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # 
load custom plugin and engine\n    PLUGIN_LIBRARY = \"./build/libmyplugins.so\"\n    engine_file_path = \"yolov8n-obb.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load DOTA v1.5 labels\n\n    categories = [\"plane\", \"ship\", \"storage tank\", \"baseball diamond\", \"tennis court\",\n                  \"basketball court\", \"ground track field\", \"harbor\",\n                  \"bridge\", \"large vehicle\", \"small vehicle\", \"helicopter\",\n                  \"roundabout\", \"soccer ball field\", \"swimming pool\", \"container crane\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # create a YoLov8TRT instance\n    yolov8_wrapper = YoLov8TRT(engine_file_path)\n    try:\n        print('batch size is', yolov8_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolov8_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov8_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov8_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov8_wrapper.destroy()\n"
  },
  {
    "path": "yolov8/yolov8_pose.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, int& is_p, std::string& sub_type, float& gd,\n                      float& gw, int& max_channels) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    if (is_p == 6) {\n        serialized_engine = buildEngineYolov8PoseP6(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\n    } else if (is_p == 2) {\n        std::cout << \"p2 is not supported right now\" << std::endl;\n    } else {\n        serialized_engine = buildEngineYolov8Pose(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\n    }\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n  
  file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host, float** decode_ptr_host, float** decode_ptr_device,\n                    std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"Do not yet support GPU post processing for multiple batches\" << std::endl;\n            exit(0);\n        }\n        // Allocate memory for decode_ptr_host and copy to device\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchsize,\n           float* 
decode_ptr_host, float* decode_ptr_device, int model_bboxes, std::string cuda_post_process) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueue(batchsize, buffers, stream, nullptr);\n    if (cuda_post_process == \"c\") {\n        CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, int& is_p, std::string& img_dir,\n                std::string& sub_type, std::string& cuda_post_process, float& gd, float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && (argc == 5 || argc == 7)) {\n        wts = 
std::string(argv[2]);\n        engine = std::string(argv[3]);\n        // Assign to the output parameter instead of shadowing it with a local variable\n        sub_type = std::string(argv[4]);\n\n        if (sub_type[0] == 'n') {\n            gd = 0.33;\n            gw = 0.25;\n            max_channels = 1024;\n        } else if (sub_type[0] == 's') {\n            gd = 0.33;\n            gw = 0.50;\n            max_channels = 1024;\n        } else if (sub_type[0] == 'm') {\n            gd = 0.67;\n            gw = 0.75;\n            max_channels = 576;\n        } else if (sub_type[0] == 'l') {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n        } else if (sub_type[0] == 'x') {\n            gd = 1.0;\n            gw = 1.25;\n            max_channels = 640;\n        } else {\n            return false;\n        }\n        if (sub_type.size() == 2 && sub_type[1] == '6') {\n            is_p = 6;\n        } else if (sub_type.size() == 2 && sub_type[1] == '2') {\n            is_p = 2;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 5) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n    std::string wts_name = \"\";\n    std::string engine_name = \"\";\n    std::string img_dir;\n    std::string sub_type = \"\";\n    std::string cuda_post_process = \"\";\n    int model_bboxes;\n    int is_p = 0;\n    float gd = 0.0f, gw = 0.0f;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, is_p, img_dir, sub_type, cuda_post_process, gd, gw,\n                    max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolov8 -s [.wts] [.engine] [n/s/m/l/x/n2/s2/m2/l2/x2/n6/s6/m6/l6/x6]  // serialize model to \"\n                     \"plan file\"\n                  << std::endl;\n        std::cerr << \"./yolov8 -d 
[.engine] ../samples  [c/g]// deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(wts_name, engine_name, is_p, sub_type, gd, gw, max_channels);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host, &decode_ptr_host,\n                   &decode_ptr_device, cuda_post_process);\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        infer(*context, stream, 
(void**)device_buffers, output_buffer_host, kBatchSize, decode_ptr_host,\n              decode_ptr_device, model_bboxes, cuda_post_process);\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n        } else if (cuda_post_process == \"g\") {\n            // Process gpu decode and nms results\n            // todo pose in gpu\n            std::cerr << \"pose postprocess is not supported on gpu right now\" << std::endl;\n        }\n        // Draw bounding boxes\n        draw_bbox_keypoints_line(img_batch, res_batch);\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov8/yolov8_pose_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\nPOSE_NUM = 17 * 3\nDET_NUM = 6\nSEG_NUM = 32\nOBB_NUM = 1\nkeypoint_pairs = [\n    (0, 1), (0, 2), (0, 5), (0, 6), (1, 2),\n    (1, 3), (2, 4), (5, 6), (5, 7), (5, 11),\n    (6, 8), (6, 12), (7, 9), (8, 10), (11, 12),\n    (11, 13), (12, 14), (13, 15), (14, 16)\n]\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov8 project.\n    param:\n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        
cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n            )\n\n\nclass YoLov8TRT(object):\n    \"\"\"\n    description: A YOLOv8 class that warps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('bingding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                
cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n        self.det_output_size = host_outputs[0].shape[0]\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i],\n                      input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        
cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n\n            result_boxes, result_scores, result_classid, keypoints = self.post_process(\n                output[i * (self.det_output_size): (i + 1) * (self.det_output_size)],\n                batch_origin_h[i], batch_origin_w[i]\n            )\n\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n\n                num_keypoints = len(keypoints[j]) // 3\n                points = []\n                for k in range(num_keypoints):\n                    x = keypoints[j][k * 3]\n                    y = keypoints[j][k * 3 + 1]\n                    confidence = keypoints[j][k * 3 + 2]\n                    if confidence > 0:\n                        points.append((int(x), int(y)))\n                    else:\n                        points.append(None)\n\n                # Draw lines between keypoints according to the keypoint index pairs\n                for pair in keypoint_pairs:\n                    partA, partB = pair\n                    if points[partA] and points[partB]:\n                        cv2.line(batch_image_raw[i], points[partA], points[partB], (0, 255, 0), 2)\n\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        
self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Prepare data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            raw_bgr_image: a BGR image, e.g. as returned by cv2.imread\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate width and height and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image with long side while maintaining ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 
255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy_with_keypoints(self, origin_h, origin_w, boxes, keypoints):\n\n        n = len(boxes)\n        box_array = np.zeros_like(boxes)\n        keypoint_array = np.zeros_like(keypoints)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        for i in range(n):\n            if r_h > r_w:\n                box = boxes[i]\n                lmk = keypoints[i]\n                box_array[i, 0] = box[0] / r_w\n                box_array[i, 2] = box[2] / r_w\n                box_array[i, 1] = (box[1] - (self.input_h - r_w * origin_h) / 2) / r_w\n                box_array[i, 3] = (box[3] - (self.input_h - r_w * origin_h) / 2) / r_w\n\n                for j in range(0, len(lmk), 3):\n                    keypoint_array[i, j] = lmk[j] / r_w\n                    keypoint_array[i, j + 1] = (lmk[j + 1] - (self.input_h - r_w * origin_h) / 2) / r_w\n                    keypoint_array[i, j + 2] = lmk[j + 2]\n            else:\n\n                box = boxes[i]\n                lmk = keypoints[i]\n\n                box_array[i, 0] = (box[0] - (self.input_w - r_h * origin_w) / 2) / r_h\n                box_array[i, 2] = (box[2] - (self.input_w - r_h * origin_w) / 2) / r_h\n                box_array[i, 1] = box[1] / r_h\n                box_array[i, 3] = box[3] / r_h\n\n                for j in range(0, len(lmk), 3):\n                    keypoint_array[i, j] = (lmk[j] - (self.input_w - r_h * origin_w) / 2) / r_h\n                    keypoint_array[i, j + 1] = lmk[j + 1] / r_h\n                    keypoint_array[i, j + 2] = lmk[j + 2]\n\n        return box_array, keypoint_array\n\n    def post_process(self, output, origin_h, 
origin_w):\n        \"\"\"\n        description: Post-process the prediction to include pose keypoints\n        param:\n            output:     A numpy array like [num_boxes, cx, cy, w, h, conf,\n            cls_id, px1, py1, pconf1,...px17, py17, pconf17] where p denotes pose keypoint\n            origin_h:   Height of original image\n            origin_w:   Width of original image\n        return:\n            result_boxes:    Final boxes, a numpy array, each row is a box [x1, y1, x2, y2]\n            result_scores:   Final scores, a numpy array, each element is the score corresponding to box\n            result_classid:  Final classID, a numpy array, each element is the classid corresponding to box\n            result_keypoints: Final keypoints, a list of numpy arrays,\n            each element represents keypoints for a box, shaped as (#keypoints, 3)\n        \"\"\"\n        # Number of values per detection: 38 base values + 17 keypoints * 3 values each + angle\n        num_values_per_detection = DET_NUM + SEG_NUM + POSE_NUM + OBB_NUM\n        # Get the number of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray with the full detection shape\n        pred = np.reshape(output[1:], (-1, num_values_per_detection))[:num, :]\n\n        # Perform non-maximum suppression to filter the detections\n        boxes = self.non_max_suppression(\n            pred[:, :num_values_per_detection], origin_h, origin_w,\n            conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n\n        # Extract the bounding boxes, confidence scores, and class IDs\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        result_keypoints = boxes[:, -POSE_NUM-1:-1] if len(boxes) else np.array([])\n\n        # Return the post-processed results including keypoints\n        return result_boxes, 
result_scores, result_classid, result_keypoints\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = np.clip(\n            inter_rect_x2 - inter_rect_x1 + 1, 0, None) * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object 
confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Keep the boxes whose score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        res_array = np.copy(boxes)\n        box_pred_deep_copy = np.copy(boxes[:, :4])\n        keypoints_pred_deep_copy = np.copy(boxes[:, -POSE_NUM-1:-1])\n        res_box, res_keypoints = self.xywh2xyxy_with_keypoints(\n            origin_h, origin_w, box_pred_deep_copy, keypoints_pred_deep_copy)\n        res_array[:, :4] = res_box\n        res_array[:, -POSE_NUM-1:-1] = res_keypoints\n        # clip the coordinates\n        res_array[:, 0] = np.clip(res_array[:, 0], 0, origin_w - 1)\n        res_array[:, 2] = np.clip(res_array[:, 2], 0, origin_w - 1)\n        res_array[:, 1] = np.clip(res_array[:, 1], 0, origin_h - 1)\n        res_array[:, 3] = np.clip(res_array[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = res_array[:, 4]\n        # Sort by the confs\n        res_array = res_array[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_res_array = []\n        while res_array.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(res_array[0, :4], 0), res_array[:, :4]) > nms_thres\n            label_match = res_array[0, 5] == res_array[:, 5]\n            invalid = large_overlap & label_match\n            keep_res_array.append(res_array[0])\n            res_array = res_array[~invalid]\n\n        
res_array = np.stack(keep_res_array, 0) if len(keep_res_array) else np.array([])\n        return res_array\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov8_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov8_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"./build/libmyplugins.so\"\n    engine_file_path = \"yolov8n-pose.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov8TRT instance\n    yolov8_wrapper = YoLov8TRT(engine_file_path)\n    try:\n        print('batch size is', yolov8_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = 
get_img_path_batches(yolov8_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov8_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov8_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov8_wrapper.destroy()\n"
  },
  {
    "path": "yolov8/yolov8_seg.cpp",
    "content": "\n#include <fstream>\n#include <iostream>\n#include <opencv2/opencv.hpp>\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nLogger gLogger;\nusing namespace nvinfer1;\nconst int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\nconst static int kOutputSegSize = 32 * (kInputH / 4) * (kInputW / 4);\n\nstatic cv::Rect get_downscale_rect(float bbox[4], float scale) {\n\n    float left = bbox[0];\n    float top = bbox[1];\n    float right = bbox[0] + bbox[2];\n    float bottom = bbox[1] + bbox[3];\n\n    left = left < 0 ? 0 : left;\n    top = top < 0 ? 0 : top;\n    right = right > kInputW ? kInputW : right;\n    bottom = bottom > kInputH ? kInputH : bottom;\n\n    left /= scale;\n    top /= scale;\n    right /= scale;\n    bottom /= scale;\n    return cv::Rect(int(left), int(top), int(right - left), int(bottom - top));\n}\n\nstd::vector<cv::Mat> process_mask(const float* proto, int proto_size, std::vector<Detection>& dets) {\n\n    std::vector<cv::Mat> masks;\n    for (size_t i = 0; i < dets.size(); i++) {\n\n        cv::Mat mask_mat = cv::Mat::zeros(kInputH / 4, kInputW / 4, CV_32FC1);\n        auto r = get_downscale_rect(dets[i].bbox, 4);\n\n        for (int x = r.x; x < r.x + r.width; x++) {\n            for (int y = r.y; y < r.y + r.height; y++) {\n                float e = 0.0f;\n                for (int j = 0; j < 32; j++) {\n                    e += dets[i].mask[j] * proto[j * proto_size / 32 + y * mask_mat.cols + x];\n                }\n                e = 1.0f / (1.0f + expf(-e));\n                mask_mat.at<float>(y, x) = e;\n            }\n        }\n        cv::resize(mask_mat, mask_mat, cv::Size(kInputW, kInputH));\n        masks.push_back(mask_mat);\n    }\n    return masks;\n}\n\nvoid serialize_engine(std::string& wts_name, std::string& engine_name, std::string& sub_type, float& gd, float& gw,\n                    
  int& max_channels) {\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n    IHostMemory* serialized_engine = nullptr;\n\n    serialized_engine = buildEngineYolov8Seg(builder, config, DataType::kFLOAT, wts_name, gd, gw, max_channels);\n\n    assert(serialized_engine);\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cout << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete serialized_engine;\n    delete config;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_seg_buffer_device, float** output_buffer_host, float** output_seg_buffer_host,\n                    float** decode_ptr_host, float** decode_ptr_device, std::string cuda_post_process) {\n    assert(engine->getNbBindings() == 3);\n    // In order to bind the buffers, we need to know the names of the input and 
output tensors.\n    // Note that indices are guaranteed to be less than IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    const int outputIndex_seg = engine->getBindingIndex(\"proto\");\n\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    assert(outputIndex_seg == 2);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_seg_buffer_device, kBatchSize * kOutputSegSize * sizeof(float)));\n\n    if (cuda_post_process == \"c\") {\n        *output_buffer_host = new float[kBatchSize * kOutputSize];\n        *output_seg_buffer_host = new float[kBatchSize * kOutputSegSize];\n    } else if (cuda_post_process == \"g\") {\n        if (kBatchSize > 1) {\n            std::cerr << \"Do not yet support GPU post processing for multiple batches\" << std::endl;\n            exit(0);\n        }\n        // Allocate memory for decode_ptr_host and copy to device\n        *decode_ptr_host = new float[1 + kMaxNumOutputBbox * bbox_element];\n        CUDA_CHECK(cudaMalloc((void**)decode_ptr_device, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element)));\n    }\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, float* output_seg,\n           int batchsize, float* decode_ptr_host, float* decode_ptr_device, int model_bboxes,\n           std::string cuda_post_process) {\n    // infer on the batch asynchronously, and DMA output back to host\n    auto start = std::chrono::system_clock::now();\n    context.enqueue(batchsize, buffers, stream, nullptr);\n    if (cuda_post_process == \"c\") {\n\n        std::cout << \"kOutputSize:\" << kOutputSize << std::endl;\n        
CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchsize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                                   stream));\n        std::cout << \"kOutputSegSize:\" << kOutputSegSize << std::endl;\n        CUDA_CHECK(cudaMemcpyAsync(output_seg, buffers[2], batchsize * kOutputSegSize * sizeof(float),\n                                   cudaMemcpyDeviceToHost, stream));\n\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \" << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()\n                  << \"ms\" << std::endl;\n    } else if (cuda_post_process == \"g\") {\n        CUDA_CHECK(\n                cudaMemsetAsync(decode_ptr_device, 0, sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), stream));\n        cuda_decode((float*)buffers[1], model_bboxes, kConfThresh, decode_ptr_device, kMaxNumOutputBbox, stream);\n        cuda_nms(decode_ptr_device, kNmsThresh, kMaxNumOutputBbox, stream);  //cuda nms\n        CUDA_CHECK(cudaMemcpyAsync(decode_ptr_host, decode_ptr_device,\n                                   sizeof(float) * (1 + kMaxNumOutputBbox * bbox_element), cudaMemcpyDeviceToHost,\n                                   stream));\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference and gpu postprocess time: \"\n                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << \"ms\" << std::endl;\n    }\n\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir,\n                std::string& sub_type, std::string& cuda_post_process, std::string& labels_filename, float& gd,\n                float& gw, int& max_channels) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && argc == 5) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        
sub_type = std::string(argv[4]);\n        if (sub_type == \"n\") {\n            gd = 0.33;\n            gw = 0.25;\n            max_channels = 1024;\n        } else if (sub_type == \"s\") {\n            gd = 0.33;\n            gw = 0.50;\n            max_channels = 1024;\n        } else if (sub_type == \"m\") {\n            gd = 0.67;\n            gw = 0.75;\n            max_channels = 576;\n        } else if (sub_type == \"l\") {\n            gd = 1.0;\n            gw = 1.0;\n            max_channels = 512;\n        } else if (sub_type == \"x\") {\n            gd = 1.0;\n            gw = 1.25;\n            max_channels = 640;\n        } else {\n            return false;\n        }\n    } else if (std::string(argv[1]) == \"-d\" && argc == 6) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n        cuda_post_process = std::string(argv[4]);\n        labels_filename = std::string(argv[5]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n    std::string wts_name = \"\";\n    std::string engine_name = \"\";\n    std::string img_dir;\n    std::string sub_type = \"\";\n    std::string cuda_post_process = \"\";\n    std::string labels_filename = \"../coco.txt\";\n    int model_bboxes;\n    float gd = 0.0f, gw = 0.0f;\n    int max_channels = 0;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, sub_type, cuda_post_process, labels_filename, gd, gw,\n                    max_channels)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolov8 -s [.wts] [.engine] [n/s/m/l/x]  // serialize model to plan file\" << std::endl;\n        std::cerr << \"./yolov8 -d [.engine] ../samples  [c/g] coco_file// deserialize plan file and run inference\"\n                  << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        
serialize_engine(wts_name, engine_name, sub_type, gd, gw, max_channels);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n    cuda_preprocess_init(kMaxInputImageSize);\n    auto out_dims = engine->getBindingDimensions(1);\n    model_bboxes = out_dims.d[0];\n    // Prepare cpu and gpu buffers\n    float* device_buffers[3];\n    float* output_buffer_host = nullptr;\n    float* output_seg_buffer_host = nullptr;\n    float* decode_ptr_host = nullptr;\n    float* decode_ptr_device = nullptr;\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    std::unordered_map<int, std::string> labels_map;\n    read_labels(labels_filename, labels_map);\n    assert(kNumClass == labels_map.size());\n\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &device_buffers[2], &output_buffer_host,\n                   &output_seg_buffer_host, &decode_ptr_host, &decode_ptr_device, cuda_post_process);\n\n    // // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n        // Run inference\n        infer(*context, 
stream, (void**)device_buffers, output_buffer_host, output_seg_buffer_host, kBatchSize,\n              decode_ptr_host, decode_ptr_device, model_bboxes, cuda_post_process);\n        std::vector<std::vector<Detection>> res_batch;\n        if (cuda_post_process == \"c\") {\n            // NMS\n            batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n            for (size_t b = 0; b < img_batch.size(); b++) {\n                auto& res = res_batch[b];\n                cv::Mat img = img_batch[b];\n                auto masks = process_mask(&output_seg_buffer_host[b * kOutputSegSize], kOutputSegSize, res);\n                draw_mask_bbox(img, res, masks, labels_map);\n                cv::imwrite(\"_\" + img_name_batch[b], img);\n            }\n        } else if (cuda_post_process == \"g\") {\n            // Process gpu decode and nms results\n            // batch_process(res_batch, decode_ptr_host, img_batch.size(), bbox_element, img_batch);\n            // TODO: segmentation post-processing on the GPU\n            std::cerr << \"seg_postprocess is not supported on the GPU yet\" << std::endl;\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    CUDA_CHECK(cudaFree(device_buffers[2]));\n    CUDA_CHECK(cudaFree(decode_ptr_device));\n    delete[] decode_ptr_host;\n    delete[] output_buffer_host;\n    delete[] output_seg_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    // std::cout << \"\\nOutput:\\n\\n\";\n    // for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    // std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov8/yolov8_seg_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\nPOSE_NUM = 17 * 3\nDET_NUM = 6\nSEG_NUM = 32\nOBB_NUM = 1\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from YoLov8 project.\n    param:\n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 
255],\n            thickness=tf,\n            lineType=cv2.LINE_AA,\n            )\n\n\nclass YoLov8TRT(object):\n    \"\"\"\n    description: A YOLOv8 class that wraps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a CUDA context on this device\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('binding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = 
cuda_inputs\n        self.host_outputs = host_outputs\n        self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n        # Data length\n        self.det_output_length = host_outputs[0].shape[0]\n        self.seg_output_length = host_outputs[1].shape[0]\n        self.seg_w = int(self.input_w / 4)\n        self.seg_h = int(self.input_h / 4)\n        # Number of prototype channels: total length divided by the mask area (width * height)\n        self.seg_c = int(self.seg_output_length / (self.seg_w * self.seg_h))\n        self.det_row_output_length = self.seg_c + DET_NUM + POSE_NUM + OBB_NUM\n\n        # Draw mask\n        self.colors_obj = Colors()\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run 
inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        cuda.memcpy_dtoh_async(host_outputs[1], cuda_outputs[1], stream)\n\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n        output = host_outputs[0]\n        output_proto_mask = host_outputs[1]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid, result_proto_coef = self.post_process(\n                output[i * self.det_output_length: (i + 1) * self.det_output_length], batch_origin_h[i],\n                batch_origin_w[i]\n            )\n\n            if result_proto_coef.shape[0] == 0:\n                continue\n            result_masks = self.process_mask(output_proto_mask, result_proto_coef, result_boxes, batch_origin_h[i],\n                                             batch_origin_w[i])\n\n            self.draw_mask(result_masks, colors_=[self.colors_obj(x, True) for x in result_classid],\n                           im_src=batch_image_raw[i])\n\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        
self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            raw_bgr_image: a BGR image (HWC numpy array), e.g. as returned by cv2.imread\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate the resized width and height and the paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize so the long side fits the input size while keeping the aspect ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 
255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0]\n            y[:, 2] = x[:, 2]\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1]\n            y[:, 3] = x[:, 3]\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A numpy likes [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: finally boxes, a boxes numpy, each row is a box [x1, y1, x2, y2]\n            result_scores: finally scores, a numpy, each element is the score correspoing to box\n            result_classid: 
final class ids, a numpy array, each element is the class id corresponding to a box\n            result_proto_coef: mask coefficients of the kept boxes, one row per box\n        \"\"\"\n        # Get the num of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, self.det_row_output_length))[:num, :]\n\n        # Do nms\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        result_proto_coef = boxes[:, DET_NUM:int(DET_NUM + SEG_NUM)] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid, result_proto_coef\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 
= np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = (np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None)\n                      * np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None))\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections with lower object confidence score than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, (x1, y1, x2, y2, conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after nms with the shape (x1, y1, x2, y2, conf, cls_id)\n        \"\"\"\n        # Keep the boxes whose score >= conf_thres\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # clip the coordinates\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum 
suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, 5] == boxes[:, 5]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n    def sigmoid(self, x):\n        return 1 / (1 + np.exp(-x))\n\n    def scale_mask(self, mask, ih, iw):\n        mask = cv2.resize(mask, (self.input_w, self.input_h))\n        r_w = self.input_w / (iw * 1.0)\n        r_h = self.input_h / (ih * 1.0)\n        if r_h > r_w:\n            w = self.input_w\n            h = int(r_w * ih)\n            x = 0\n            y = int((self.input_h - h) / 2)\n        else:\n            w = int(r_h * iw)\n            h = self.input_h\n            x = int((self.input_w - w) / 2)\n            y = 0\n        crop = mask[y:y + h, x:x + w]\n        crop = cv2.resize(crop, (iw, ih))\n        return crop\n\n    def process_mask(self, output_proto_mask, result_proto_coef, result_boxes, ih, iw):\n        \"\"\"\n        description: Mask pred by yolov8 instance segmentation ,\n        param:\n            output_proto_mask: prototype mask e.g. 
(32, 160, 160) for 640x640 input\n            result_proto_coef: prototype mask coefficients (n, 32), n represents n results\n            result_boxes     :\n            ih: rows of original image\n            iw: cols of original image\n        return:\n            mask_result: (n, ih, iw)\n        \"\"\"\n        result_proto_masks = output_proto_mask.reshape(self.seg_c, self.seg_h, self.seg_w)\n        c, mh, mw = result_proto_masks.shape\n        print(result_proto_masks.shape)\n        print(result_proto_coef.shape)\n        masks = self.sigmoid((result_proto_coef @ result_proto_masks.astype(np.float32).reshape(c, -1))).reshape(-1, mh,\n                                                                                                                 mw)\n\n        mask_result = []\n        for mask, box in zip(masks, result_boxes):\n            mask_s = np.zeros((ih, iw))\n            crop_mask = self.scale_mask(mask, ih, iw)\n            x1 = int(box[0])\n            y1 = int(box[1])\n            x2 = int(box[2])\n            y2 = int(box[3])\n            crop = crop_mask[y1:y2, x1:x2]\n            crop = np.where(crop >= 0.5, 1, 0)\n            crop = crop.astype(np.uint8)\n            mask_s[y1:y2, x1:x2] = crop\n\n            mask_result.append(mask_s)\n        mask_result = np.array(mask_result)\n        return mask_result\n\n    def draw_mask(self, masks, colors_, im_src, alpha=0.5):\n        \"\"\"\n        description: Draw mask on image ,\n        param:\n            masks  : result_mask\n            colors_: color to draw mask\n            im_src : original image\n            alpha  : scale between original  image and mask\n        return:\n            no return\n        \"\"\"\n        if len(masks) == 0:\n            return\n        masks = np.asarray(masks, dtype=np.uint8)\n        masks = np.ascontiguousarray(masks.transpose(1, 2, 0))\n        masks = np.asarray(masks, dtype=np.float32)\n        colors_ = np.asarray(colors_, dtype=np.float32)\n     
   s = masks.sum(2, keepdims=True).clip(0, 1)\n        masks = (masks @ colors_).clip(0, 255)\n        im_src[:] = masks * alpha + im_src * (1 - s * alpha)\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov8_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov8_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov8_wrapper = yolov8_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov8_wrapper.infer(self.yolov8_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nclass Colors:\n    def __init__(self):\n        hexs = ('FF3838', 'FF9D97', 'FF701F', 'FFB21D', 'CFD231', '48F90A',\n                '92CC17', '3DDB86', '1A9334', '00D4BB', '2C99A8', '00C2FF',\n                '344593', '6473FF', '0018EC', '8438FF', '520085', 'CB38FF',\n                'FF95C8', 'FF37C7')\n        self.palette = [self.hex2rgb(f'#{c}') for c in hexs]\n        self.n = len(self.palette)\n\n    def __call__(self, i, bgr=False):\n        c = self.palette[int(i) % self.n]\n        return (c[2], c[1], c[0]) if bgr else c\n\n    @staticmethod\n    def hex2rgb(h):  # rgb order (PIL)\n        return tuple(int(h[1 + i:1 + i + 2], 16) for i in (0, 2, 4))\n\n\nif __name__ == 
\"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"./build/libmyplugins.so\"\n    engine_file_path = \"yolov8n-seg.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\n                  \"traffic light\",\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\n                  \"frisbee\",\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\n                  \"surfboard\",\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n                  \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\n                  \"cell phone\",\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\n                  \"teddy bear\",\n                  \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a YoLov8TRT instance\n    yolov8_wrapper = YoLov8TRT(engine_file_path)\n    try:\n        print('batch size is', yolov8_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolov8_wrapper.batch_size, 
image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov8_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov8_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov8_wrapper.destroy()\n"
  },
  {
    "path": "yolov9/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(TRTCreater)\n\nadd_definitions(-w)\nadd_definitions(-std=c++11)\nadd_definitions(-DAPI_EXPORTS)\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_BUILD_TYPE Debug)\nset(CMAKE_CUDA_ARCHITECTURES 75 86 89)\n\nMESSAGE(STATUS \"operation system is ${CMAKE_SYSTEM}\") \nIF (CMAKE_SYSTEM_NAME MATCHES \"Linux\")\n    MESSAGE(STATUS \"current platform: Linux \")\n    set(CUDA_COMPILER_PATH \"/usr/local/cuda/bin/nvcc\")\n    set(TENSORRT_PATH \"/home/benol/Package/TensorRT-8.6.1.6\")\n    include_directories(/usr/local/cuda/include)\n    link_directories(/usr/local/cuda/lib64)\n    link_directories(/usr/local/cuda/lib)\nELSEIF (CMAKE_SYSTEM_NAME MATCHES \"Windows\")\n    MESSAGE(STATUS \"current platform: Windows\")\n    set(CUDA_COMPILER_PATH \"C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.2/bin/nvcc.exe\")\n    set(TENSORRT_PATH \"D:\\\\Program Files\\\\TensorRT-8.6.1.6\")\n    set(OpenCV_DIR \"D:\\\\Program Files\\\\opencv\\\\build\")\n    include_directories(${PROJECT_SOURCE_DIR}/windows)\n    find_package(CUDA REQUIRED)\n    include_directories(${CUDA_INCLUDE_DIRS})\n    link_directories(${CUDA_LIBRARIES})\nELSE (CMAKE_SYSTEM_PROCESSOR MATCHES \"aarch64\")\n    MESSAGE(STATUS \"other platform: ${CMAKE_SYSTEM_PROCESSOR}\")\n    include_directories(/usr/local/cuda/targets/aarch64-linux/include)\n    link_directories(/usr/local/cuda/targets/aarch64-linux/lib)\nENDIF (CMAKE_SYSTEM_NAME MATCHES \"Linux\")\nset(CMAKE_CUDA_COMPILER ${CUDA_COMPILER_PATH})\nenable_language(CUDA)\n\n# tensorrt\ninclude_directories(${TENSORRT_PATH}/include)\nlink_directories(${TENSORRT_PATH}/lib)\n\nfind_package(OpenCV)\ninclude_directories(${OpenCV_INCLUDE_DIRS})\n\ninclude_directories(${PROJECT_SOURCE_DIR}/include/)\ninclude_directories(${PROJECT_SOURCE_DIR}/plugin/)\n\nfile(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/src/*.cpp ${PROJECT_SOURCE_DIR}/src/*.cu)\nfile(GLOB_RECURSE PLUGIN_SRCS ${PROJECT_SOURCE_DIR}/plugin/*.cu)\n\n# 
add_library(myplugins SHARED ${PLUGIN_SRCS})\nadd_library(myplugins SHARED ${PLUGIN_SRCS})\ntarget_link_libraries(myplugins nvinfer cudart)\n\nadd_executable(yolov9 demo.cpp ${SRCS})\ntarget_link_libraries(yolov9 nvinfer cudart myplugins ${OpenCV_LIBS})\n\n"
  },
  {
    "path": "yolov9/README.md",
    "content": "# YOLOv9\r\n\r\nThe Pytorch implementation is [WongKinYiu/yolov9](https://github.com/WongKinYiu/yolov9).\r\n\r\n## Contributors\r\n\r\n<a href=\"https://github.com/WuxinrongY\"><img src=\"https://avatars.githubusercontent.com/u/53141838?v=4?s=48\" width=\"40px;\" alt=\"\"/></a>\r\n\r\n## Progress\r\n- [x] YOLOv9-t\r\n- [x] YOLOv9-t-convert(gelan)\r\n- [x] YOLOv9-s\r\n- [x] YOLOv9-s-convert(gelan)\r\n- [x] YOLOv9-m\r\n- [x] YOLOv9-m-convert(gelan)\r\n- [x] YOLOv9-c\r\n- [x] YOLOv9-c-convert(gelan)\r\n- [x] YOLOv9-e\r\n- [x] YOLOv9-e-convert(gelan)\r\n\r\n## Requirements\r\n\r\n- TensorRT 8.0+\r\n- OpenCV 3.4.0+\r\n\r\n## Speed Test\r\n\r\nThe speed test is done on a desktop with R7-5700G CPU and RTX 4060Ti GPU. The input size is 640x640. The FP32, FP16 and INT8 models are tested. The time only includes the inference time, not includes the pre-processing and post-processing. The time is the average of 1000 times inference.\r\n\r\n| frame  | Model | FP32 | FP16 | INT8 |\r\n| --- | --- | --- | --- | --- |\r\n| tensorrt | YOLOv5-n | -ms | 0.58ms | -ms |\r\n| tensorrt | YOLOv5-s | -ms | 0.90ms | -ms |\r\n| tensorrt | YOLOv5-m | -ms | 1.9ms | -ms |\r\n| tensorrt | YOLOv5-l | -ms | 2.8ms | -ms |\r\n| tensorrt | YOLOv5-x | -ms | 5.1ms | -ms |\r\n| tensorrt | YOLOv9-t-convert | -ms | 1.37ms | -ms |\r\n| tensorrt | YOLOv9-s | -ms | 1.78ms | -ms |\r\n| tensorrt | YOLOv9-s-convert | -ms | 1.78ms | -ms |\r\n| tensorrt | YOLOv9-m | -ms | 3.1ms | -ms |\r\n| tensorrt | YOLOv9-m-convert | -ms | 2.8ms | -ms |\r\n| tensorrt | YOLOv9-c | 13.5ms | 4.6ms | 3.0ms |\r\n| tensorrt | YOLOv9-e | 8.3ms | 3.2ms | 2.15ms |\r\n\r\n**GELAN will be updated later.**\r\n\r\nYOLOv9-e is faster than YOLOv9-c in tensorrt, because the YOLOv9-e requires fewer layers of inference.\r\n\r\n```\r\nYOLOv9-c:\r\n[[31, 34, 37, 16, 19, 22], 1, DualDDetect, [nc]] # [A3, A4, A5, P3, P4, P5]\r\n\r\nYOLOv9-e:\r\n[[35, 32, 29, 42, 45, 48], 1, DualDDetect, [nc]]\r\n\r\n```\r\n\r\nIn DualDDetect, the A3, 
A4, A5, P3, P4, P5 are the outputs of the backbone. Only the first three are used to produce the final result.\r\n\r\nYOLOv9-c therefore has to run inference up to layer 37, while YOLOv9-e only has to run up to layer 35.\r\n\r\n## How to Run, yolov9 as example\r\n\r\n1. Generate .wts from the PyTorch .pt file, or download .wts from the model zoo\r\n\r\n```\r\n// download https://github.com/WongKinYiu/yolov9\r\ncp {tensorrtx}/yolov9/gen_wts.py {yolov9}/yolov9\r\ncd {yolov9}/yolov9\r\npython gen_wts.py\r\n// a file 'yolov9.wts' will be generated.\r\n```\r\n2. Build tensorrtx/yolov9 and run\r\n\r\n```\r\ncd {tensorrtx}/yolov9/\r\n// update kNumClass in config.h if your model is trained on a custom dataset\r\nmkdir build\r\ncd build\r\ncp {yolov9}/yolov9/yolov9.wts {tensorrtx}/yolov9/build\r\ncmake ..\r\nmake\r\nsudo ./yolov9 -s [.wts] [.engine] [c/e]  // serialize model to plan file\r\nsudo ./yolov9 -d [.engine] [image folder] // deserialize and run inference, the images in [image folder] will be processed.\r\n// For example yolov9\r\nsudo ./yolov9 -s yolov9-c.wts yolov9-c.engine c\r\nsudo ./yolov9 -d yolov9-c.engine ../images\r\n```\r\n\r\n3. Check the generated images, e.g. _zidane.jpg and _bus.jpg\r\n\r\n4. Optional: load and run the TensorRT model in Python\r\n\r\n```\r\n// install python-tensorrt, pycuda, etc.\r\n// ensure the yolov9.engine and libmyplugins.so have been built\r\npython yolov9_trt.py\r\n```\r\n\r\n## INT8 Quantization\r\n\r\n1. Prepare calibration images. You can randomly select around 1000 images from your training set. For COCO, you can also download my calibration images `coco_calib` from [GoogleDrive](https://drive.google.com/drive/folders/1s7jE9DtOngZMzJC1uL307J2MiaGwdRSI?usp=sharing) or [BaiduPan](https://pan.baidu.com/s/1GOm_-JobpyLMAqZWCDUhKg) pwd: a9wh\r\n\r\n2. Unzip it in yolov9/build\r\n\r\n3. Set the macro `USE_INT8` in config.h and change the path of the calibration images in config.h, such as 'gCalibTablePath=\"./coco_calib/\";'\r\n\r\n4. 
serialize the model and test\r\n\r\n<p align=\"center\">\r\n<img src=\"https://user-images.githubusercontent.com/15235574/78247927-4d9fac00-751e-11ea-8b1b-704a0aeb3fcf.jpg\" height=\"360px;\">\r\n</p>\r\n\r\n## More Information\r\n\r\nSee the README on the [home page](https://github.com/wang-xinyu/tensorrtx).\r\n"
  },
  {
    "path": "yolov9/demo.cpp",
    "content": "#include <chrono>\n#include <fstream>\n#include \"config.h\"\n#include \"cuda_utils.h\"\n#include \"logging.h\"\n#include \"model.h\"\n#include \"postprocess.h\"\n#include \"preprocess.h\"\n#include \"utils.h\"\n\nusing namespace nvinfer1;\nconst static int kOutputSize = kMaxNumOutputBbox * sizeof(Detection) / sizeof(float) + 1;\nstatic Logger gLogger;\nvoid serialize_engine(unsigned int max_batchsize, std::string& wts_name, std::string& sub_type,\n                      std::string& engine_name) {\n    // Create builder\n    IBuilder* builder = createInferBuilder(gLogger);\n    IBuilderConfig* config = builder->createBuilderConfig();\n\n    // Create model to populate the network, then set the outputs and create an engine\n    IHostMemory* serialized_engine = nullptr;\n    if (sub_type == \"t\") {\n        serialized_engine = build_engine_yolov9_t(max_batchsize, builder, config, DataType::kFLOAT, wts_name, false);\n    } else if (sub_type == \"s\") {\n        serialized_engine = build_engine_yolov9_s(max_batchsize, builder, config, DataType::kFLOAT, wts_name, false);\n    } else if (sub_type == \"m\") {\n        serialized_engine = build_engine_yolov9_m(max_batchsize, builder, config, DataType::kFLOAT, wts_name, false);\n    } else if (sub_type == \"c\") {\n        serialized_engine = build_engine_yolov9_c(max_batchsize, builder, config, DataType::kFLOAT, wts_name);\n    } else if (sub_type == \"e\") {\n        serialized_engine = build_engine_yolov9_e(max_batchsize, builder, config, DataType::kFLOAT, wts_name);\n    }\n\n    else if (sub_type == \"gt\") {\n        serialized_engine = build_engine_yolov9_t(max_batchsize, builder, config, DataType::kFLOAT, wts_name, true);\n    } else if (sub_type == \"gs\") {\n        serialized_engine = build_engine_yolov9_s(max_batchsize, builder, config, DataType::kFLOAT, wts_name, true);\n    } else if (sub_type == \"gm\") {\n        serialized_engine = build_engine_yolov9_m(max_batchsize, builder, config, 
DataType::kFLOAT, wts_name, true);\n    } else if (sub_type == \"gc\") {\n        serialized_engine = build_engine_gelan_c(max_batchsize, builder, config, DataType::kFLOAT, wts_name);\n    } else if (sub_type == \"ge\") {\n        serialized_engine = build_engine_gelan_e(max_batchsize, builder, config, DataType::kFLOAT, wts_name);\n    } else {\n        return;\n    }\n\n    assert(serialized_engine != nullptr);\n\n    std::ofstream p(engine_name, std::ios::binary);\n    if (!p) {\n        std::cerr << \"could not open plan output file\" << std::endl;\n        assert(false);\n    }\n    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());\n\n    delete config;\n    delete serialized_engine;\n    delete builder;\n}\n\nvoid deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine,\n                        IExecutionContext** context) {\n    std::ifstream file(engine_name, std::ios::binary);\n    if (!file.good()) {\n        std::cerr << \"read \" << engine_name << \" error!\" << std::endl;\n        assert(false);\n    }\n    size_t size = 0;\n    file.seekg(0, file.end);\n    size = file.tellg();\n    file.seekg(0, file.beg);\n    char* serialized_engine = new char[size];\n    assert(serialized_engine);\n    file.read(serialized_engine, size);\n    file.close();\n\n    *runtime = createInferRuntime(gLogger);\n    assert(*runtime);\n    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);\n    assert(*engine);\n    *context = (*engine)->createExecutionContext();\n    assert(*context);\n    delete[] serialized_engine;\n}\n\nvoid prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device,\n                    float** output_buffer_host) {\n    assert(engine->getNbBindings() == 2);\n    // In order to bind the buffers, we need to know the names of the input and output tensors.\n    // Note that indices are guaranteed to be less than 
IEngine::getNbBindings()\n    const int inputIndex = engine->getBindingIndex(kInputTensorName);\n    const int outputIndex = engine->getBindingIndex(kOutputTensorName);\n    assert(inputIndex == 0);\n    assert(outputIndex == 1);\n    // Create GPU buffers on device\n    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));\n    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));\n\n    *output_buffer_host = new float[kBatchSize * kOutputSize];\n}\n\nvoid infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchSize) {\n    // infer on the batch asynchronously, and DMA output back to host\n    context.enqueue(batchSize, buffers, stream, nullptr);\n    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost,\n                               stream));\n    CUDA_CHECK(cudaStreamSynchronize(stream));\n}\n\nbool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir,\n                std::string& sub_type) {\n    if (argc < 4)\n        return false;\n    if (std::string(argv[1]) == \"-s\" && argc == 5) {\n        wts = std::string(argv[2]);\n        engine = std::string(argv[3]);\n        sub_type = std::string(argv[4]);\n    } else if (std::string(argv[1]) == \"-d\" && argc == 4) {\n        engine = std::string(argv[2]);\n        img_dir = std::string(argv[3]);\n    } else {\n        return false;\n    }\n    return true;\n}\n\nint main(int argc, char** argv) {\n    cudaSetDevice(kGpuId);\n\n    // Defaults; the values relevant to the chosen mode are overwritten by parse_args below\n    std::string wts_name = \"\";\n    std::string engine_name = \"../yolov9-m-converted.engine\";\n    std::string img_dir = \"../images\";\n    std::string sub_type = \"m\";\n    // speed test or inference\n    const int speed_test_iter = 1000;\n    // const int speed_test_iter = 1;\n\n    if (!parse_args(argc, argv, wts_name, engine_name, img_dir, 
sub_type)) {\n        std::cerr << \"Arguments not right!\" << std::endl;\n        std::cerr << \"./yolov9 -s [.wts] [.engine] [t/s/m/c/e/gt/gs/gm/gc/ge]  // serialize model to plan file\"\n                  << std::endl;\n        std::cerr << \"./yolov9 -d [.engine] [image folder]  // deserialize plan file and run inference\" << std::endl;\n        return -1;\n    }\n\n    // Create a model using the API directly and serialize it to a file\n    if (!wts_name.empty()) {\n        serialize_engine(kBatchSize, wts_name, sub_type, engine_name);\n        return 0;\n    }\n\n    // Deserialize the engine from file\n    IRuntime* runtime = nullptr;\n    ICudaEngine* engine = nullptr;\n    IExecutionContext* context = nullptr;\n    deserialize_engine(engine_name, &runtime, &engine, &context);\n    cudaStream_t stream;\n    CUDA_CHECK(cudaStreamCreate(&stream));\n\n    cuda_preprocess_init(kMaxInputImageSize);\n\n    // Prepare cpu and gpu buffers\n    float* device_buffers[2];\n    float* output_buffer_host = nullptr;\n    prepare_buffer(engine, &device_buffers[0], &device_buffers[1], &output_buffer_host);\n\n    // Read images from directory\n    std::vector<std::string> file_names;\n    if (read_files_in_dir(img_dir.c_str(), file_names) < 0) {\n        std::cerr << \"read_files_in_dir failed.\" << std::endl;\n        return -1;\n    }\n\n    // batch predict\n    for (size_t i = 0; i < file_names.size(); i += kBatchSize) {\n        // Get a batch of images\n        std::vector<cv::Mat> img_batch;\n        std::vector<std::string> img_name_batch;\n        for (size_t j = i; j < i + kBatchSize && j < file_names.size(); j++) {\n            cv::Mat img = cv::imread(img_dir + \"/\" + file_names[j]);\n            img_batch.push_back(img);\n            img_name_batch.push_back(file_names[j]);\n        }\n\n        // Preprocess\n        cuda_batch_preprocess(img_batch, device_buffers[0], kInputW, kInputH, stream);\n\n        // Run inference\n        auto start = 
std::chrono::system_clock::now();\n        for (int j = 0; j < speed_test_iter; j++) {\n            infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize);\n        }\n        // infer(*context, stream, (void**)device_buffers, output_buffer_host, kBatchSize);\n        auto end = std::chrono::system_clock::now();\n        std::cout << \"inference time: \"\n                  << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 1000.0 /\n                             speed_test_iter\n                  << \"ms\" << std::endl;\n\n        // NMS\n        std::vector<std::vector<Detection>> res_batch;\n        batch_nms(res_batch, output_buffer_host, img_batch.size(), kOutputSize, kConfThresh, kNmsThresh);\n\n        // Draw bounding boxes\n        draw_bbox(img_batch, res_batch);\n\n        // Save images\n        for (size_t j = 0; j < img_batch.size(); j++) {\n            cv::imwrite(\"_\" + img_name_batch[j], img_batch[j]);\n        }\n    }\n\n    // Release stream and buffers\n    cudaStreamDestroy(stream);\n    CUDA_CHECK(cudaFree(device_buffers[0]));\n    CUDA_CHECK(cudaFree(device_buffers[1]));\n    delete[] output_buffer_host;\n    cuda_preprocess_destroy();\n    // Destroy the engine\n    delete context;\n    delete engine;\n    delete runtime;\n\n    // Print histogram of the output distribution\n    //std::cout << \"\\nOutput:\\n\\n\";\n    //for (unsigned int i = 0; i < kOutputSize; i++)\n    //{\n    //    std::cout << prob[i] << \", \";\n    //    if (i % 10 == 0) std::cout << std::endl;\n    //}\n    //std::cout << std::endl;\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov9/gen_wts.py",
    "content": "import sys  # noqa: F401\nimport argparse\nimport os\nimport struct\nimport torch\nfrom utils.torch_utils import select_device\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Convert .pt file to .wts')\n    parser.add_argument('-w', '--weights', default='yolov9-e.pt',\n                        help='Input weights (.pt) file path (required)')\n    parser.add_argument(\n        '-o', '--output', help='Output (.wts) file path (optional)')\n    parser.add_argument(\n        '-t', '--type', type=str, default='detect', choices=['detect', 'cls', 'seg'],\n        help='determines the model is detection/classification')\n    args = parser.parse_args()\n    if not os.path.isfile(args.weights):\n        raise SystemExit('Invalid input file')\n    if not args.output:\n        args.output = os.path.splitext(args.weights)[0] + '.wts'\n    elif os.path.isdir(args.output):\n        args.output = os.path.join(\n            args.output,\n            os.path.splitext(os.path.basename(args.weights))[0] + '.wts')\n    return args.weights, args.output, args.type\n\n\npt_file, wts_file, m_type = parse_args()\nprint(f'Generating .wts for {m_type} model')\n\n# Load model\nprint(f'Loading {pt_file}')\ndevice = select_device('cpu')\nmodel = torch.load(pt_file, map_location=device, weights_only=False)  # Load FP32 weights\nmodel = model['ema' if model.get('ema') else 'model'].float()\n\nif m_type in ['detect', 'seg']:\n    # update anchor_grid info\n    anchor_grid = model.model[-1].anchors * model.model[-1].stride[..., None, None]\n    # model.model[-1].anchor_grid = anchor_grid\n    # delattr(model.model[-1], 'anchor_grid')  # model.model[-1] is detect layer\n    # The parameters are saved in the OrderDict through the \"register_buffer\" method, and then saved to the weight.\n    model.model[-1].register_buffer(\"anchor_grid\", anchor_grid)\n    # model.model[-1].register_buffer(\"strides\", model.model[-1].stride)\n\nmodel.to(device).eval()\n\n# 
print(model.model)\n# Save model.model to a txt file\nwith open('model.txt', 'w') as f:\n    f.write(str(model.model))\nprint(f'Writing into {wts_file}')\nwith open(wts_file, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        for vv in vr:\n            f.write(' ')\n            f.write(struct.pack('>f', float(vv)).hex())\n        f.write('\\n')\nwts_file_key = wts_file.replace('.wts', '_key.txt')\nprint(f'Writing into {wts_file_key}')\nwith open(wts_file_key, 'w') as f:\n    f.write('{}\\n'.format(len(model.state_dict().keys())))\n    for k, v in model.state_dict().items():\n        vr = v.reshape(-1).cpu().numpy()\n        f.write('{} {} '.format(k, len(vr)))\n        f.write('\\n')\n"
  },
  {
    "path": "yolov9/include/block.h",
    "content": "#include \"config.h\"\n#include \"yololayer.h\"\n\n#include <cassert>\n#include <cmath>\n#include <cstring>\n#include <fstream>\n#include <iostream>\n#include <map>\n\nusing namespace nvinfer1;\n\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nvoid PrintDim(const ILayer* layer, std::string log = \"\");\nstd::map<std::string, Weights> loadWeights(const std::string file);\nint get_width(int x, float gw, int divisor = 8);\nint get_depth(int x, float gd);\nILayer* Proto(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c_, int c2,\n              std::string lname);\nstd::vector<std::vector<float>> getAnchors(std::map<std::string, Weights>& weightMap, std::string lname);\n// ----------------------------------------------------------------\nnvinfer1::ILayer* convBnSiLU(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap,\n                             nvinfer1::ITensor& input, int ch, int k, int s, int p, std::string lname, int g = 1);\nILayer* ELAN1(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2,\n              int c3, int c4, std::string lname);\nILayer* RepNCSPELAN4(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1,\n                     int c2, int c3, int c4, int c5, std::string lname);\nILayer* ADown(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c2,\n              std::string lname);\nILayer* AConv(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c2,\n              std::string lname);\nstd::vector<ILayer*> CBLinear(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\n                              std::vector<int> c2s, int k, int s, int p, int g, std::string lname);\nILayer* 
CBFuse(INetworkDefinition* network, std::vector<std::vector<ILayer*>> input, std::vector<int> idx,\n               std::vector<int> strides);\nILayer* SPPELAN(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2,\n                int c3, std::string lname);\nstd::vector<IConcatenationLayer*> DualDDetect(INetworkDefinition* network, std::map<std::string, Weights>& weightMap,\n                                              std::vector<ILayer*> dets, int cls, std::vector<int> ch,\n                                              std::string lname);\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> dets, bool is_segmentation);\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int k, int s, int p, std::string lname);\nnvinfer1::ILayer* convBnNoAct(nvinfer1::INetworkDefinition* network,\n                              std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int ch,\n                              int k, int s, int p, std::string lname, int g);\nstd::vector<IConcatenationLayer*> DDetect(INetworkDefinition* network, std::map<std::string, Weights>& weightMap,\n                                          std::vector<ILayer*> dets, int cls, std::vector<int> ch, std::string lname);\n"
  },
  {
    "path": "yolov9/include/calibrator.h",
    "content": "#pragma once\n\n#include \"macros.h\"\n#include <string>\n#include <vector>\n\n//! \\class Int8EntropyCalibrator2\n//!\n//! \\brief Implements Entropy calibrator 2.\n//!  CalibrationAlgoType is kENTROPY_CALIBRATION_2.\n//!\nclass Int8EntropyCalibrator2 : public nvinfer1::IInt8EntropyCalibrator2 {\npublic:\n    Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir, const char* calib_table_name, const char* input_blob_name, bool read_cache = true);\n\n    virtual ~Int8EntropyCalibrator2();\n    int getBatchSize() const TRT_NOEXCEPT override;\n    bool getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT override;\n    const void* readCalibrationCache(size_t& length) TRT_NOEXCEPT override;\n    void writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT override;\n\nprivate:\n    int batchsize_;\n    int input_w_;\n    int input_h_;\n    int img_idx_;\n    std::string img_dir_;\n    std::vector<std::string> img_files_;\n    size_t input_count_;\n    std::string calib_table_name_;\n    const char* input_blob_name_;\n    bool read_cache_;\n    void* device_input_;\n    std::vector<char> calib_cache_;\n};\n\n"
  },
  {
    "path": "yolov9/include/config.h",
    "content": "#pragma once\n\n/* --------------------------------------------------------\n * These configs are related to tensorrt model, if these are changed,\n * please re-compile and re-serialize the tensorrt model.\n * --------------------------------------------------------*/\n\n// For INT8, you need prepare the calibration dataset, please refer to\n// https://github.com/wang-xinyu/tensorrtx/tree/master/yolov5#int8-quantization\n#define USE_FP16  // set USE_INT8 or USE_FP16 or USE_FP32\n#ifdef USE_INT8\nconst static char* gCalibTablePath = \"./calib\";\n#endif\n\n// These are used to define input/output tensor names,\n// you can set them to whatever you want.\nconst static char* kInputTensorName = \"images\";\nconst static char* kOutputTensorName = \"output\";\n\n// Detection model and Segmentation model' number of classes\nconstexpr static int kNumClass = 80;\n\n// Classfication model's number of classes\nconstexpr static int kClsNumClass = 1000;\n\nconstexpr static int kBatchSize = 1;\n\n// Yolo's input width and height must by divisible by 32\nconstexpr static int kInputH = 640;\nconstexpr static int kInputW = 640;\n\n// Classfication model's input shape\nconstexpr static int kClsInputH = 224;\nconstexpr static int kClsInputW = 224;\n\n// Maximum number of output bounding boxes from yololayer plugin.\n// That is maximum number of output bounding boxes before NMS.\nconstexpr static int kMaxNumOutputBbox = 2000;\n\nconstexpr static int kNumAnchor = 3;\n\n// The bboxes whose confidence is lower than kIgnoreThresh will be ignored in yololayer plugin.\nconstexpr static float kIgnoreThresh = 0.05f;\n\n/* --------------------------------------------------------\n * These configs are NOT related to tensorrt model, if these are changed,\n * please re-compile, but no need to re-serialize the tensorrt model.\n * --------------------------------------------------------*/\n\n// NMS overlapping thresh and final detection confidence thresh\nconst static float 
kNmsThresh = 0.45f;\nconst static float kConfThresh = 0.1f;\n\nconst static int kGpuId = 0;\n\n// If your image size is larger than 4096 * 3112, please increase this value\nconst static int kMaxInputImageSize = 4096 * 3112;\n"
  },
  {
    "path": "yolov9/include/cuda_utils.h",
    "content": "#ifndef TRTX_CUDA_UTILS_H_\n#define TRTX_CUDA_UTILS_H_\n\n#include <cuda_runtime_api.h>\n\n#ifndef CUDA_CHECK\n#define CUDA_CHECK(callstr)\\\n    {\\\n        cudaError_t error_code = callstr;\\\n        if (error_code != cudaSuccess) {\\\n            std::cerr << \"CUDA error \" << error_code << \" at \" << __FILE__ << \":\" << __LINE__;\\\n            assert(0);\\\n        }\\\n    }\n#endif  // CUDA_CHECK\n\n#endif  // TRTX_CUDA_UTILS_H_\n\n"
  },
  {
    "path": "yolov9/include/logging.h",
    "content": "/*\n * Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n#ifndef TENSORRT_LOGGING_H\n#define TENSORRT_LOGGING_H\n\n#include \"NvInferRuntimeCommon.h\"\n#include <cassert>\n#include <ctime>\n#include <iomanip>\n#include <iostream>\n#include <ostream>\n#include <sstream>\n#include <string>\n#include \"macros.h\"\n\nusing Severity = nvinfer1::ILogger::Severity;\n\nclass LogStreamConsumerBuffer : public std::stringbuf\n{\npublic:\n    LogStreamConsumerBuffer(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mOutput(stream)\n        , mPrefix(prefix)\n        , mShouldLog(shouldLog)\n    {\n    }\n\n    LogStreamConsumerBuffer(LogStreamConsumerBuffer&& other)\n        : mOutput(other.mOutput)\n    {\n    }\n\n    ~LogStreamConsumerBuffer()\n    {\n        // std::streambuf::pbase() gives a pointer to the beginning of the buffered part of the output sequence\n        // std::streambuf::pptr() gives a pointer to the current position of the output sequence\n        // if the pointer to the beginning is not equal to the pointer to the current position,\n        // call putOutput() to log the output to the stream\n        if (pbase() != pptr())\n        {\n            putOutput();\n        }\n    }\n\n    // synchronizes the stream buffer and returns 0 on success\n    // synchronizing the stream buffer consists of inserting the buffer contents into the 
stream,\n    // resetting the buffer and flushing the stream\n    virtual int sync()\n    {\n        putOutput();\n        return 0;\n    }\n\n    void putOutput()\n    {\n        if (mShouldLog)\n        {\n            // prepend timestamp\n            std::time_t timestamp = std::time(nullptr);\n            tm* tm_local = std::localtime(&timestamp);\n            std::cout << \"[\";\n            std::cout << std::setw(2) << std::setfill('0') << 1 + tm_local->tm_mon << \"/\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_mday << \"/\";\n            std::cout << std::setw(4) << std::setfill('0') << 1900 + tm_local->tm_year << \"-\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_hour << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_min << \":\";\n            std::cout << std::setw(2) << std::setfill('0') << tm_local->tm_sec << \"] \";\n            // std::stringbuf::str() gets the string contents of the buffer\n            // insert the buffer contents pre-appended by the appropriate prefix into the stream\n            mOutput << mPrefix << str();\n            // set the buffer to empty\n            str(\"\");\n            // flush the stream\n            mOutput.flush();\n        }\n    }\n\n    void setShouldLog(bool shouldLog)\n    {\n        mShouldLog = shouldLog;\n    }\n\nprivate:\n    std::ostream& mOutput;\n    std::string mPrefix;\n    bool mShouldLog;\n};\n\n//!\n//! \\class LogStreamConsumerBase\n//! \\brief Convenience object used to initialize LogStreamConsumerBuffer before std::ostream in LogStreamConsumer\n//!\nclass LogStreamConsumerBase\n{\npublic:\n    LogStreamConsumerBase(std::ostream& stream, const std::string& prefix, bool shouldLog)\n        : mBuffer(stream, prefix, shouldLog)\n    {\n    }\n\nprotected:\n    LogStreamConsumerBuffer mBuffer;\n};\n\n//!\n//! \\class LogStreamConsumer\n//! 
\\brief Convenience object used to facilitate use of C++ stream syntax when logging messages.\n//!  Order of base classes is LogStreamConsumerBase and then std::ostream.\n//!  This is because the LogStreamConsumerBase class is used to initialize the LogStreamConsumerBuffer member field\n//!  in LogStreamConsumer and then the address of the buffer is passed to std::ostream.\n//!  This is necessary to prevent the address of an uninitialized buffer from being passed to std::ostream.\n//!  Please do not change the order of the parent classes.\n//!\nclass LogStreamConsumer : protected LogStreamConsumerBase, public std::ostream\n{\npublic:\n    //! \\brief Creates a LogStreamConsumer which logs messages with level severity.\n    //!  Reportable severity determines if the messages are severe enough to be logged.\n    LogStreamConsumer(Severity reportableSeverity, Severity severity)\n        : LogStreamConsumerBase(severityOstream(severity), severityPrefix(severity), severity <= reportableSeverity)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(severity <= reportableSeverity)\n        , mSeverity(severity)\n    {\n    }\n\n    LogStreamConsumer(LogStreamConsumer&& other)\n        : LogStreamConsumerBase(severityOstream(other.mSeverity), severityPrefix(other.mSeverity), other.mShouldLog)\n        , std::ostream(&mBuffer) // links the stream buffer with the stream\n        , mShouldLog(other.mShouldLog)\n        , mSeverity(other.mSeverity)\n    {\n    }\n\n    void setReportableSeverity(Severity reportableSeverity)\n    {\n        mShouldLog = mSeverity <= reportableSeverity;\n        mBuffer.setShouldLog(mShouldLog);\n    }\n\nprivate:\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? 
std::cout : std::cerr;\n    }\n\n    static std::string severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    bool mShouldLog;\n    Severity mSeverity;\n};\n\n//! \\class Logger\n//!\n//! \\brief Class which manages logging of TensorRT tools and samples\n//!\n//! \\details This class provides a common interface for TensorRT tools and samples to log information to the console,\n//! and supports logging two types of messages:\n//!\n//! - Debugging messages with an associated severity (info, warning, error, or internal error/fatal)\n//! - Test pass/fail messages\n//!\n//! The advantage of having all samples use this class for logging as opposed to emitting directly to stdout/stderr is\n//! that the logic for controlling the verbosity and formatting of sample output is centralized in one location.\n//!\n//! In the future, this class could be extended to support dumping test results to a file in some standard format\n//! (for example, JUnit XML), and providing additional metadata (e.g. timing the duration of a test run).\n//!\n//! TODO: For backwards compatibility with existing samples, this class inherits directly from the nvinfer1::ILogger\n//! interface, which is problematic since there isn't a clean separation between messages coming from the TensorRT\n//! library and messages coming from the sample.\n//!\n//! In the future (once all samples are updated to use Logger::getTRTLogger() to access the ILogger) we can refactor the\n//! class to eliminate the inheritance and instead make the nvinfer1::ILogger implementation a member of the Logger\n//! 
object.\n\nclass Logger : public nvinfer1::ILogger\n{\npublic:\n    Logger(Severity severity = Severity::kWARNING)\n        : mReportableSeverity(severity)\n    {\n    }\n\n    //!\n    //! \\enum TestResult\n    //! \\brief Represents the state of a given test\n    //!\n    enum class TestResult\n    {\n        kRUNNING, //!< The test is running\n        kPASSED,  //!< The test passed\n        kFAILED,  //!< The test failed\n        kWAIVED   //!< The test was waived\n    };\n\n    //!\n    //! \\brief Forward-compatible method for retrieving the nvinfer::ILogger associated with this Logger\n    //! \\return The nvinfer1::ILogger associated with this Logger\n    //!\n    //! TODO Once all samples are updated to use this method to register the logger with TensorRT,\n    //! we can eliminate the inheritance of Logger from ILogger\n    //!\n    nvinfer1::ILogger& getTRTLogger()\n    {\n        return *this;\n    }\n\n    //!\n    //! \\brief Implementation of the nvinfer1::ILogger::log() virtual method\n    //!\n    //! Note samples should not be calling this function directly; it will eventually go away once we eliminate the\n    //! inheritance from nvinfer1::ILogger\n    //!\n    void log(Severity severity, const char* msg) TRT_NOEXCEPT override \n    {\n        LogStreamConsumer(mReportableSeverity, severity) << \"[TRT] \" << std::string(msg) << std::endl;\n    }\n\n    //!\n    //! \\brief Method for controlling the verbosity of logging output\n    //!\n    //! \\param severity The logger will only emit messages that have severity of this level or higher.\n    //!\n    void setReportableSeverity(Severity severity)\n    {\n        mReportableSeverity = severity;\n    }\n\n    //!\n    //! \\brief Opaque handle that holds logging information for a particular test\n    //!\n    //! This object is an opaque handle to information used by the Logger to print test results.\n    //! 
The sample must call Logger::defineTest() in order to obtain a TestAtom that can be used\n    //! with Logger::reportTest{Start,End}().\n    //!\n    class TestAtom\n    {\n    public:\n        TestAtom(TestAtom&&) = default;\n\n    private:\n        friend class Logger;\n\n        TestAtom(bool started, const std::string& name, const std::string& cmdline)\n            : mStarted(started)\n            , mName(name)\n            , mCmdline(cmdline)\n        {\n        }\n\n        bool mStarted;\n        std::string mName;\n        std::string mCmdline;\n    };\n\n    //!\n    //! \\brief Define a test for logging\n    //!\n    //! \\param[in] name The name of the test.  This should be a string starting with\n    //!                  \"TensorRT\" and containing dot-separated strings containing\n    //!                  the characters [A-Za-z0-9_].\n    //!                  For example, \"TensorRT.sample_googlenet\"\n    //! \\param[in] cmdline The command line used to reproduce the test\n    //\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    //!\n    static TestAtom defineTest(const std::string& name, const std::string& cmdline)\n    {\n        return TestAtom(false, name, cmdline);\n    }\n\n    //!\n    //! \\brief A convenience overloaded version of defineTest() that accepts an array of command-line arguments\n    //!        as input\n    //!\n    //! \\param[in] name The name of the test\n    //! \\param[in] argc The number of command-line arguments\n    //! \\param[in] argv The array of command-line arguments (given as C strings)\n    //!\n    //! \\return a TestAtom that can be used in Logger::reportTest{Start,End}().\n    static TestAtom defineTest(const std::string& name, int argc, char const* const* argv)\n    {\n        auto cmdline = genCmdlineString(argc, argv);\n        return defineTest(name, cmdline);\n    }\n\n    //!\n    //! \\brief Report that a test has started.\n    //!\n    //! 
\\pre reportTestStart() has not been called yet for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has started\n    //!\n    static void reportTestStart(TestAtom& testAtom)\n    {\n        reportTestResult(testAtom, TestResult::kRUNNING);\n        assert(!testAtom.mStarted);\n        testAtom.mStarted = true;\n    }\n\n    //!\n    //! \\brief Report that a test has ended.\n    //!\n    //! \\pre reportTestStart() has been called for the given testAtom\n    //!\n    //! \\param[in] testAtom The handle to the test that has ended\n    //! \\param[in] result The result of the test. Should be one of TestResult::kPASSED,\n    //!                   TestResult::kFAILED, TestResult::kWAIVED\n    //!\n    static void reportTestEnd(const TestAtom& testAtom, TestResult result)\n    {\n        assert(result != TestResult::kRUNNING);\n        assert(testAtom.mStarted);\n        reportTestResult(testAtom, result);\n    }\n\n    static int reportPass(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kPASSED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportFail(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kFAILED);\n        return EXIT_FAILURE;\n    }\n\n    static int reportWaive(const TestAtom& testAtom)\n    {\n        reportTestEnd(testAtom, TestResult::kWAIVED);\n        return EXIT_SUCCESS;\n    }\n\n    static int reportTest(const TestAtom& testAtom, bool pass)\n    {\n        return pass ? reportPass(testAtom) : reportFail(testAtom);\n    }\n\n    Severity getReportableSeverity() const\n    {\n        return mReportableSeverity;\n    }\n\nprivate:\n    //!\n    //! 
\\brief returns an appropriate string for prefixing a log message with the given severity\n    //!\n    static const char* severityPrefix(Severity severity)\n    {\n        switch (severity)\n        {\n        case Severity::kINTERNAL_ERROR: return \"[F] \";\n        case Severity::kERROR: return \"[E] \";\n        case Severity::kWARNING: return \"[W] \";\n        case Severity::kINFO: return \"[I] \";\n        case Severity::kVERBOSE: return \"[V] \";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate string for prefixing a test result message with the given result\n    //!\n    static const char* testResultString(TestResult result)\n    {\n        switch (result)\n        {\n        case TestResult::kRUNNING: return \"RUNNING\";\n        case TestResult::kPASSED: return \"PASSED\";\n        case TestResult::kFAILED: return \"FAILED\";\n        case TestResult::kWAIVED: return \"WAIVED\";\n        default: assert(0); return \"\";\n        }\n    }\n\n    //!\n    //! \\brief returns an appropriate output stream (cout or cerr) to use with the given severity\n    //!\n    static std::ostream& severityOstream(Severity severity)\n    {\n        return severity >= Severity::kINFO ? std::cout : std::cerr;\n    }\n\n    //!\n    //! \\brief method that implements logging test results\n    //!\n    static void reportTestResult(const TestAtom& testAtom, TestResult result)\n    {\n        severityOstream(Severity::kINFO) << \"&&&& \" << testResultString(result) << \" \" << testAtom.mName << \" # \"\n                                         << testAtom.mCmdline << std::endl;\n    }\n\n    //!\n    //! 
\\brief generate a command line string from the given (argc, argv) values\n    //!\n    static std::string genCmdlineString(int argc, char const* const* argv)\n    {\n        std::stringstream ss;\n        for (int i = 0; i < argc; i++)\n        {\n            if (i > 0)\n                ss << \" \";\n            ss << argv[i];\n        }\n        return ss.str();\n    }\n\n    Severity mReportableSeverity;\n};\n\nnamespace\n{\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kVERBOSE\n//!\n//! Example usage:\n//!\n//!     LOG_VERBOSE(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_VERBOSE(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kVERBOSE);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINFO\n//!\n//! Example usage:\n//!\n//!     LOG_INFO(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_INFO(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINFO);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kWARNING\n//!\n//! Example usage:\n//!\n//!     LOG_WARN(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_WARN(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kWARNING);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kERROR\n//!\n//! Example usage:\n//!\n//!     LOG_ERROR(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_ERROR(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kERROR);\n}\n\n//!\n//! \\brief produces a LogStreamConsumer object that can be used to log messages of severity kINTERNAL_ERROR\n//!        (\"fatal\" severity)\n//!\n//! 
Example usage:\n//!\n//!     LOG_FATAL(logger) << \"hello world\" << std::endl;\n//!\ninline LogStreamConsumer LOG_FATAL(const Logger& logger)\n{\n    return LogStreamConsumer(logger.getReportableSeverity(), Severity::kINTERNAL_ERROR);\n}\n\n} // anonymous namespace\n\n#endif // TENSORRT_LOGGING_H\n"
  },
  {
    "path": "yolov9/include/macros.h",
    "content": "#ifndef __MACROS_H\n#define __MACROS_H\n\n#include <NvInfer.h>\n\n#ifdef API_EXPORTS\n#if defined(_MSC_VER)\n#define API __declspec(dllexport)\n#else\n#define API __attribute__((visibility(\"default\")))\n#endif\n#else\n\n#if defined(_MSC_VER)\n#define API __declspec(dllimport)\n#else\n#define API\n#endif\n#endif  // API_EXPORTS\n\n#if NV_TENSORRT_MAJOR >= 8\n#define TRT_NOEXCEPT noexcept\n#define TRT_CONST_ENQUEUE const\n#else\n#define TRT_NOEXCEPT\n#define TRT_CONST_ENQUEUE\n#endif\n\n#endif  // __MACROS_H\n"
  },
  {
    "path": "yolov9/include/model.h",
    "content": "#pragma once\n\n#include <NvInfer.h>\n#include <string>\n// yolov9\nnvinfer1::IHostMemory* build_engine_yolov9_t(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                             nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt,\n                                             std::string& wts_name, bool isConvert = false);\nnvinfer1::IHostMemory* build_engine_yolov9_s(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                             nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt,\n                                             std::string& wts_name, bool isConvert = false);\nnvinfer1::IHostMemory* build_engine_yolov9_m(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                             nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt,\n                                             std::string& wts_name, bool isConvert = false);\nnvinfer1::IHostMemory* build_engine_yolov9_c(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                             nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt,\n                                             std::string& wts_name);\nnvinfer1::IHostMemory* build_engine_yolov9_e(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                             nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt,\n                                             std::string& wts_name);\n// gelan\nnvinfer1::IHostMemory* build_engine_gelan_t(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                            nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt,\n                                            std::string& wts_name);\nnvinfer1::IHostMemory* build_engine_gelan_m(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                            nvinfer1::IBuilderConfig* config, 
nvinfer1::DataType dt,\n                                            std::string& wts_name);\nnvinfer1::IHostMemory* build_engine_gelan_c(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                            nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt,\n                                            std::string& wts_name);\nnvinfer1::IHostMemory* build_engine_gelan_e(unsigned int maxBatchSize, nvinfer1::IBuilder* builder,\n                                            nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt,\n                                            std::string& wts_name);\n"
  },
  {
    "path": "yolov9/include/postprocess.h",
    "content": "#pragma once\n\n#include \"types.h\"\n#include <opencv2/opencv.hpp>\n#include <cuda_runtime.h>\ncv::Rect get_rect(cv::Mat& img, float bbox[4]);\n\nvoid nms(std::vector<Detection>& res, float *output, float conf_thresh, float nms_thresh = 0.5);\n\nvoid batch_nms(std::vector<std::vector<Detection>>& batch_res, float *output, int batch_size, int output_size, float conf_thresh, float nms_thresh = 0.5);\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch);\n\nstd::vector<cv::Mat> process_mask(const float* proto, int proto_size, std::vector<Detection>& dets);\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks, std::unordered_map<int, std::string>& labels_map);\n// cuda NMS\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold,float* parray,int max_objects, cudaStream_t stream);\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream);\nvoid batch_process(std::vector<std::vector<Detection>> &res_batch, const float* decode_ptr_host, int batch_size, int bbox_element, const std::vector<cv::Mat>& img_batch);"
  },
  {
    "path": "yolov9/include/preprocess.h",
    "content": "#pragma once\n\n#include <cuda_runtime.h>\n#include <cstdint>\n#include <opencv2/opencv.hpp>\n\nvoid cuda_preprocess_init(int max_image_size);\nvoid cuda_preprocess_destroy();\nvoid cuda_preprocess(uint8_t* src, int src_width, int src_height,\n                     float* dst, int dst_width, int dst_height,\n                     cudaStream_t stream);\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch,\n                           float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream);\n\n"
  },
  {
    "path": "yolov9/include/types.h",
    "content": "#pragma once\n\n#include \"config.h\"\n\nstruct YoloKernel {\n    int width;\n    int height;\n    float anchors[kNumAnchor * 2];\n};\n\nstruct alignas(float) Detection {\n    float bbox[4];  // center_x center_y w h\n    float conf;  // bbox_conf * cls_conf\n    float class_id;\n    float mask[32];\n};\nconst int bbox_element = 7; // center_x, center_y, w, h, conf, cls, obj\n"
  },
  {
    "path": "yolov9/include/utils.h",
    "content": "#pragma once\n\n#include <dirent.h>\n#include <cstring>\n#include <fstream>\n#include <sstream>\n#include <string>\n#include <unordered_map>\n#include <vector>\n\nstatic inline int read_files_in_dir(const char* p_dir_name, std::vector<std::string>& file_names) {\n    DIR* p_dir = opendir(p_dir_name);\n    if (p_dir == nullptr) {\n        return -1;\n    }\n\n    struct dirent* p_file = nullptr;\n    while ((p_file = readdir(p_dir)) != nullptr) {\n        if (strcmp(p_file->d_name, \".\") != 0 && strcmp(p_file->d_name, \"..\") != 0) {\n            //std::string cur_file_name(p_dir_name);\n            //cur_file_name += \"/\";\n            //cur_file_name += p_file->d_name;\n            std::string cur_file_name(p_file->d_name);\n            file_names.push_back(cur_file_name);\n        }\n    }\n\n    closedir(p_dir);\n    return 0;\n}\n// Function to trim leading and trailing whitespace from a string\nstatic inline std::string trim_leading_whitespace(const std::string& str) {\n    size_t first = str.find_first_not_of(' ');\n    if (std::string::npos == first) {\n        return str;\n    }\n    size_t last = str.find_last_not_of(' ');\n    return str.substr(first, (last - first + 1));\n}\n\n// Src: https://stackoverflow.com/questions/16605967\nstatic inline std::string to_string_with_precision(const float a_value, const int n = 2) {\n    std::ostringstream out;\n    out.precision(n);\n    out << std::fixed << a_value;\n    return out.str();\n}\n\nstatic inline int read_labels(const std::string labels_filename, std::unordered_map<int, std::string>& labels_map) {\n\n    std::ifstream file(labels_filename);\n    // Read each line of the file\n    std::string line;\n    int index = 0;\n    while (std::getline(file, line)) {\n        // Strip the line of any leading or trailing whitespace\n        line = trim_leading_whitespace(line);\n\n        // Add the stripped line to the labels_map, using the loop index as the key\n        labels_map[index] = 
line;\n        index++;\n    }\n    // Close the file\n    file.close();\n\n    return 0;\n}\n"
  },
  {
    "path": "yolov9/plugin/yololayer.cu",
    "content": "#include \"yololayer.h\"\n#include \"types.h\"\n#include <assert.h>\n#include <math.h>\n#include \"cuda_utils.h\"\n#include <vector>\n#include <iostream>\n\nnamespace Tn {\n    template<typename T>\n    void write(char*& buffer, const T& val) {\n        *reinterpret_cast<T*>(buffer) = val;\n        buffer += sizeof(T);\n    }\n\n    template<typename T>\n    void read(const char*& buffer, T& val) {\n        val = *reinterpret_cast<const T*>(buffer);\n        buffer += sizeof(T);\n    }\n}  // namespace Tn\n\n\nnamespace nvinfer1 {\nYoloLayerPlugin::YoloLayerPlugin(int classCount, int netWidth, int netHeight, int maxOut, bool is_segmentation) {\n    mClassCount = classCount;\n    mYoloV8NetWidth = netWidth;\n    mYoloV8netHeight = netHeight;\n    mMaxOutObject = maxOut;\n    is_segmentation_ = is_segmentation;\n}\n\nYoloLayerPlugin::~YoloLayerPlugin() {}\n\nYoloLayerPlugin::YoloLayerPlugin(const void* data, size_t length) {\n    using namespace Tn;\n    const char* d = reinterpret_cast<const char*>(data), * a = d;\n    read(d, mClassCount);\n    read(d, mThreadCount);\n    read(d, mYoloV8NetWidth);\n    read(d, mYoloV8netHeight);\n    read(d, mMaxOutObject);\n    read(d, is_segmentation_);\n\n    assert(d == a + length);\n}\n\nvoid YoloLayerPlugin::serialize(void* buffer) const TRT_NOEXCEPT {\n\n    using namespace Tn;\n    char* d = static_cast<char*>(buffer), * a = d;\n    write(d, mClassCount);\n    write(d, mThreadCount);\n    write(d, mYoloV8NetWidth);\n    write(d, mYoloV8netHeight);\n    write(d, mMaxOutObject);\n    write(d, is_segmentation_);\n\n    assert(d == a + getSerializationSize());\n}\n\nsize_t YoloLayerPlugin::getSerializationSize() const TRT_NOEXCEPT {\n    return sizeof(mClassCount) + sizeof(mThreadCount) + sizeof(mYoloV8netHeight) + sizeof(mYoloV8NetWidth) + sizeof(mMaxOutObject) + sizeof(is_segmentation_);\n}\n\nint YoloLayerPlugin::initialize() TRT_NOEXCEPT {\n    return 0;\n}\n\nnvinfer1::Dims 
YoloLayerPlugin::getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) TRT_NOEXCEPT {\n    int total_size = mMaxOutObject * sizeof(Detection) / sizeof(float);\n    return nvinfer1::Dims3(total_size + 1, 1, 1);\n}\n\nvoid YoloLayerPlugin::setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT {\n    mPluginNamespace = pluginNamespace;\n}\n\nconst char* YoloLayerPlugin::getPluginNamespace() const TRT_NOEXCEPT {\n    return mPluginNamespace;\n}\n\nnvinfer1::DataType YoloLayerPlugin::getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const TRT_NOEXCEPT {\n    return nvinfer1::DataType::kFLOAT;\n}\n\nbool YoloLayerPlugin::isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT {\n\n    return false;\n}\n\nbool YoloLayerPlugin::canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT {\n\n    return false;\n}\n\nvoid YoloLayerPlugin::configurePlugin(nvinfer1::PluginTensorDesc const* in, int nbInput, nvinfer1::PluginTensorDesc const* out, int nbOutput) TRT_NOEXCEPT {}\n\nvoid YoloLayerPlugin::attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT {}\n\nvoid YoloLayerPlugin::detachFromContext() TRT_NOEXCEPT {}\n\nconst char* YoloLayerPlugin::getPluginType() const TRT_NOEXCEPT {\n\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloLayerPlugin::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nvoid YoloLayerPlugin::destroy() TRT_NOEXCEPT {\n\n    delete this;\n}\n\nnvinfer1::IPluginV2IOExt* YoloLayerPlugin::clone() const TRT_NOEXCEPT {\n\n    YoloLayerPlugin* p = new YoloLayerPlugin(mClassCount, mYoloV8NetWidth, mYoloV8netHeight, mMaxOutObject, is_segmentation_);\n    p->setPluginNamespace(mPluginNamespace);\n    return p;\n}\n\n// The parameter qualifiers must match the declaration in yololayer.h: TRT_CONST_ENQUEUE applies to outputs, not inputs.\nint YoloLayerPlugin::enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) 
TRT_NOEXCEPT {\n\n    forwardGpu((const float* const*)inputs, (float*)outputs[0], stream, mYoloV8netHeight, mYoloV8NetWidth, batchSize);\n    return 0;\n}\n\n\n__device__ float Logist(float data) { return 1.0f / (1.0f + expf(-data)); };\n\n__global__ void CalDetection(const float* input, float* output, int numElements, int maxoutobject,\n                             const int grid_h, int grid_w, const int stride, int classes, int outputElem, bool is_segmentation) {\n    int idx = threadIdx.x + blockDim.x * blockIdx.x;\n    if (idx >= numElements) return;\n\n    int total_grid = grid_h * grid_w;\n    int info_len = 4 + classes;\n    if (is_segmentation) info_len += 32;\n    int batchIdx = idx / total_grid;\n    int elemIdx = idx % total_grid;\n    const float* curInput = input + batchIdx * total_grid * info_len;\n    int outputIdx = batchIdx * outputElem;\n\n    int class_id = 0;\n    float max_cls_prob = 0.0;\n    for (int i = 4; i < 4 + classes; i++) {\n        float p = Logist(curInput[elemIdx + i * total_grid]);\n        if (p > max_cls_prob) {\n            max_cls_prob = p;\n            class_id = i - 4;\n        }\n    }\n\n    if (max_cls_prob < 0.1) return;\n\n    int count = (int)atomicAdd(output + outputIdx, 1);\n    if (count >= maxoutobject) return;\n    char* data = (char*)(output + outputIdx) + sizeof(float) + count * sizeof(Detection);\n    Detection* det = (Detection*)(data);\n\n    int row = elemIdx / grid_w;\n    int col = elemIdx % grid_w;\n\n    det->conf = max_cls_prob;\n    det->class_id = class_id;\n    det->bbox[0] = (col + 0.5f - curInput[elemIdx + 0 * total_grid]) * stride;\n    det->bbox[1] = (row + 0.5f - curInput[elemIdx + 1 * total_grid]) * stride;\n    det->bbox[2] = (col + 0.5f + curInput[elemIdx + 2 * total_grid]) * stride;\n    det->bbox[3] = (row + 0.5f + curInput[elemIdx + 3 * total_grid]) * stride;\n\n    for (int k = 0; is_segmentation && k < 32; k++) {\n        det->mask[k] = curInput[elemIdx + (k + 4 + classes) * 
total_grid];\n    }\n}\n\nvoid YoloLayerPlugin::forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV8netHeight, int mYoloV8NetWidth, int batchSize) {\n    int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);\n    // Reset the per-image detection counter stored in the first float of each output slot\n    for (int idx = 0; idx < batchSize; ++idx) {\n        CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0, sizeof(float), stream));\n    }\n    int numElem = 0;\n    int grids[3][2] = { {mYoloV8netHeight / 8, mYoloV8NetWidth / 8}, {mYoloV8netHeight / 16, mYoloV8NetWidth / 16}, {mYoloV8netHeight / 32, mYoloV8NetWidth / 32} };\n    int strides[] = { 8, 16, 32 };\n    for (unsigned int i = 0; i < 3; i++) {\n        int grid_h = grids[i][0];\n        int grid_w = grids[i][1];\n        int stride = strides[i];\n        numElem = grid_h * grid_w * batchSize;\n        // Clamp the block size per launch without overwriting the mThreadCount member\n        int threadCount = numElem < mThreadCount ? numElem : mThreadCount;\n\n        CalDetection<<<(numElem + threadCount - 1) / threadCount, threadCount, 0, stream>>>\n            (inputs[i], output, numElem, mMaxOutObject, grid_h, grid_w, stride, mClassCount, outputElem, is_segmentation_);\n    }\n}\n\nPluginFieldCollection YoloPluginCreator::mFC{};\nstd::vector<PluginField> YoloPluginCreator::mPluginAttributes;\n\nYoloPluginCreator::YoloPluginCreator() {\n    mPluginAttributes.clear();\n    mFC.nbFields = mPluginAttributes.size();\n    mFC.fields = mPluginAttributes.data();\n}\n\nconst char* YoloPluginCreator::getPluginName() const TRT_NOEXCEPT {\n    return \"YoloLayer_TRT\";\n}\n\nconst char* YoloPluginCreator::getPluginVersion() const TRT_NOEXCEPT {\n    return \"1\";\n}\n\nconst PluginFieldCollection* YoloPluginCreator::getFieldNames() TRT_NOEXCEPT {\n    return &mFC;\n}\n\nIPluginV2IOExt* YoloPluginCreator::createPlugin(const char* name, const PluginFieldCollection* fc) TRT_NOEXCEPT {\n    assert(fc->nbFields == 1);\n    assert(strcmp(fc->fields[0].name, \"netinfo\") == 0);\n    int* p_netinfo = 
(int*)(fc->fields[0].data);\n    int class_count = p_netinfo[0];\n    int input_w = p_netinfo[1];\n    int input_h = p_netinfo[2];\n    int max_output_object_count = p_netinfo[3];\n    bool is_segmentation = p_netinfo[4];\n    YoloLayerPlugin* obj = new YoloLayerPlugin(class_count, input_w, input_h, max_output_object_count, is_segmentation);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\nIPluginV2IOExt* YoloPluginCreator::deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT {\n    // This object will be deleted when the network is destroyed, which will\n    // call YoloLayerPlugin::destroy()\n    YoloLayerPlugin* obj = new YoloLayerPlugin(serialData, serialLength);\n    obj->setPluginNamespace(mNamespace.c_str());\n    return obj;\n}\n\n} // namespace nvinfer1\n"
  },
  {
    "path": "yolov9/plugin/yololayer.h",
    "content": "#pragma once\n#include \"macros.h\"\n#include \"NvInfer.h\"\n#include <string>\n#include <vector>\n#include \"macros.h\"\nnamespace nvinfer1 {\nclass API YoloLayerPlugin : public IPluginV2IOExt {\npublic:\n        YoloLayerPlugin(int classCount, int netWdith, int netHeight, int maxOut, bool is_segmentation);\n        YoloLayerPlugin(const void* data, size_t length);\n        ~YoloLayerPlugin();\n\n        int getNbOutputs() const TRT_NOEXCEPT override {\n            return 1;\n        }\n\n        nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs, int nbInputDims) TRT_NOEXCEPT override;\n\n        int initialize() TRT_NOEXCEPT override;\n\n        virtual void terminate() TRT_NOEXCEPT override {}\n\n        virtual size_t getWorkspaceSize(int maxBatchSize) const TRT_NOEXCEPT override { return 0; }\n\n        virtual int enqueue(int batchSize, const void* const* inputs, void* TRT_CONST_ENQUEUE* outputs, void* workspace, cudaStream_t stream) TRT_NOEXCEPT override;\n\n        virtual size_t getSerializationSize() const TRT_NOEXCEPT override;\n\n        virtual void serialize(void* buffer) const TRT_NOEXCEPT override;\n\n        bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const TRT_NOEXCEPT override {\n            return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;\n        }\n\n\n        const char* getPluginType() const TRT_NOEXCEPT override;\n\n        const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n        void destroy() TRT_NOEXCEPT override;\n\n        IPluginV2IOExt* clone() const TRT_NOEXCEPT override;\n\n        void setPluginNamespace(const char* pluginNamespace) TRT_NOEXCEPT override;\n\n        const char* getPluginNamespace() const TRT_NOEXCEPT override;\n\n        nvinfer1::DataType getOutputDataType(int32_t index, nvinfer1::DataType const* inputTypes, int32_t nbInputs) const TRT_NOEXCEPT;\n\n        bool 
isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const TRT_NOEXCEPT override;\n\n        bool canBroadcastInputAcrossBatch(int inputIndex) const TRT_NOEXCEPT override;\n\n        void attachToContext(cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) TRT_NOEXCEPT override;\n\n        void configurePlugin(PluginTensorDesc const* in, int32_t nbInput, PluginTensorDesc const* out, int32_t nbOutput) TRT_NOEXCEPT override;\n\n        void detachFromContext() TRT_NOEXCEPT override;\n\n    private:\n        void forwardGpu(const float* const* inputs, float* output, cudaStream_t stream, int mYoloV8netHeight, int mYoloV8NetWidth, int batchSize);\n        int mThreadCount = 256;\n        const char* mPluginNamespace;\n        int mClassCount;\n        int mYoloV8NetWidth;\n        int mYoloV8netHeight;\n        int mMaxOutObject;\n        bool is_segmentation_;\n    };\n\nclass API YoloPluginCreator : public IPluginCreator {\npublic:\n        YoloPluginCreator();\n        ~YoloPluginCreator() override = default;\n\n        const char* getPluginName() const TRT_NOEXCEPT override;\n\n        const char* getPluginVersion() const TRT_NOEXCEPT override;\n\n        const nvinfer1::PluginFieldCollection* getFieldNames() TRT_NOEXCEPT override;\n\n        nvinfer1::IPluginV2IOExt* createPlugin(const char* name, const nvinfer1::PluginFieldCollection* fc) TRT_NOEXCEPT override;\n\n        nvinfer1::IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) TRT_NOEXCEPT override;\n\n        void setPluginNamespace(const char* libNamespace) TRT_NOEXCEPT override {\n            mNamespace = libNamespace;\n        }\n\n        const char* getPluginNamespace() const TRT_NOEXCEPT override {\n            return mNamespace.c_str();\n        }\n\n    private:\n        std::string mNamespace;\n        static PluginFieldCollection mFC;\n        static std::vector<PluginField> 
mPluginAttributes;\n    };\n    REGISTER_TENSORRT_PLUGIN(YoloPluginCreator);\n} // namespace nvinfer1\n\n"
  },
  {
    "path": "yolov9/src/block.cpp",
    "content": "#include \"block.h\"\n#include \"calibrator.h\"\n#include \"config.h\"\n#include \"yololayer.h\"\n\n#include <algorithm>\n#include <cassert>\n#include <cmath>\n#include <cstring>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include <numeric>\n#include <vector>\n\nusing namespace nvinfer1;\n// TensorRT weight files have a simple space delimited format:\n// [type] [size] <data x size in hex>\nvoid PrintDim(const ILayer* layer, std::string log) {\n    Dims dim = layer->getOutput(0)->getDimensions();\n    std::cout << log << \": \"\n              << \"\\t\\t\\t\\t\";\n    for (int i = 0; i < dim.nbDims; i++) {\n        std::cout << dim.d[i] << \" \";\n    }\n    std::cout << std::endl;\n}\n\nstd::map<std::string, Weights> loadWeights(const std::string file) {\n    std::cout << \"Loading weights: \" << file << std::endl;\n    std::map<std::string, Weights> weightMap;\n\n    // Open weights file\n    std::ifstream input(file);\n    assert(input.is_open() && \"Unable to load weight file. 
please check if the .wts file path is right!!!!!!\");\n\n    // Read number of weight blobs\n    int32_t count;\n    input >> count;\n    assert(count > 0 && \"Invalid weight map file.\");\n\n    while (count--) {\n        Weights wt{DataType::kFLOAT, nullptr, 0};\n        uint32_t size;\n\n        // Read name and size of blob\n        std::string name;\n        input >> name >> std::dec >> size;\n        wt.type = DataType::kFLOAT;\n\n        // Load blob: each value is stored as a 32-bit hex word\n        uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));\n        for (uint32_t x = 0, y = size; x < y; ++x) {\n            input >> std::hex >> val[x];\n        }\n        wt.values = val;\n\n        wt.count = size;\n        weightMap[name] = wt;\n    }\n\n    return weightMap;\n}\n\nint get_width(int x, float gw, int divisor) {\n    return int(ceil((x * gw) / divisor)) * divisor;\n}\n\nint get_depth(int x, float gd) {\n    if (x == 1)\n        return 1;\n    int r = round(x * gd);\n    if (x * gd - int(x * gd) == 0.5 && (int(x * gd) % 2) == 0) {\n        --r;\n    }\n    return std::max<int>(r, 1);\n}\nstatic nvinfer1::IScaleLayer* addBatchNorm2d(nvinfer1::INetworkDefinition* network,\n                                             std::map<std::string, nvinfer1::Weights>& weightMap,\n                                             nvinfer1::ITensor& input, std::string lname, float eps) {\n    float* gamma = (float*)weightMap[lname + \".weight\"].values;\n    float* beta = (float*)weightMap[lname + \".bias\"].values;\n    float* mean = (float*)weightMap[lname + \".running_mean\"].values;\n    float* var = (float*)weightMap[lname + \".running_var\"].values;\n    int len = weightMap[lname + \".running_var\"].count;\n\n    float* scval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        scval[i] = gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights scale{nvinfer1::DataType::kFLOAT, scval, len};\n\n    float* shval = 
reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        shval[i] = beta[i] - mean[i] * gamma[i] / sqrt(var[i] + eps);\n    }\n    nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, shval, len};\n\n    float* pval = reinterpret_cast<float*>(malloc(sizeof(float) * len));\n    for (int i = 0; i < len; i++) {\n        pval[i] = 1.0;\n    }\n    nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, pval, len};\n    weightMap[lname + \".scale\"] = scale;\n    weightMap[lname + \".shift\"] = shift;\n    weightMap[lname + \".power\"] = power;\n    nvinfer1::IScaleLayer* output = network->addScale(input, nvinfer1::ScaleMode::kCHANNEL, shift, scale, power);\n    assert(output);\n    return output;\n}\nnvinfer1::ILayer* convBnSiLU(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights>& weightMap,\n                             nvinfer1::ITensor& input, int ch, int k, int s, int p, std::string lname, int g) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n    conv->setNbGroups(g);\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n\n    nvinfer1::IActivationLayer* sigmoid = network->addActivation(*bn->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    assert(sigmoid);\n    auto ew = network->addElementWise(*bn->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\nnvinfer1::ILayer* convBnNoAct(nvinfer1::INetworkDefinition* network,\n                              std::map<std::string, nvinfer1::Weights>& weightMap, nvinfer1::ITensor& input, int ch,\n                              
int k, int s, int p, std::string lname, int g) {\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv =\n            network->addConvolutionNd(input, ch, nvinfer1::DimsHW{k, k}, weightMap[lname + \".conv.weight\"], bias_empty);\n    assert(conv);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n    conv->setNbGroups(g);\n\n    nvinfer1::IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + \".bn\", 1e-3);\n    return bn;\n}\n\nstd::vector<std::vector<float>> getAnchors(std::map<std::string, Weights>& weightMap, std::string lname) {\n    std::vector<std::vector<float>> anchors;\n    Weights wts = weightMap[lname + \".anchor_grid\"];\n    int anchor_len = kNumAnchor * 2;\n    for (int i = 0; i < wts.count / anchor_len; i++) {\n        auto* p = (const float*)wts.values + i * anchor_len;\n        std::vector<float> anchor(p, p + anchor_len);\n        anchors.push_back(anchor);\n    }\n    return anchors;\n}\n\nILayer* RepConvN(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2,\n                 int k, int s, int p, int g, int d, bool act, bool bn, bool deploy, std::string lname) {\n    assert(k == 3 && p == 1);\n    ILayer* conv1 = convBnNoAct(network, weightMap, input, c2, k, s, p, lname + \".conv1\", g);\n    ILayer* conv2 = convBnNoAct(network, weightMap, input, c2, 1, s, p - k / 2, lname + \".conv2\", g);\n    ILayer* ew0 = network->addElementWise(*conv1->getOutput(0), *conv2->getOutput(0), ElementWiseOperation::kSUM);\n    nvinfer1::IActivationLayer* sigmoid =\n            network->addActivation(*ew0->getOutput(0), nvinfer1::ActivationType::kSIGMOID);\n    assert(sigmoid);\n\n    auto ew =\n            network->addElementWise(*ew0->getOutput(0), *sigmoid->getOutput(0), nvinfer1::ElementWiseOperation::kPROD);\n    assert(ew);\n    return ew;\n}\n\nILayer* 
RepNBottleneck(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1,\n                       int c2, bool shortcut, int k, int g, float e, std::string lname) {\n    int c_ = int(c2 * e);\n    assert(k == 3 && \"RepVGG only support kernel size 3\");\n    auto cv1 = RepConvN(network, weightMap, input, c1, c_, k, 1, 1, g, 1, true, false, false, lname + \".cv1\");\n    auto cv2 = convBnSiLU(network, weightMap, *cv1->getOutput(0), c2, k, 1, 1, lname + \".cv2\", g);\n    if (shortcut && c1 == c2) {\n        auto ew = network->addElementWise(input, *cv2->getOutput(0), ElementWiseOperation::kSUM);\n        return ew;\n    }\n    return cv2;\n}\n\nILayer* RepNCSP(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2,\n                int n, bool shortcut, int g, float e, std::string lname) {\n    int c_ = int(c2 * e);\n\n    auto cv1 = convBnSiLU(network, weightMap, input, c_, 1, 1, 0, lname + \".cv1\", 1);\n\n    ILayer* m = cv1;\n    for (int i = 0; i < n; i++) {\n        m = RepNBottleneck(network, weightMap, *m->getOutput(0), c_, c_, shortcut, 3, g, 1.0,\n                           lname + \".m.\" + std::to_string(i));\n    }\n\n    // auto m_0 = RepNBottleneck(network, weightMap, *cv1->getOutput(0), c_, c_, shortcut, 3, g, 1.0, lname + \".m.0\");\n    // auto m_1 = RepNBottleneck(network, weightMap, *m_0->getOutput(0), c_, c_, shortcut, 3, g, 1.0, lname + \".m.1\");\n\n    auto cv2 = convBnSiLU(network, weightMap, input, c_, 1, 1, 0, lname + \".cv2\", 1);\n    ITensor* inputTensors[] = {m->getOutput(0), cv2->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 2);\n\n    auto cv3 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, 0, lname + \".cv3\", 1);\n    return cv3;\n}\n\nILayer* ELAN1(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2,\n              int c3, int c4, std::string lname) {\n    
auto cv1 = convBnSiLU(network, weightMap, input, c3, 1, 1, 0, lname + \".cv1\", 1);\n    // chunk(2, 1)\n\n    nvinfer1::Dims d = cv1->getOutput(0)->getDimensions();\n    nvinfer1::ISliceLayer* split1 =\n            network->addSlice(*cv1->getOutput(0), nvinfer1::Dims3{0, 0, 0}, nvinfer1::Dims3{d.d[0] / 2, d.d[1], d.d[2]},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split2 =\n            network->addSlice(*cv1->getOutput(0), nvinfer1::Dims3{d.d[0] / 2, 0, 0},\n                              nvinfer1::Dims3{d.d[0] / 2, d.d[1], d.d[2]}, nvinfer1::Dims3{1, 1, 1});\n    auto cv2 = convBnSiLU(network, weightMap, *split2->getOutput(0), c4, 3, 1, 1, lname + \".cv2\", 1);\n\n    auto cv3 = convBnSiLU(network, weightMap, *cv2->getOutput(0), c4, 3, 1, 1, lname + \".cv3\", 1);\n\n    ITensor* inputTensors[] = {split1->getOutput(0), split2->getOutput(0), cv2->getOutput(0), cv3->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 4);\n    auto cv4 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, 0, lname + \".cv4\", 1);\n    return cv4;\n}\n\nILayer* RepNCSPELAN4(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1,\n                     int c2, int c3, int c4, int c5, std::string lname) {\n\n    auto cv1 = convBnSiLU(network, weightMap, input, c3, 1, 1, 0, lname + \".cv1\", 1);\n    // chunk(2, 1)\n\n    nvinfer1::Dims d = cv1->getOutput(0)->getDimensions();\n    nvinfer1::ISliceLayer* split1 =\n            network->addSlice(*cv1->getOutput(0), nvinfer1::Dims3{0, 0, 0}, nvinfer1::Dims3{d.d[0] / 2, d.d[1], d.d[2]},\n                              nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split2 =\n            network->addSlice(*cv1->getOutput(0), nvinfer1::Dims3{d.d[0] / 2, 0, 0},\n                              nvinfer1::Dims3{d.d[0] / 2, d.d[1], d.d[2]}, nvinfer1::Dims3{1, 1, 1});\n\n    auto cv2_0 = RepNCSP(network, weightMap, *split2->getOutput(0), 
c3 / 2, c4, c5, true, 1, 0.5, lname + \".cv2.0\");\n    auto cv2_1 = convBnSiLU(network, weightMap, *cv2_0->getOutput(0), c4, 3, 1, 1, lname + \".cv2.1\", 1);\n\n    auto cv3_0 = RepNCSP(network, weightMap, *cv2_1->getOutput(0), c4, c4, c5, true, 1, 0.5, lname + \".cv3.0\");\n    auto cv3_1 = convBnSiLU(network, weightMap, *cv3_0->getOutput(0), c4, 3, 1, 1, lname + \".cv3.1\", 1);\n\n    ITensor* inputTensors[] = {split1->getOutput(0), split2->getOutput(0), cv2_1->getOutput(0), cv3_1->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 4);\n    auto cv4 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, 0, lname + \".cv4\", 1);\n    return cv4;\n}\n\nILayer* AConv(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c2,\n              std::string lname) {\n    auto pool = network->addPoolingNd(input, PoolingType::kAVERAGE, DimsHW{2, 2});\n    pool->setStrideNd(DimsHW{1, 1});\n    pool->setPaddingNd(DimsHW{0, 0});\n    auto cv1 = convBnSiLU(network, weightMap, *pool->getOutput(0), c2, 3, 2, 1, lname + \".cv1\", 1);\n    return cv1;\n}\nILayer* ADown(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c2,\n              std::string lname) {\n    int c_ = c2 / 2;\n    auto pool = network->addPoolingNd(input, PoolingType::kAVERAGE, DimsHW{2, 2});\n    pool->setStrideNd(DimsHW{1, 1});\n    pool->setPaddingNd(DimsHW{0, 0});\n\n    nvinfer1::Dims d = pool->getOutput(0)->getDimensions();\n    nvinfer1::ISliceLayer* split1 =\n            network->addSlice(*pool->getOutput(0), nvinfer1::Dims3{0, 0, 0},\n                              nvinfer1::Dims3{d.d[0] / 2, d.d[1], d.d[2]}, nvinfer1::Dims3{1, 1, 1});\n    nvinfer1::ISliceLayer* split2 =\n            network->addSlice(*pool->getOutput(0), nvinfer1::Dims3{d.d[0] / 2, 0, 0},\n                              nvinfer1::Dims3{d.d[0] / 2, d.d[1], d.d[2]}, nvinfer1::Dims3{1, 1, 1});\n\n    // auto chunklayer = 
layer_split(1, pool->getOutput(0), network);\n    auto cv1 = convBnSiLU(network, weightMap, *split1->getOutput(0), c_, 3, 2, 1, lname + \".cv1\", 1);\n\n    auto pool2 = network->addPoolingNd(*split2->getOutput(0), PoolingType::kMAX, DimsHW{3, 3});\n    pool2->setStrideNd(DimsHW{2, 2});\n    pool2->setPaddingNd(DimsHW{1, 1});\n    auto cv2 = convBnSiLU(network, weightMap, *pool2->getOutput(0), c_, 1, 1, 0, lname + \".cv2\", 1);\n\n    ITensor* inputTensors[] = {cv1->getOutput(0), cv2->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 2);\n    return cat;\n}\n\nstd::vector<ILayer*> CBLinear(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,\n                              std::vector<int> c2s, int k, int s, int p, int g, std::string lname) {\n\n    IConvolutionLayer* conv1 =\n            network->addConvolutionNd(input, std::accumulate(c2s.begin(), c2s.end(), 0), DimsHW{k, k},\n                                      weightMap[lname + \".conv.weight\"], weightMap[lname + \".conv.bias\"]);\n    assert(conv1);\n    conv1->setName((lname + \".conv\").c_str());\n    conv1->setStrideNd(DimsHW{s, s});\n    conv1->setPaddingNd(DimsHW{p, p});\n\n    int h = input.getDimensions().d[1];\n    int w = input.getDimensions().d[2];\n    std::vector<ILayer*> slices(c2s.size());\n    int start = 0;\n    for (int i = 0; i < c2s.size(); i++) {\n        slices[i] = network->addSlice(*conv1->getOutput(0), Dims3{start, 0, 0}, Dims3{c2s[i], h, w}, Dims3{1, 1, 1});\n        start += c2s[i];\n    }\n    return slices;\n}\n\nILayer* CBFuse(INetworkDefinition* network, std::vector<std::vector<ILayer*>> input, std::vector<int> idx,\n               std::vector<int> strides) {\n    ILayer** res = new ILayer*[input.size()];\n    res[input.size() - 1] = input[input.size() - 1][0];\n\n    for (int i = input.size() - 2; i >= 0; i--) {\n        auto upsample = network->addResize(*input[i][idx[i]]->getOutput(0));\n        
upsample->setResizeMode(ResizeMode::kNEAREST);\n        // Cast before dividing: integer division would truncate fractional scales (e.g. 8 / 32 -> 0).\n        const float scale = static_cast<float>(strides[i]) / static_cast<float>(strides[strides.size() - 1]);\n        const float scales[] = {1.0f, scale, scale};\n        upsample->setScales(scales, 3);\n        res[i] = upsample;\n    }\n\n    for (int i = 1; i < input.size(); i++) {\n        auto ew = network->addElementWise(*res[0]->getOutput(0), *res[i]->getOutput(0), ElementWiseOperation::kSUM);\n        res[0] = ew;\n    }\n    return res[0];\n}\n\nILayer* SP(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int k, int s) {\n    int p = k / 2;\n    auto pool = network->addPoolingNd(input, PoolingType::kMAX, DimsHW{k, k});\n    pool->setPaddingNd(DimsHW{p, p});\n    pool->setStrideNd(DimsHW{s, s});\n    return pool;\n}\n\nILayer* SPPELAN(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c1, int c2,\n                int c3, std::string lname) {\n    auto cv1 = convBnSiLU(network, weightMap, input, c3, 1, 1, 0, lname + \".cv1\", 1);\n    auto cv2 = SP(network, weightMap, *cv1->getOutput(0), 5, 1);\n    auto cv3 = SP(network, weightMap, *cv2->getOutput(0), 5, 1);\n    auto cv4 = SP(network, weightMap, *cv3->getOutput(0), 5, 1);\n\n    ITensor* inputTensors[] = {cv1->getOutput(0), cv2->getOutput(0), cv3->getOutput(0), cv4->getOutput(0)};\n    auto cat = network->addConcatenation(inputTensors, 4);\n    auto cv5 = convBnSiLU(network, weightMap, *cat->getOutput(0), c2, 1, 1, 0, lname + \".cv5\", 1);\n    return cv5;\n}\n\nILayer* DetectBbox_Conv(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c2,\n                        int reg_max, std::string lname) {\n    auto cv_0 = convBnSiLU(network, weightMap, input, c2, 3, 1, 1, lname + \".0\", 1);\n    auto cv_1 = convBnSiLU(network, weightMap, *cv_0->getOutput(0), c2, 3, 1, 1, lname + \".1\", 4);\n    auto cv_2 = network->addConvolutionNd(*cv_1->getOutput(0), reg_max * 4, DimsHW{1, 1},\n         
                                 weightMap[lname + \".2.weight\"], weightMap[lname + \".2.bias\"]);\n    cv_2->setName((lname + \".conv\").c_str());\n    cv_2->setStrideNd(DimsHW{1, 1});\n    cv_2->setPaddingNd(DimsHW{0, 0});\n    cv_2->setNbGroups(4);\n    return cv_2;\n}\n\nILayer* DetectCls_Conv(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input, int c2,\n                       int cls, std::string lname) {\n    auto cv_0 = convBnSiLU(network, weightMap, input, c2, 3, 1, 1, lname + \".0\", 1);\n    auto cv_1 = convBnSiLU(network, weightMap, *cv_0->getOutput(0), c2, 3, 1, 1, lname + \".1\", 1);\n    auto cv_2 = network->addConvolutionNd(*cv_1->getOutput(0), cls, DimsHW{1, 1}, weightMap[lname + \".2.weight\"],\n                                          weightMap[lname + \".2.bias\"]);\n    cv_2->setName((lname + \".conv\").c_str());\n    cv_2->setStrideNd(DimsHW{1, 1});\n    cv_2->setPaddingNd(DimsHW{0, 0});\n    return cv_2;\n}\n\nnvinfer1::IShuffleLayer* DFL(nvinfer1::INetworkDefinition* network, std::map<std::string, nvinfer1::Weights> weightMap,\n                             nvinfer1::ITensor& input, int ch, int k, int s, int p, std::string lname) {\n    auto dim = input.getDimensions();\n    int c = dim.d[0];\n    int grid = dim.d[1] * dim.d[2];\n    int split_num = c / ch;\n\n    nvinfer1::IShuffleLayer* shuffle1 = network->addShuffle(input);\n    shuffle1->setReshapeDimensions(nvinfer1::Dims3{split_num, ch, grid});\n    shuffle1->setSecondTranspose(nvinfer1::Permutation{1, 0, 2});\n    nvinfer1::ISoftMaxLayer* softmax = network->addSoftMax(*shuffle1->getOutput(0));\n    nvinfer1::Weights bias_empty{nvinfer1::DataType::kFLOAT, nullptr, 0};\n    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(*softmax->getOutput(0), 1, nvinfer1::DimsHW{1, 1},\n                                                                  weightMap[lname + \".conv.weight\"], bias_empty);\n    conv->setStrideNd(nvinfer1::DimsHW{s, s});\n    
conv->setPaddingNd(nvinfer1::DimsHW{p, p});\n    nvinfer1::IShuffleLayer* shuffle2 = network->addShuffle(*conv->getOutput(0));\n    shuffle2->setReshapeDimensions(nvinfer1::Dims2{4, grid});\n    return shuffle2;\n}\n\nnvinfer1::IPluginV2Layer* addYoLoLayer(nvinfer1::INetworkDefinition* network,\n                                       std::vector<nvinfer1::IConcatenationLayer*> dets, bool is_segmentation) {\n    auto creator = getPluginRegistry()->getPluginCreator(\"YoloLayer_TRT\", \"1\");\n\n    nvinfer1::PluginField plugin_fields[1];\n    int netinfo[5] = {kNumClass, kInputW, kInputH, kMaxNumOutputBbox, is_segmentation};\n    plugin_fields[0].data = netinfo;\n    plugin_fields[0].length = 5;\n    plugin_fields[0].name = \"netinfo\";\n    plugin_fields[0].type = nvinfer1::PluginFieldType::kFLOAT32;\n\n    nvinfer1::PluginFieldCollection plugin_data;\n    plugin_data.nbFields = 1;\n    plugin_data.fields = plugin_fields;\n    nvinfer1::IPluginV2* plugin_obj = creator->createPlugin(\"yololayer\", &plugin_data);\n    std::vector<nvinfer1::ITensor*> input_tensors;\n    for (auto det : dets) {\n        input_tensors.push_back(det->getOutput(0));\n    }\n    auto yolo = network->addPluginV2(&input_tensors[0], input_tensors.size(), *plugin_obj);\n    return yolo;\n}\n\nstd::vector<IConcatenationLayer*> DualDDetect(INetworkDefinition* network, std::map<std::string, Weights>& weightMap,\n                                              std::vector<ILayer*> dets, int cls, std::vector<int> ch,\n                                              std::string lname) {\n    int c2 = std::max(int(ch[0] / 4), int(16 * 4));\n    int c3 = std::max(ch[0], std::min(cls * 2, 128));\n    int reg_max = 16;\n\n    std::vector<ILayer*> bboxlayers;\n    std::vector<ILayer*> clslayers;\n\n    for (int i = 0; i < dets.size(); i++) {\n        // Conv(x, c2, 3), Conv(c2, c2, 3, g=4), nn.Conv2d(c2, 4 * self.reg_max, 1, groups=4)\n        bboxlayers.push_back(DetectBbox_Conv(network, weightMap, 
*dets[i]->getOutput(0), c2, reg_max,\n                                             lname + \".cv2.\" + std::to_string(i)));\n        // Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, self.nc, 1)\n        auto cls_layer = DetectCls_Conv(network, weightMap, *dets[i]->getOutput(0), c3, cls,\n                                        lname + \".cv3.\" + std::to_string(i));\n        auto dim = cls_layer->getOutput(0)->getDimensions();\n        nvinfer1::IShuffleLayer* shuffle = network->addShuffle(*cls_layer->getOutput(0));\n        shuffle->setReshapeDimensions(nvinfer1::Dims2{kNumClass, dim.d[1] * dim.d[2]});\n        clslayers.push_back(shuffle);\n    }\n\n    std::vector<IConcatenationLayer*> ret;\n    for (int i = 0; i < dets.size(); i++) {\n        // softmax 16*4, w, h => 16, 4, w, h\n        auto loc = DFL(network, weightMap, *bboxlayers[i]->getOutput(0), 16, 1, 1, 0, lname + \".dfl\");\n        nvinfer1::ITensor* inputTensor[] = {loc->getOutput(0), clslayers[i]->getOutput(0)};\n        ret.push_back(network->addConcatenation(inputTensor, 2));\n    }\n    return ret;\n}\n\nstd::vector<IConcatenationLayer*> DDetect(INetworkDefinition* network, std::map<std::string, Weights>& weightMap,\n                                          std::vector<ILayer*> dets, int cls, std::vector<int> ch, std::string lname) {\n    int c2 = std::max(int(ch[0] / 4), int(16 * 4));\n    //  max((ch[0], min((self.nc * 2, 128))))\n    // int c3 = std::max(ch[0], std::min(cls * 2, 128));\n    int c3 = std::max(ch[0], std::min(cls, 128));\n    int reg_max = 16;\n\n    std::vector<ILayer*> bboxlayers;\n    std::vector<ILayer*> clslayers;\n\n    for (int i = 0; i < dets.size(); i++) {\n        // Conv(x, c2, 3), Conv(c2, c2, 3, g=4), nn.Conv2d(c2, 4 * self.reg_max, 1, groups=4)\n        bboxlayers.push_back(DetectBbox_Conv(network, weightMap, *dets[i]->getOutput(0), c2, reg_max,\n                                             lname + \".cv2.\" + std::to_string(i)));\n        // Conv(x, c2, 3), 
Conv(c2, c2, 3), nn.Conv2d(c2, self.nc, 1)\n        auto cls_layer = DetectCls_Conv(network, weightMap, *dets[i]->getOutput(0), c3, cls,\n                                        lname + \".cv3.\" + std::to_string(i));\n        auto dim = cls_layer->getOutput(0)->getDimensions();\n        nvinfer1::IShuffleLayer* shuffle = network->addShuffle(*cls_layer->getOutput(0));\n        shuffle->setReshapeDimensions(nvinfer1::Dims2{kNumClass, dim.d[1] * dim.d[2]});\n        clslayers.push_back(shuffle);\n    }\n\n    std::vector<IConcatenationLayer*> ret;\n    for (int i = 0; i < dets.size(); i++) {\n        // softmax 16*4, w, h => 16, 4, w, h\n        auto loc = DFL(network, weightMap, *bboxlayers[i]->getOutput(0), 16, 1, 1, 0, lname + \".dfl\");\n        nvinfer1::ITensor* inputTensor[] = {loc->getOutput(0), clslayers[i]->getOutput(0)};\n        ret.push_back(network->addConcatenation(inputTensor, 2));\n    }\n    return ret;\n}\n"
  },
  {
    "path": "yolov9/src/calibrator.cpp",
    "content": "#include \"calibrator.h\"\n#include \"cuda_utils.h\"\n#include \"utils.h\"\n\n#include <fstream>\n#include <iostream>\n#include <iterator>\n#include <opencv2/dnn/dnn.hpp>\n#include <opencv2/opencv.hpp>\nstatic cv::Mat preprocess_img(cv::Mat& img, int input_w, int input_h) {\n    int w, h, x, y;\n    float r_w = input_w / (img.cols * 1.0);\n    float r_h = input_h / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = input_w;\n        h = r_w * img.rows;\n        x = 0;\n        y = (input_h - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = input_h;\n        x = (input_w - w) / 2;\n        y = 0;\n    }\n    cv::Mat re(h, w, CV_8UC3);\n    cv::resize(img, re, re.size(), 0, 0, cv::INTER_LINEAR);\n    cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));\n    re.copyTo(out(cv::Rect(x, y, re.cols, re.rows)));\n    return out;\n}\n\nInt8EntropyCalibrator2::Int8EntropyCalibrator2(int batchsize, int input_w, int input_h, const char* img_dir,\n                                               const char* calib_table_name, const char* input_blob_name,\n                                               bool read_cache)\n    : batchsize_(batchsize),\n      input_w_(input_w),\n      input_h_(input_h),\n      img_idx_(0),\n      img_dir_(img_dir),\n      calib_table_name_(calib_table_name),\n      input_blob_name_(input_blob_name),\n      read_cache_(read_cache) {\n    input_count_ = 3 * input_w * input_h * batchsize;\n    CUDA_CHECK(cudaMalloc(&device_input_, input_count_ * sizeof(float)));\n    read_files_in_dir(img_dir, img_files_);\n}\n\nInt8EntropyCalibrator2::~Int8EntropyCalibrator2() {\n    CUDA_CHECK(cudaFree(device_input_));\n}\n\nint Int8EntropyCalibrator2::getBatchSize() const TRT_NOEXCEPT {\n    return batchsize_;\n}\n\nbool Int8EntropyCalibrator2::getBatch(void* bindings[], const char* names[], int nbBindings) TRT_NOEXCEPT {\n    if (img_idx_ + batchsize_ > (int)img_files_.size()) {\n        return false;\n    }\n\n    
std::vector<cv::Mat> input_imgs_;\n    for (int i = img_idx_; i < img_idx_ + batchsize_; i++) {\n        std::cout << img_files_[i] << \"  \" << i << std::endl;\n        cv::Mat temp = cv::imread(img_dir_ + img_files_[i]);\n        if (temp.empty()) {\n            std::cerr << \"Fatal error: image cannot open!\" << std::endl;\n            return false;\n        }\n        cv::Mat pr_img = preprocess_img(temp, input_w_, input_h_);\n        input_imgs_.push_back(pr_img);\n    }\n    img_idx_ += batchsize_;\n    cv::Mat blob = cv::dnn::blobFromImages(input_imgs_, 1.0 / 255.0, cv::Size(input_w_, input_h_), cv::Scalar(0, 0, 0),\n                                           true, false);\n\n    CUDA_CHECK(cudaMemcpy(device_input_, blob.ptr<float>(0), input_count_ * sizeof(float), cudaMemcpyHostToDevice));\n    assert(!strcmp(names[0], input_blob_name_));\n    bindings[0] = device_input_;\n    return true;\n}\n\nconst void* Int8EntropyCalibrator2::readCalibrationCache(size_t& length) TRT_NOEXCEPT {\n    std::cout << \"reading calib cache: \" << calib_table_name_ << std::endl;\n    calib_cache_.clear();\n    std::ifstream input(calib_table_name_, std::ios::binary);\n    input >> std::noskipws;\n    if (read_cache_ && input.good()) {\n        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(calib_cache_));\n    }\n    length = calib_cache_.size();\n    return length ? calib_cache_.data() : nullptr;\n}\n\nvoid Int8EntropyCalibrator2::writeCalibrationCache(const void* cache, size_t length) TRT_NOEXCEPT {\n    std::cout << \"writing calib cache: \" << calib_table_name_ << \" size: \" << length << std::endl;\n    std::ofstream output(calib_table_name_, std::ios::binary);\n    output.write(reinterpret_cast<const char*>(cache), length);\n}\n"
  },
  {
    "path": "yolov9/src/model.cpp",
    "content": "#include \"model.h\"\n#include <cassert>\n#include <cmath>\n#include <cstring>\n#include <fstream>\n#include <iostream>\n#include <map>\n#include \"block.h\"\n#include \"calibrator.h\"\n#include \"config.h\"\n#include \"yololayer.h\"\n\nusing namespace nvinfer1;\n#ifdef USE_INT8\nvoid Calibrator(IBuilder* builder, IBuilderConfig* config) {\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(BuilderFlag::kINT8);\n    Int8EntropyCalibrator2* calibrator =\n            new Int8EntropyCalibrator2(1, kInputW, kInputH, gCalibTablePath, \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n}\n#endif\n\nIHostMemory* build_engine_yolov9_t(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt,\n                                   std::string& wts_name, bool isConvert) {\n    /* ------ Create the builder ------ */\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{3, kInputH, kInputW});\n    assert(data);\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n    // # conv down\n    auto conv_1 = convBnSiLU(network, weightMap, *data, 16, 3, 2, 1, \"model.0\", 1);\n    // # conv down\n    auto conv_2 = convBnSiLU(network, weightMap, *conv_1->getOutput(0), 32, 3, 2, 1, \"model.1\");\n    // # elan-1 block\n    auto repncspelan_3 = ELAN1(network, weightMap, *conv_2->getOutput(0), 32, 32, 32, 16, \"model.2\");\n    // # avg-conv down\n    // [-1, 1, ADown, [256]],  # 4-P3/8\n    auto adown_4 = AConv(network, weightMap, *repncspelan_3->getOutput(0), 64, \"model.3\");\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 256, 128, 1]],  # 5\n    auto repncspelan_5 = RepNCSPELAN4(network, weightMap, *adown_4->getOutput(0), 64, 64, 64, 32, 3, \"model.4\");\n    // # 
avg-conv down\n    // [-1, 1, ADown, [512]],  # 6-P4/16\n    auto adown_6 = AConv(network, weightMap, *repncspelan_5->getOutput(0), 96, \"model.5\");\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 7\n    auto repncspelan_7 = RepNCSPELAN4(network, weightMap, *adown_6->getOutput(0), 96, 96, 96, 48, 3, \"model.6\");\n    // # avg-conv down\n    // [-1, 1, ADown, [512]],  # 8-P5/32\n    auto adown_8 = AConv(network, weightMap, *repncspelan_7->getOutput(0), 128, \"model.7\");\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 9\n    auto repncspelan_9 = RepNCSPELAN4(network, weightMap, *adown_8->getOutput(0), 128, 128, 128, 64, 3, \"model.8\");\n    // # elan-spp block\n    // [-1, 1, SPPELAN, [512, 256]],  # 10\n    auto sppelan_10 = SPPELAN(network, weightMap, *repncspelan_9->getOutput(0), 128, 128, 64, \"model.9\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_11 = network->addResize(*sppelan_10->getOutput(0));\n    upsample_11->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_11[] = {1.0, 2.0, 2.0};\n    upsample_11->setScales(scales_11, 3);\n    // [[-1, 7], 1, Concat, [1]],  # cat backbone P4\n    ITensor* input_tensor_12[] = {upsample_11->getOutput(0), repncspelan_7->getOutput(0)};\n    auto cat_12 = network->addConcatenation(input_tensor_12, 2);\n\n    // # elan-2 block\n    auto repncspelan_13 = RepNCSPELAN4(network, weightMap, *cat_12->getOutput(0), 288, 96, 96, 48, 3, \"model.12\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_14 = network->addResize(*repncspelan_13->getOutput(0));\n    upsample_14->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_14[] = {1.0, 2.0, 2.0};\n    upsample_14->setScales(scales_14, 3);\n    // [[-1, 5], 1, Concat, [1]],  # cat backbone P3\n    ITensor* input_tensor_15[] = {upsample_14->getOutput(0), repncspelan_5->getOutput(0)};\n    auto cat_15 = 
network->addConcatenation(input_tensor_15, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [256, 256, 128, 1]],  # 16 (P3/8-small)\n    auto repncspelan_16 = RepNCSPELAN4(network, weightMap, *cat_15->getOutput(0), 192, 64, 64, 32, 3, \"model.15\");\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [256]],\n    auto adown_17 = AConv(network, weightMap, *repncspelan_16->getOutput(0), 48, \"model.16\");\n    // [[-1, 13], 1, Concat, [1]],  # cat head P4\n    ITensor* input_tensor_18[] = {adown_17->getOutput(0), repncspelan_13->getOutput(0)};\n    auto cat_18 = network->addConcatenation(input_tensor_18, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 19 (P4/16-medium)\n    auto repncspelan_19 = RepNCSPELAN4(network, weightMap, *cat_18->getOutput(0), 144, 96, 96, 48, 3, \"model.18\");\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [512]],\n    auto adown_20 = AConv(network, weightMap, *repncspelan_19->getOutput(0), 64, \"model.19\");\n    // [[-1, 10], 1, Concat, [1]],  # cat head P5\n    ITensor* input_tensor_21[] = {adown_20->getOutput(0), sppelan_10->getOutput(0)};\n    auto cat_21 = network->addConcatenation(input_tensor_21, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 22 (P5/32-large)\n    auto repncspelan_22 = RepNCSPELAN4(network, weightMap, *cat_21->getOutput(0), 256, 128, 128, 64, 3, \"model.21\");\n\n    std::vector<IConcatenationLayer*> head;\n    if (!isConvert) {\n        // # elan-spp block\n        auto sppelan_23 = SPPELAN(network, weightMap, *repncspelan_9->getOutput(0), 512, 128, 64, \"model.22\");\n\n        // # up-concat merge\n        auto upsample_24 = network->addResize(*sppelan_23->getOutput(0));\n        upsample_24->setResizeMode(ResizeMode::kNEAREST);\n        const float scales_24[] = {1.0, 2.0, 2.0};\n        upsample_24->setScales(scales_24, 3);\n        // [[-1, 6], 1, Concat, [1]],  # cat backbone P4\n        ITensor* input_tensor_25[] = 
{upsample_24->getOutput(0), repncspelan_7->getOutput(0)};\n        auto cat_25 = network->addConcatenation(input_tensor_25, 2);\n\n        // # elan-2 block\n        auto repncspelan_26 = RepNCSPELAN4(network, weightMap, *cat_25->getOutput(0), 384, 96, 96, 48, 3, \"model.25\");\n\n        // # up-concat merge\n        auto upsample_27 = network->addResize(*repncspelan_26->getOutput(0));\n        upsample_27->setResizeMode(ResizeMode::kNEAREST);\n        const float scales_27[] = {1.0, 2.0, 2.0};\n        upsample_27->setScales(scales_27, 3);\n        // [[-1, 4], 1, Concat, [1]],  # cat backbone P3\n        ITensor* input_tensor_28[] = {upsample_27->getOutput(0), repncspelan_5->getOutput(0)};\n        auto cat_28 = network->addConcatenation(input_tensor_28, 2);\n\n        // # elan-2 block\n        auto repncspelan_29 = RepNCSPELAN4(network, weightMap, *cat_28->getOutput(0), 256, 64, 64, 32, 3, \"model.28\");\n        head = DualDDetect(network, weightMap, std::vector<ILayer*>{repncspelan_16, repncspelan_19, repncspelan_22},\n                           kNumClass, {64, 96, 128}, \"model.29\");\n    } else {\n        head = DDetect(network, weightMap, std::vector<ILayer*>{repncspelan_16, repncspelan_19, repncspelan_22},\n                       kNumClass, {64, 96, 128}, \"model.22\");\n    }\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(network, head, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator =\n            new Int8EntropyCalibrator2(1, kInputW, kInputH, gCalibTablePath, \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\n\nIHostMemory* build_engine_yolov9_s(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt,\n                                   std::string& wts_name, bool isConvert) {\n    /* ------ Create the builder ------ */\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{3, kInputH, kInputW});\n    assert(data);\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n    // # conv down\n    auto conv_1 = convBnSiLU(network, weightMap, *data, 32, 3, 2, 1, \"model.0\", 1);\n    // # conv down\n    auto conv_2 = convBnSiLU(network, weightMap, *conv_1->getOutput(0), 64, 3, 2, 1, \"model.1\");\n    // # elan-1 block\n    auto repncspelan_3 = ELAN1(network, weightMap, *conv_2->getOutput(0), 32, 64, 64, 32, \"model.2\");\n    // # avg-conv down\n    auto adown_4 = AConv(network, weightMap, *repncspelan_3->getOutput(0), 128, \"model.3\");\n    // # elan-2 block\n    auto repncspelan_5 = RepNCSPELAN4(network, weightMap, *adown_4->getOutput(0), 128, 128, 128, 64, 3, \"model.4\");\n    // # avg-conv down\n    auto adown_6 = AConv(network, weightMap, *repncspelan_5->getOutput(0), 192, \"model.5\");\n    // # elan-2 block\n    auto 
repncspelan_7 = RepNCSPELAN4(network, weightMap, *adown_6->getOutput(0), 192, 192, 192, 96, 3, \"model.6\");\n    // # avg-conv down\n    auto adown_8 = AConv(network, weightMap, *repncspelan_7->getOutput(0), 256, \"model.7\");\n    // # elan-2 block\n    auto repncspelan_9 = RepNCSPELAN4(network, weightMap, *adown_8->getOutput(0), 256, 256, 256, 128, 3, \"model.8\");\n    // # elan-spp block\n    auto sppelan_10 = SPPELAN(network, weightMap, *repncspelan_9->getOutput(0), 512, 256, 128, \"model.9\");\n\n    // # up-concat merge\n    auto upsample_11 = network->addResize(*sppelan_10->getOutput(0));\n    upsample_11->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_11[] = {1.0, 2.0, 2.0};\n    upsample_11->setScales(scales_11, 3);\n    // [[-1, 7], 1, Concat, [1]],  # cat backbone P4\n    ITensor* input_tensor_12[] = {upsample_11->getOutput(0), repncspelan_7->getOutput(0)};\n    auto cat_12 = network->addConcatenation(input_tensor_12, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 13\n    auto repncspelan_13 = RepNCSPELAN4(network, weightMap, *cat_12->getOutput(0), 192, 192, 192, 96, 3, \"model.12\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_14 = network->addResize(*repncspelan_13->getOutput(0));\n    upsample_14->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_14[] = {1.0, 2.0, 2.0};\n    upsample_14->setScales(scales_14, 3);\n    // [[-1, 5], 1, Concat, [1]],  # cat backbone P3\n    ITensor* input_tensor_15[] = {upsample_14->getOutput(0), repncspelan_5->getOutput(0)};\n    auto cat_15 = network->addConcatenation(input_tensor_15, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [256, 256, 128, 1]],  # 16 (P3/8-small)\n    auto repncspelan_16 = RepNCSPELAN4(network, weightMap, *cat_15->getOutput(0), 128, 128, 128, 64, 3, \"model.15\");\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [256]],\n    auto adown_17 = AConv(network, 
weightMap, *repncspelan_16->getOutput(0), 96, \"model.16\");\n    // [[-1, 13], 1, Concat, [1]],  # cat head P4\n    ITensor* input_tensor_18[] = {adown_17->getOutput(0), repncspelan_13->getOutput(0)};\n    auto cat_18 = network->addConcatenation(input_tensor_18, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 19 (P4/16-medium)\n    auto repncspelan_19 = RepNCSPELAN4(network, weightMap, *cat_18->getOutput(0), 768, 192, 192, 96, 3, \"model.18\");\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [512]],\n    auto adown_20 = AConv(network, weightMap, *repncspelan_19->getOutput(0), 128, \"model.19\");\n    // [[-1, 10], 1, Concat, [1]],  # cat head P5\n    ITensor* input_tensor_21[] = {adown_20->getOutput(0), sppelan_10->getOutput(0)};\n    auto cat_21 = network->addConcatenation(input_tensor_21, 2);\n\n    // # elan-2 block\n    auto repncspelan_22 = RepNCSPELAN4(network, weightMap, *cat_21->getOutput(0), 1024, 256, 256, 128, 1, \"model.21\");\n    std::vector<IConcatenationLayer*> head;\n    if (!isConvert) {\n        // # elan-spp block\n        auto sppelan_23 = SPPELAN(network, weightMap, *repncspelan_9->getOutput(0), 512, 256, 128, \"model.22\");\n\n        // # up-concat merge\n        auto upsample_24 = network->addResize(*sppelan_23->getOutput(0));\n        upsample_24->setResizeMode(ResizeMode::kNEAREST);\n        const float scales_24[] = {1.0, 2.0, 2.0};\n        upsample_24->setScales(scales_24, 3);\n        // [[-1, 6], 1, Concat, [1]],  # cat backbone P4\n        ITensor* input_tensor_25[] = {upsample_24->getOutput(0), repncspelan_7->getOutput(0)};\n        auto cat_25 = network->addConcatenation(input_tensor_25, 2);\n\n        // # elan-2 block\n        auto repncspelan_26 = RepNCSPELAN4(network, weightMap, *cat_25->getOutput(0), 384, 192, 192, 96, 3, \"model.25\");\n\n        // # up-concat merge\n        auto upsample_27 = network->addResize(*repncspelan_26->getOutput(0));\n        
upsample_27->setResizeMode(ResizeMode::kNEAREST);\n        const float scales_27[] = {1.0, 2.0, 2.0};\n        upsample_27->setScales(scales_27, 3);\n        // [[-1, 4], 1, Concat, [1]],  # cat backbone P3\n        ITensor* input_tensor_28[] = {upsample_27->getOutput(0), repncspelan_5->getOutput(0)};\n        auto cat_28 = network->addConcatenation(input_tensor_28, 2);\n\n        // # elan-2 block\n        auto repncspelan_29 = RepNCSPELAN4(network, weightMap, *cat_28->getOutput(0), 256, 128, 128, 64, 3, \"model.28\");\n        head = DualDDetect(network, weightMap, std::vector<ILayer*>{repncspelan_16, repncspelan_19, repncspelan_22},\n                           kNumClass, {128, 192, 256}, \"model.29\");\n    } else {\n        head = DDetect(network, weightMap, std::vector<ILayer*>{repncspelan_16, repncspelan_19, repncspelan_22},\n                       kNumClass, {128, 192, 256}, \"model.22\");\n    }\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(network, head, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator =\n            new Int8EntropyCalibrator2(1, kInputW, kInputH, gCalibTablePath, \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\nIHostMemory* build_engine_yolov9_m(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt,\n                                   std::string& wts_name, bool isConvert) {\n    /* ------ Create the builder ------ */\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{3, kInputH, kInputW});\n    assert(data);\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n    int begin = isConvert ? 
0 : 1;\n\n    // # conv down\n    // [-1, 1, Conv, [64, 3, 2]],  # 1-P1/2\n    auto conv_1 = convBnSiLU(network, weightMap, *data, 32, 3, 2, 1, \"model.\" + std::to_string(begin), 1);\n    begin += 1;\n    // # conv down\n    // [-1, 1, Conv, [128, 3, 2]],  # 2-P2/4\n    auto conv_2 = convBnSiLU(network, weightMap, *conv_1->getOutput(0), 64, 3, 2, 1, \"model.\" + std::to_string(begin));\n    begin += 1;\n    // # elan-1 block\n    // [-1, 1, RepNCSPELAN4, [256, 128, 64, 1]],  # 3\n    auto repncspelan_3 = RepNCSPELAN4(network, weightMap, *conv_2->getOutput(0), 128, 128, 128, 64, 1,\n                                      \"model.\" + std::to_string(begin));\n    begin += 1;\n    // # avg-conv down\n    // [-1, 1, ADown, [256]],  # 4-P3/8\n    auto adown_4 = AConv(network, weightMap, *repncspelan_3->getOutput(0), 240, \"model.\" + std::to_string(begin));\n    begin += 1;\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 256, 128, 1]],  # 5\n    auto repncspelan_5 = RepNCSPELAN4(network, weightMap, *adown_4->getOutput(0), 256, 240, 240, 120, 1,\n                                      \"model.\" + std::to_string(begin));\n    begin += 1;\n    // # avg-conv down\n    // [-1, 1, ADown, [512]],  # 6-P4/16\n    auto adown_6 = AConv(network, weightMap, *repncspelan_5->getOutput(0), 360, \"model.\" + std::to_string(begin));\n    begin += 1;\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 7\n    auto repncspelan_7 = RepNCSPELAN4(network, weightMap, *adown_6->getOutput(0), 512, 360, 360, 180, 1,\n                                      \"model.\" + std::to_string(begin));\n    begin += 1;\n    // # avg-conv down\n    // [-1, 1, ADown, [512]],  # 8-P5/32\n    auto adown_8 = AConv(network, weightMap, *repncspelan_7->getOutput(0), 480, \"model.\" + std::to_string(begin));\n    begin += 1;\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 9\n    auto repncspelan_9 = RepNCSPELAN4(network, weightMap, 
*adown_8->getOutput(0), 512, 480, 480, 240, 1,\n                                      \"model.\" + std::to_string(begin));\n    begin += 1;\n    // # elan-spp block\n    // [-1, 1, SPPELAN, [512, 256]],  # 10\n    auto sppelan_10 =\n            SPPELAN(network, weightMap, *repncspelan_9->getOutput(0), 512, 480, 240, \"model.\" + std::to_string(begin));\n    begin += 3;\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_11 = network->addResize(*sppelan_10->getOutput(0));\n    upsample_11->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_11[] = {1.0, 2.0, 2.0};\n    upsample_11->setScales(scales_11, 3);\n    // [[-1, 7], 1, Concat, [1]],  # cat backbone P4\n    ITensor* input_tensor_12[] = {upsample_11->getOutput(0), repncspelan_7->getOutput(0)};\n    auto cat_12 = network->addConcatenation(input_tensor_12, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 13\n    auto repncspelan_13 = RepNCSPELAN4(network, weightMap, *cat_12->getOutput(0), 1536, 360, 360, 180, 1,\n                                       \"model.\" + std::to_string(begin));\n    begin += 3;\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_14 = network->addResize(*repncspelan_13->getOutput(0));\n    upsample_14->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_14[] = {1.0, 2.0, 2.0};\n    upsample_14->setScales(scales_14, 3);\n    // [[-1, 5], 1, Concat, [1]],  # cat backbone P3\n    ITensor* input_tensor_15[] = {upsample_14->getOutput(0), repncspelan_5->getOutput(0)};\n    auto cat_15 = network->addConcatenation(input_tensor_15, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [256, 256, 128, 1]],  # 16 (P3/8-small)\n    auto repncspelan_16 = RepNCSPELAN4(network, weightMap, *cat_15->getOutput(0), 1024, 240, 240, 120, 1,\n                                       \"model.\" + std::to_string(begin));\n    begin += 1;\n\n    // # avg-conv-down 
merge\n    // [-1, 1, ADown, [256]],\n    auto adown_17 = AConv(network, weightMap, *repncspelan_16->getOutput(0), 184, \"model.\" + std::to_string(begin));\n    begin += 2;\n    // [[-1, 13], 1, Concat, [1]],  # cat head P4\n    ITensor* input_tensor_18[] = {adown_17->getOutput(0), repncspelan_13->getOutput(0)};\n    auto cat_18 = network->addConcatenation(input_tensor_18, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 19 (P4/16-medium)\n    auto repncspelan_19 = RepNCSPELAN4(network, weightMap, *cat_18->getOutput(0), 768, 360, 360, 180, 1,\n                                       \"model.\" + std::to_string(begin));\n    begin += 1;\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [512]],\n    auto adown_20 = AConv(network, weightMap, *repncspelan_19->getOutput(0), 240, \"model.\" + std::to_string(begin));\n    begin += 2;\n    // [[-1, 10], 1, Concat, [1]],  # cat head P5\n    ITensor* input_tensor_21[] = {adown_20->getOutput(0), sppelan_10->getOutput(0)};\n    auto cat_21 = network->addConcatenation(input_tensor_21, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 22 (P5/32-large)\n    auto repncspelan_22 = RepNCSPELAN4(network, weightMap, *cat_21->getOutput(0), 1024, 480, 480, 240, 1,\n                                       \"model.\" + std::to_string(begin));\n    begin += 1;\n    std::vector<IConcatenationLayer*> head;\n    if (!isConvert) {\n        // # routing\n        // [5, 1, CBLinear, [[256]]], # 23\n        auto cblinear_23 = CBLinear(network, weightMap, *repncspelan_5->getOutput(0), {240}, 1, 1, 0, 1,\n                                    \"model.\" + std::to_string(begin));\n        begin += 1;\n        // [7, 1, CBLinear, [[256, 512]]], # 24\n        auto cblinear_24 = CBLinear(network, weightMap, *repncspelan_7->getOutput(0), {240, 360}, 1, 1, 0, 1,\n                                    \"model.\" + std::to_string(begin));\n        begin += 1;\n        // [9, 1, CBLinear, 
[[256, 512, 512]]], # 25\n        auto cblinear_25 = CBLinear(network, weightMap, *repncspelan_9->getOutput(0), {240, 360, 480}, 1, 1, 0, 1,\n                                    \"model.\" + std::to_string(begin));\n        begin += 1;\n\n        // # conv down\n        // [0, 1, Conv, [64, 3, 2]],  # 26-P1/2\n        auto conv_26 = convBnSiLU(network, weightMap, *data, 32, 3, 2, 1, \"model.\" + std::to_string(begin), 1);\n        begin += 1;\n\n        // # conv down\n        // [-1, 1, Conv, [128, 3, 2]],  # 27-P2/4\n        auto conv_27 =\n                convBnSiLU(network, weightMap, *conv_26->getOutput(0), 64, 3, 2, 1, \"model.\" + std::to_string(begin));\n        begin += 1;\n\n        // # elan-1 block\n        // [-1, 1, RepNCSPELAN4, [256, 128, 64, 1]],  # 28\n        auto repncspelan_28 = RepNCSPELAN4(network, weightMap, *conv_27->getOutput(0), 128, 128, 128, 64, 1,\n                                           \"model.\" + std::to_string(begin));\n        begin += 1;\n\n        // # avg-conv down fuse\n        // [-1, 1, ADown, [256]],  # 29-P3/8\n        auto adown_29 = AConv(network, weightMap, *repncspelan_28->getOutput(0), 240, \"model.\" + std::to_string(begin));\n        begin += 2;\n        // [[23, 24, 25, -1], 1, CBFuse, [[0, 0, 0]]], # 30\n        auto cbfuse = CBFuse(network, {cblinear_23, cblinear_24, cblinear_25, std::vector<ILayer*>{adown_29}},\n                             {0, 0, 0, 0}, {8, 16, 32, 8});\n\n        // # elan-2 block\n        // [-1, 1, RepNCSPELAN4, [512, 256, 128, 1]],  # 31\n        auto repncspelan_31 = RepNCSPELAN4(network, weightMap, *cbfuse->getOutput(0), 256, 240, 240, 120, 1,\n                                           \"model.\" + std::to_string(begin));\n        begin += 1;\n\n        // # avg-conv down fuse\n        // [-1, 1, ADown, [512]],  # 32-P4/16\n        auto adown_32 = AConv(network, weightMap, *repncspelan_31->getOutput(0), 360, \"model.\" + std::to_string(begin));\n        begin += 2;\n        // [[24, 
25, -1], 1, CBFuse, [[1, 1]]], # 33\n        auto cbfuse_33 =\n                CBFuse(network, {cblinear_24, cblinear_25, std::vector<ILayer*>{adown_32}}, {1, 1, 0}, {16, 32, 16});\n\n        // # elan-2 block\n        // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 34\n        auto repncspelan_34 = RepNCSPELAN4(network, weightMap, *cbfuse_33->getOutput(0), 512, 360, 360, 180, 1,\n                                           \"model.\" + std::to_string(begin));\n        begin += 1;\n\n        // # avg-conv down fuse\n        // [-1, 1, ADown, [512]],  # 35-P5/32\n        auto adown_35 = AConv(network, weightMap, *repncspelan_34->getOutput(0), 480, \"model.\" + std::to_string(begin));\n        begin += 2;\n\n        // [[25, -1], 1, CBFuse, [[2]]], # 36\n        auto cbfuse_36 = CBFuse(network, {cblinear_25, std::vector<ILayer*>{adown_35}}, {2, 0}, {32, 32});\n\n        // # elan-2 block\n        // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 37\n        auto repncspelan_37 = RepNCSPELAN4(network, weightMap, *cbfuse_36->getOutput(0), 512, 480, 480, 240, 1,\n                                           \"model.\" + std::to_string(begin));\n        begin += 1;\n\n        // # detection head\n        // # detect\n        // [[31, 34, 37, 16, 19, 22], 1, DualDDetect, [nc]],  # DualDDetect(A3, A4, A5, P3, P4, P5)\n        head = DualDDetect(network, weightMap, std::vector<ILayer*>{repncspelan_31, repncspelan_34, repncspelan_37},\n                           kNumClass, {240, 360, 480}, \"model.\" + std::to_string(begin));\n    } else {\n        // # detection head\n        // # detect\n        // [[16, 19, 22], 1, DDetect, [nc]],  # DDetect(P3, P4, P5)\n        head = DDetect(network, weightMap, std::vector<ILayer*>{repncspelan_16, repncspelan_19, repncspelan_22},\n                       kNumClass, {240, 360, 480}, \"model.\" + std::to_string(begin));\n    }\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(network, head, false);\n    
yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator =\n            new Int8EntropyCalibrator2(1, kInputW, kInputH, gCalibTablePath, \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\n\nIHostMemory* build_engine_yolov9_c(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt,\n                                   std::string& wts_name) {\n    /* ------ Create the builder ------ */\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{3, kInputH, kInputW});\n    assert(data);\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n    // # conv down\n    // [-1, 1, Conv, [64, 3, 2]],  # 1-P1/2\n    auto conv_1 = convBnSiLU(network, weightMap, *data, 64, 3, 2, 1, \"model.1\", 1);\n    // # conv down\n    // [-1, 1, Conv, [128, 3, 2]],  # 2-P2/4\n    auto conv_2 = convBnSiLU(network, weightMap, *conv_1->getOutput(0), 128, 3, 2, 1, \"model.2\");\n    // # elan-1 block\n    // [-1, 1, RepNCSPELAN4, [256, 128, 64, 1]],  # 3\n    auto 
repncspelan_3 = RepNCSPELAN4(network, weightMap, *conv_2->getOutput(0), 128, 256, 128, 64, 1, \"model.3\");\n    // # avg-conv down\n    // [-1, 1, ADown, [256]],  # 4-P3/8\n    auto adown_4 = ADown(network, weightMap, *repncspelan_3->getOutput(0), 256, \"model.4\");\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 256, 128, 1]],  # 5\n    auto repncspelan_5 = RepNCSPELAN4(network, weightMap, *adown_4->getOutput(0), 256, 512, 256, 128, 1, \"model.5\");\n    // # avg-conv down\n    // [-1, 1, ADown, [512]],  # 6-P4/16\n    auto adown_6 = ADown(network, weightMap, *repncspelan_5->getOutput(0), 512, \"model.6\");\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 7\n    auto repncspelan_7 = RepNCSPELAN4(network, weightMap, *adown_6->getOutput(0), 512, 512, 512, 256, 1, \"model.7\");\n    // # avg-conv down\n    // [-1, 1, ADown, [512]],  # 8-P5/32\n    auto adown_8 = ADown(network, weightMap, *repncspelan_7->getOutput(0), 512, \"model.8\");\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 9\n    auto repncspelan_9 = RepNCSPELAN4(network, weightMap, *adown_8->getOutput(0), 512, 512, 512, 256, 1, \"model.9\");\n    // # elan-spp block\n    // [-1, 1, SPPELAN, [512, 256]],  # 10\n    auto sppelan_10 = SPPELAN(network, weightMap, *repncspelan_9->getOutput(0), 512, 512, 256, \"model.10\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_11 = network->addResize(*sppelan_10->getOutput(0));\n    upsample_11->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_11[] = {1.0, 2.0, 2.0};\n    upsample_11->setScales(scales_11, 3);\n    // [[-1, 7], 1, Concat, [1]],  # cat backbone P4\n    ITensor* input_tensor_12[] = {upsample_11->getOutput(0), repncspelan_7->getOutput(0)};\n    auto cat_12 = network->addConcatenation(input_tensor_12, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 13\n    auto repncspelan_13 = 
RepNCSPELAN4(network, weightMap, *cat_12->getOutput(0), 1536, 512, 512, 256, 1, \"model.13\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_14 = network->addResize(*repncspelan_13->getOutput(0));\n    upsample_14->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_14[] = {1.0, 2.0, 2.0};\n    upsample_14->setScales(scales_14, 3);\n    // [[-1, 5], 1, Concat, [1]],  # cat backbone P3\n    ITensor* input_tensor_15[] = {upsample_14->getOutput(0), repncspelan_5->getOutput(0)};\n    auto cat_15 = network->addConcatenation(input_tensor_15, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [256, 256, 128, 1]],  # 16 (P3/8-small)\n    auto repncspelan_16 = RepNCSPELAN4(network, weightMap, *cat_15->getOutput(0), 1024, 256, 256, 128, 1, \"model.16\");\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [256]],\n    auto adown_17 = ADown(network, weightMap, *repncspelan_16->getOutput(0), 256, \"model.17\");\n    // [[-1, 13], 1, Concat, [1]],  # cat head P4\n    ITensor* input_tensor_18[] = {adown_17->getOutput(0), repncspelan_13->getOutput(0)};\n    auto cat_18 = network->addConcatenation(input_tensor_18, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 19 (P4/16-medium)\n    auto repncspelan_19 = RepNCSPELAN4(network, weightMap, *cat_18->getOutput(0), 768, 512, 512, 256, 1, \"model.19\");\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [512]],\n    auto adown_20 = ADown(network, weightMap, *repncspelan_19->getOutput(0), 512, \"model.20\");\n    // [[-1, 10], 1, Concat, [1]],  # cat head P5\n    ITensor* input_tensor_21[] = {adown_20->getOutput(0), sppelan_10->getOutput(0)};\n    auto cat_21 = network->addConcatenation(input_tensor_21, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 22 (P5/32-large)\n    auto repncspelan_22 = RepNCSPELAN4(network, weightMap, *cat_21->getOutput(0), 1024, 512, 512, 256, 1, \"model.22\");\n\n    
// # multi-level reversible auxiliary branch\n\n    // # routing\n    // [5, 1, CBLinear, [[256]]], # 23\n    auto cblinear_23 = CBLinear(network, weightMap, *repncspelan_5->getOutput(0), {256}, 1, 1, 0, 1, \"model.23\");\n    // [7, 1, CBLinear, [[256, 512]]], # 24\n    auto cblinear_24 = CBLinear(network, weightMap, *repncspelan_7->getOutput(0), {256, 512}, 1, 1, 0, 1, \"model.24\");\n    // [9, 1, CBLinear, [[256, 512, 512]]], # 25\n    auto cblinear_25 =\n            CBLinear(network, weightMap, *repncspelan_9->getOutput(0), {256, 512, 512}, 1, 1, 0, 1, \"model.25\");\n\n    // # conv down\n    // [0, 1, Conv, [64, 3, 2]],  # 26-P1/2\n    auto conv_26 = convBnSiLU(network, weightMap, *data, 64, 3, 2, 1, \"model.26\", 1);\n\n    // # conv down\n    // [-1, 1, Conv, [128, 3, 2]],  # 27-P2/4\n    auto conv_27 = convBnSiLU(network, weightMap, *conv_26->getOutput(0), 128, 3, 2, 1, \"model.27\");\n\n    // # elan-1 block\n    // [-1, 1, RepNCSPELAN4, [256, 128, 64, 1]],  # 28\n    auto repncspelan_28 = RepNCSPELAN4(network, weightMap, *conv_27->getOutput(0), 128, 256, 128, 64, 1, \"model.28\");\n\n    // # avg-conv down fuse\n    // [-1, 1, ADown, [256]],  # 29-P3/8\n    auto adown_29 = ADown(network, weightMap, *repncspelan_28->getOutput(0), 256, \"model.29\");\n    // [[23, 24, 25, -1], 1, CBFuse, [[0, 0, 0]]], # 30\n    auto cbfuse = CBFuse(network, {cblinear_23, cblinear_24, cblinear_25, std::vector<ILayer*>{adown_29}}, {0, 0, 0, 0},\n                         {8, 16, 32, 8});\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 256, 128, 1]],  # 31\n    auto repncspelan_31 = RepNCSPELAN4(network, weightMap, *cbfuse->getOutput(0), 256, 512, 256, 128, 1, \"model.31\");\n\n    // # avg-conv down fuse\n    // [-1, 1, ADown, [512]],  # 32-P4/16\n    auto adown_32 = ADown(network, weightMap, *repncspelan_31->getOutput(0), 512, \"model.32\");\n    // [[24, 25, -1], 1, CBFuse, [[1, 1]]], # 33\n    auto cbfuse_33 =\n            CBFuse(network, {cblinear_24, 
cblinear_25, std::vector<ILayer*>{adown_32}}, {1, 1, 0}, {16, 32, 16});\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 34\n    auto repncspelan_34 = RepNCSPELAN4(network, weightMap, *cbfuse_33->getOutput(0), 512, 512, 512, 256, 1, \"model.34\");\n\n    // # avg-conv down fuse\n    // [-1, 1, ADown, [512]],  # 35-P5/32\n    auto adown_35 = ADown(network, weightMap, *repncspelan_34->getOutput(0), 512, \"model.35\");\n\n    // [[25, -1], 1, CBFuse, [[2]]], # 36\n    auto cbfuse_36 = CBFuse(network, {cblinear_25, std::vector<ILayer*>{adown_35}}, {2, 0}, {32, 32});\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 37\n    auto repncspelan_37 = RepNCSPELAN4(network, weightMap, *cbfuse_36->getOutput(0), 512, 512, 512, 256, 1, \"model.37\");\n\n    // # detection head\n    // # detect\n    // [[31, 34, 37, 16, 19, 22], 1, DualDDetect, [nc]],  # DualDDetect(A3, A4, A5, P3, P4, P5)\n    auto dualddetect_38 =\n            DualDDetect(network, weightMap, std::vector<ILayer*>{repncspelan_31, repncspelan_34, repncspelan_37},\n                        kNumClass, {512, 512, 512}, \"model.38\");\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(network, dualddetect_38, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator =\n            new Int8EntropyCalibrator2(1, kInputW, kInputH, gCalibTablePath, \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\n\nIHostMemory* build_engine_yolov9_e(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt,\n                                   std::string& wts_name) {\n    /* ------ Create the builder ------ */\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{3, kInputH, kInputW});\n    assert(data);\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n    /* ------backbone------ */\n    // [-1, 1, Conv, [64, 3, 2]],  # 1-P1/2\n    auto conv_1 = convBnSiLU(network, weightMap, *data, 64, 3, 2, 1, \"model.1\", 1);\n    assert(conv_1);\n    // [-1, 1, Conv, [128, 3, 2]],  # 2-P2/4\n    auto conv_2 = convBnSiLU(network, weightMap, *conv_1->getOutput(0), 128, 3, 2, 1, \"model.2\");\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [256, 128, 64, 2]],  # 3\n    auto repncspelan_3 = RepNCSPELAN4(network, weightMap, *conv_2->getOutput(0), 128, 256, 128, 64, 2, \"model.3\");\n    // avg-conv down\n    // [-1, 1, ADown, [256]],  # 4-P3/8\n    auto adown_4 = ADown(network, weightMap, *repncspelan_3->getOutput(0), 256, \"model.4\");\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [512, 256, 128, 2]],  # 5\n    auto repncspelan_5 = 
RepNCSPELAN4(network, weightMap, *adown_4->getOutput(0), 256, 512, 256, 128, 2, \"model.5\");\n    // avg-conv down\n    // [-1, 1, ADown, [512]],  # 6-P4/16\n    auto adown_6 = ADown(network, weightMap, *repncspelan_5->getOutput(0), 512, \"model.6\");\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [1024, 512, 256, 2]],  # 7\n    auto repncspelan_7 = RepNCSPELAN4(network, weightMap, *adown_6->getOutput(0), 512, 1024, 512, 256, 2, \"model.7\");\n    // avg-conv down\n    // [-1, 1, ADown, [1024]],  # 8-P5/32\n    auto adown_8 = ADown(network, weightMap, *repncspelan_7->getOutput(0), 1024, \"model.8\");\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [1024, 512, 256, 2]],  # 9\n    auto repncspelan_9 = RepNCSPELAN4(network, weightMap, *adown_8->getOutput(0), 512, 1024, 512, 256, 2, \"model.9\");\n\n    // [1, 1, CBLinear, [[64]]], # 10\n    auto cblinear_10 = CBLinear(network, weightMap, *conv_1->getOutput(0), {64}, 1, 1, 0, 1, \"model.10\");\n    // [3, 1, CBLinear, [[64, 128]]], # 11\n    auto cblinear_11 = CBLinear(network, weightMap, *repncspelan_3->getOutput(0), {64, 128}, 1, 1, 0, 1, \"model.11\");\n    // [5, 1, CBLinear, [[64, 128, 256]]], # 12\n    auto cblinear_12 =\n            CBLinear(network, weightMap, *repncspelan_5->getOutput(0), {64, 128, 256}, 1, 1, 0, 1, \"model.12\");\n    // [7, 1, CBLinear, [[64, 128, 256, 512]]], # 13\n    auto cblinear_13 =\n            CBLinear(network, weightMap, *repncspelan_7->getOutput(0), {64, 128, 256, 512}, 1, 1, 0, 1, \"model.13\");\n    // [9, 1, CBLinear, [[64, 128, 256, 512, 1024]]], # 14\n    auto cblinear_14 = CBLinear(network, weightMap, *repncspelan_9->getOutput(0), {64, 128, 256, 512, 1024}, 1, 1, 0, 1,\n                                \"model.14\");\n\n    // conv down\n    // [0, 1, Conv, [64, 3, 2]],  # 15-P1/2\n    auto conv_15 = convBnSiLU(network, weightMap, *data, 64, 3, 2, 1, \"model.15\", 1);\n    // [[10, 11, 12, 13, 14, -1], 1, CBFuse, [[0, 0, 0, 0, 0]]], # 16\n    auto cbfuse_16 = 
CBFuse(\n            network, {cblinear_10, cblinear_11, cblinear_12, cblinear_13, cblinear_14, std::vector<ILayer*>{conv_15}},\n            {0, 0, 0, 0, 0, 0}, {2, 4, 8, 16, 32, 2});\n\n    // conv down\n    // [-1, 1, Conv, [128, 3, 2]],  # 17-P2/4\n    auto conv_17 = convBnSiLU(network, weightMap, *cbfuse_16->getOutput(0), 128, 3, 2, 1, \"model.17\");\n    // [[11, 12, 13, 14, -1], 1, CBFuse, [[1, 1, 1, 1]]], # 18\n    auto cbfuse_18 =\n            CBFuse(network, {cblinear_11, cblinear_12, cblinear_13, cblinear_14, std::vector<ILayer*>{conv_17}},\n                   {1, 1, 1, 1, 0}, {4, 8, 16, 32, 4});\n\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [256, 128, 64, 2]],  # 19\n    auto repncspelan_19 = RepNCSPELAN4(network, weightMap, *cbfuse_18->getOutput(0), 128, 256, 128, 64, 2, \"model.19\");\n\n    // avg-conv down fuse\n    // [-1, 1, ADown, [256]],  # 20-P3/8\n    auto adown_20 = ADown(network, weightMap, *repncspelan_19->getOutput(0), 256, \"model.20\");\n    // [[12, 13, 14, -1], 1, CBFuse, [[2, 2, 2]]], # 21\n    auto cbfuse_21 = CBFuse(network, {cblinear_12, cblinear_13, cblinear_14, std::vector<ILayer*>{adown_20}},\n                            {2, 2, 2, 0}, {8, 16, 32, 8});\n\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [512, 256, 128, 2]],  # 22\n    auto repncspelan_22 = RepNCSPELAN4(network, weightMap, *cbfuse_21->getOutput(0), 256, 512, 256, 128, 2, \"model.22\");\n\n    // avg-conv down fuse\n    // [-1, 1, ADown, [512]],  # 23-P4/16\n    auto adown_23 = ADown(network, weightMap, *repncspelan_22->getOutput(0), 512, \"model.23\");\n    // [[13, 14, -1], 1, CBFuse, [[3, 3]]], # 24\n    auto cbfuse_24 =\n            CBFuse(network, {cblinear_13, cblinear_14, std::vector<ILayer*>{adown_23}}, {3, 3, 0}, {16, 32, 16});\n\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [1024, 512, 256, 2]],  # 25\n    auto repncspelan_25 =\n            RepNCSPELAN4(network, weightMap, *cbfuse_24->getOutput(0), 512, 1024, 512, 256, 2, 
\"model.25\");\n\n    // avg-conv down fuse\n    // [-1, 1, ADown, [1024]],  # 26-P5/32\n    auto adown_26 = ADown(network, weightMap, *repncspelan_25->getOutput(0), 1024, \"model.26\");\n    // [[14, -1], 1, CBFuse, [[4]]], # 27\n    auto cbfuse_27 = CBFuse(network, {cblinear_14, std::vector<ILayer*>{adown_26}}, {4, 0}, {32, 32});\n\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [1024, 512, 256, 2]],  # 28\n    auto repncspelan_28 =\n            RepNCSPELAN4(network, weightMap, *cbfuse_27->getOutput(0), 512, 1024, 512, 256, 2, \"model.28\");\n\n    // elan-spp block\n    // [9, 1, SPPELAN, [512, 256]],  # 29\n    auto sppelan_29 = SPPELAN(network, weightMap, *repncspelan_9->getOutput(0), 1024, 512, 256, \"model.29\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_30 = network->addResize(*sppelan_29->getOutput(0));\n    upsample_30->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_30[] = {1.0, 2.0, 2.0};\n    upsample_30->setScales(scales_30, 3);\n    // [[-1, 7], 1, Concat, [1]],  # cat backbone P4\n    ITensor* input_tensor_31[] = {upsample_30->getOutput(0), repncspelan_7->getOutput(0)};\n    auto cat_31 = network->addConcatenation(input_tensor_31, 2);\n\n    // # csp-elan block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 32\n    auto repncspelan_32 = RepNCSPELAN4(network, weightMap, *cat_31->getOutput(0), 1536, 512, 512, 256, 2, \"model.32\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_33 = network->addResize(*repncspelan_32->getOutput(0));\n    upsample_33->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_33[] = {1.0, 2.0, 2.0};\n    upsample_33->setScales(scales_33, 3);\n    // [[-1, 5], 1, Concat, [1]],  # cat backbone P3\n    ITensor* input_tensor_34[] = {upsample_33->getOutput(0), repncspelan_5->getOutput(0)};\n    auto cat_34 = network->addConcatenation(input_tensor_34, 2);\n\n    // # csp-elan block\n    // 
[-1, 1, RepNCSPELAN4, [256, 256, 128, 2]],  # 35\n    auto repncspelan_35 = RepNCSPELAN4(network, weightMap, *cat_34->getOutput(0), 1024, 256, 256, 128, 2, \"model.35\");\n\n    // # elan-spp block\n    // [28, 1, SPPELAN, [512, 256]],  # 36\n    auto sppelan_36 = SPPELAN(network, weightMap, *repncspelan_28->getOutput(0), 1024, 512, 256, \"model.36\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_37 = network->addResize(*sppelan_36->getOutput(0));\n    upsample_37->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_37[] = {1.0, 2.0, 2.0};\n    upsample_37->setScales(scales_37, 3);\n    // [[-1, 25], 1, Concat, [1]],  # cat backbone P4\n    ITensor* input_tensor_38[] = {upsample_37->getOutput(0), repncspelan_25->getOutput(0)};\n    auto cat_38 = network->addConcatenation(input_tensor_38, 2);\n\n    // # csp-elan block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 39\n    auto repncspelan_39 = RepNCSPELAN4(network, weightMap, *cat_38->getOutput(0), 1536, 512, 512, 256, 2, \"model.39\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_40 = network->addResize(*repncspelan_39->getOutput(0));\n    upsample_40->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_40[] = {1.0, 2.0, 2.0};\n    upsample_40->setScales(scales_40, 3);\n    // [[-1, 22], 1, Concat, [1]],  # cat backbone P3\n    ITensor* input_tensor_41[] = {upsample_40->getOutput(0), repncspelan_22->getOutput(0)};\n    auto cat_41 = network->addConcatenation(input_tensor_41, 2);\n\n    // # csp-elan block\n    // [-1, 1, RepNCSPELAN4, [256, 256, 128, 2]],  # 42 (P3/8-small)\n    auto repncspelan_42 = RepNCSPELAN4(network, weightMap, *cat_41->getOutput(0), 1024, 256, 256, 128, 2, \"model.42\");\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [256]],\n    auto adown_43 = ADown(network, weightMap, *repncspelan_42->getOutput(0), 256, \"model.43\");\n    // [[-1, 39], 1, Concat, 
[1]],  # cat head P4\n    ITensor* input_tensor_44[] = {adown_43->getOutput(0), repncspelan_39->getOutput(0)};\n    auto cat_44 = network->addConcatenation(input_tensor_44, 2);\n\n    // # csp-elan block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 45 (P4/16-medium)\n    auto repncspelan_45 = RepNCSPELAN4(network, weightMap, *cat_44->getOutput(0), 768, 512, 512, 256, 2, \"model.45\");\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [512]],\n    auto adown_46 = ADown(network, weightMap, *repncspelan_45->getOutput(0), 512, \"model.46\");\n    // [[-1, 36], 1, Concat, [1]],  # cat head P5\n    ITensor* input_tensor_47[] = {adown_46->getOutput(0), sppelan_36->getOutput(0)};\n    auto cat_47 = network->addConcatenation(input_tensor_47, 2);\n\n    // # csp-elan block\n    // [-1, 1, RepNCSPELAN4, [512, 1024, 512, 2]],  # 48 (P5/32-large)\n    auto repncspelan_48 = RepNCSPELAN4(network, weightMap, *cat_47->getOutput(0), 1024, 512, 1024, 512, 2, \"model.48\");\n\n    // auto DualDDetect_49 = DualDDetect(network, weightMap, std::vector<ILayer*>{RepNCSPELAN_42, RepNCSPELAN_45, RepNCSPELAN_48}, kNumClass, {256, 512, 512}, \"model.49\");\n    auto dualddetect_49 =\n            DualDDetect(network, weightMap, std::vector<ILayer*>{repncspelan_35, repncspelan_32, sppelan_29}, kNumClass,\n                        {256, 512, 512}, \"model.49\");\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(network, dualddetect_49, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator =\n            new Int8EntropyCalibrator2(1, kInputW, kInputH, gCalibTablePath, \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\n\nIHostMemory* build_engine_gelan_e(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt,\n                                  std::string& wts_name) {\n    /* ------ Create the builder ------ */\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{3, kInputH, kInputW});\n    assert(data);\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n    /* ------backbone------ */\n    // [-1, 1, Conv, [64, 3, 2]],  # 1-P1/2\n    auto conv_1 = convBnSiLU(network, weightMap, *data, 64, 3, 2, 1, \"model.1\", 1);\n    assert(conv_1);\n    // [-1, 1, Conv, [128, 3, 2]],  # 2-P2/4\n    auto conv_2 = convBnSiLU(network, weightMap, *conv_1->getOutput(0), 128, 3, 2, 1, \"model.2\");\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [256, 128, 64, 2]],  # 3\n    auto repncspelan_3 = RepNCSPELAN4(network, weightMap, *conv_2->getOutput(0), 128, 256, 128, 64, 2, \"model.3\");\n    // avg-conv down\n    // [-1, 1, ADown, [256]],  # 4-P3/8\n    auto adown_4 = ADown(network, weightMap, *repncspelan_3->getOutput(0), 256, \"model.4\");\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [512, 256, 128, 2]],  # 5\n    auto repncspelan_5 = 
RepNCSPELAN4(network, weightMap, *adown_4->getOutput(0), 256, 512, 256, 128, 2, \"model.5\");\n    // avg-conv down\n    // [-1, 1, ADown, [512]],  # 6-P4/16\n    auto adown_6 = ADown(network, weightMap, *repncspelan_5->getOutput(0), 512, \"model.6\");\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [1024, 512, 256, 2]],  # 7\n    auto repncspelan_7 = RepNCSPELAN4(network, weightMap, *adown_6->getOutput(0), 512, 1024, 512, 256, 2, \"model.7\");\n    // avg-conv down\n    // [-1, 1, ADown, [1024]],  # 8-P5/32\n    auto adown_8 = ADown(network, weightMap, *repncspelan_7->getOutput(0), 1024, \"model.8\");\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [1024, 512, 256, 2]],  # 9\n    auto repncspelan_9 = RepNCSPELAN4(network, weightMap, *adown_8->getOutput(0), 512, 1024, 512, 256, 2, \"model.9\");\n\n    // [1, 1, CBLinear, [[64]]], # 10\n    auto cblinear_10 = CBLinear(network, weightMap, *conv_1->getOutput(0), {64}, 1, 1, 0, 1, \"model.10\");\n    // [3, 1, CBLinear, [[64, 128]]], # 11\n    auto cblinear_11 = CBLinear(network, weightMap, *repncspelan_3->getOutput(0), {64, 128}, 1, 1, 0, 1, \"model.11\");\n    // [5, 1, CBLinear, [[64, 128, 256]]], # 12\n    auto cblinear_12 =\n            CBLinear(network, weightMap, *repncspelan_5->getOutput(0), {64, 128, 256}, 1, 1, 0, 1, \"model.12\");\n    // [7, 1, CBLinear, [[64, 128, 256, 512]]], # 13\n    auto cblinear_13 =\n            CBLinear(network, weightMap, *repncspelan_7->getOutput(0), {64, 128, 256, 512}, 1, 1, 0, 1, \"model.13\");\n    // [9, 1, CBLinear, [[64, 128, 256, 512, 1024]]], # 14\n    auto cblinear_14 = CBLinear(network, weightMap, *repncspelan_9->getOutput(0), {64, 128, 256, 512, 1024}, 1, 1, 0, 1,\n                                \"model.14\");\n\n    // conv down\n    // [0, 1, Conv, [64, 3, 2]],  # 15-P1/2\n    auto conv_15 = convBnSiLU(network, weightMap, *data, 64, 3, 2, 1, \"model.15\", 1);\n    // [[10, 11, 12, 13, 14, -1], 1, CBFuse, [[0, 0, 0, 0, 0]]], # 16\n    auto cbfuse_16 = 
CBFuse(\n            network, {cblinear_10, cblinear_11, cblinear_12, cblinear_13, cblinear_14, std::vector<ILayer*>{conv_15}},\n            {0, 0, 0, 0, 0, 0}, {2, 4, 8, 16, 32, 2});\n\n    // conv down\n    // [-1, 1, Conv, [128, 3, 2]],  # 17-P2/4\n    auto conv_17 = convBnSiLU(network, weightMap, *cbfuse_16->getOutput(0), 128, 3, 2, 1, \"model.17\");\n    // [[11, 12, 13, 14, -1], 1, CBFuse, [[1, 1, 1, 1]]], # 18\n    auto cbfuse_18 =\n            CBFuse(network, {cblinear_11, cblinear_12, cblinear_13, cblinear_14, std::vector<ILayer*>{conv_17}},\n                   {1, 1, 1, 1, 0}, {4, 8, 16, 32, 4});\n\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [256, 128, 64, 2]],  # 19\n    auto repncspelan_19 = RepNCSPELAN4(network, weightMap, *cbfuse_18->getOutput(0), 128, 256, 128, 64, 2, \"model.19\");\n\n    // avg-conv down fuse\n    // [-1, 1, ADown, [256]],  # 20-P3/8\n    auto adown_20 = ADown(network, weightMap, *repncspelan_19->getOutput(0), 256, \"model.20\");\n    // [[12, 13, 14, -1], 1, CBFuse, [[2, 2, 2]]], # 21\n    auto cbfuse_21 = CBFuse(network, {cblinear_12, cblinear_13, cblinear_14, std::vector<ILayer*>{adown_20}},\n                            {2, 2, 2, 0}, {8, 16, 32, 8});\n\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [512, 256, 128, 2]],  # 22\n    auto repncspelan_22 = RepNCSPELAN4(network, weightMap, *cbfuse_21->getOutput(0), 256, 512, 256, 128, 2, \"model.22\");\n\n    // avg-conv down fuse\n    // [-1, 1, ADown, [512]],  # 23-P4/16\n    auto adown_23 = ADown(network, weightMap, *repncspelan_22->getOutput(0), 512, \"model.23\");\n    // [[13, 14, -1], 1, CBFuse, [[3, 3]]], # 24\n    auto cbfuse_24 =\n            CBFuse(network, {cblinear_13, cblinear_14, std::vector<ILayer*>{adown_23}}, {3, 3, 0}, {16, 32, 16});\n\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [1024, 512, 256, 2]],  # 25\n    auto repncspelan_25 =\n            RepNCSPELAN4(network, weightMap, *cbfuse_24->getOutput(0), 512, 1024, 512, 256, 2, 
\"model.25\");\n\n    // avg-conv down fuse\n    // [-1, 1, ADown, [1024]],  # 26-P5/32\n    auto adown_26 = ADown(network, weightMap, *repncspelan_25->getOutput(0), 1024, \"model.26\");\n    // [[14, -1], 1, CBFuse, [[4]]], # 27\n    auto cbfuse_27 = CBFuse(network, {cblinear_14, std::vector<ILayer*>{adown_26}}, {4, 0}, {32, 32});\n\n    // csp-elan block\n    // [-1, 1, RepNCSPELAN4, [1024, 512, 256, 2]],  # 28\n    auto repncspelan_28 =\n            RepNCSPELAN4(network, weightMap, *cbfuse_27->getOutput(0), 512, 1024, 512, 256, 2, \"model.28\");\n\n    // elan-spp block\n    // [28, 1, SPPELAN, [512, 256]],  # 29\n    auto sppelan_29 = SPPELAN(network, weightMap, *repncspelan_28->getOutput(0), 1024, 512, 256, \"model.29\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_30 = network->addResize(*sppelan_29->getOutput(0));\n    upsample_30->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_30[] = {1.0, 2.0, 2.0};\n    upsample_30->setScales(scales_30, 3);\n    // [[-1, 25], 1, Concat, [1]],  # cat backbone P4\n    ITensor* input_tensor_31[] = {upsample_30->getOutput(0), repncspelan_25->getOutput(0)};\n    auto cat_31 = network->addConcatenation(input_tensor_31, 2);\n\n    // # csp-elan block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 32\n    auto repncspelan_32 = RepNCSPELAN4(network, weightMap, *cat_31->getOutput(0), 1536, 512, 512, 256, 2, \"model.32\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_33 = network->addResize(*repncspelan_32->getOutput(0));\n    upsample_33->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_33[] = {1.0, 2.0, 2.0};\n    upsample_33->setScales(scales_33, 3);\n    // [[-1, 22], 1, Concat, [1]],  # cat backbone P3\n    ITensor* input_tensor_34[] = {upsample_33->getOutput(0), repncspelan_22->getOutput(0)};\n    auto cat_34 = network->addConcatenation(input_tensor_34, 2);\n\n    // # csp-elan block\n   
 // [-1, 1, RepNCSPELAN4, [256, 256, 128, 2]],  # 35\n    auto repncspelan_35 = RepNCSPELAN4(network, weightMap, *cat_34->getOutput(0), 1024, 256, 256, 128, 2, \"model.35\");\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [256]],\n    auto adown_36 = ADown(network, weightMap, *repncspelan_35->getOutput(0), 256, \"model.36\");\n    // [[-1, 32], 1, Concat, [1]],  # cat head P4\n    ITensor* input_tensor_37[] = {adown_36->getOutput(0), repncspelan_32->getOutput(0)};\n    auto cat_37 = network->addConcatenation(input_tensor_37, 2);\n\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 38 (P4/16-medium)\n    auto repncspelan_38 = RepNCSPELAN4(network, weightMap, *cat_37->getOutput(0), 768, 512, 512, 256, 2, \"model.38\");\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [512]],\n    auto adown_39 = ADown(network, weightMap, *repncspelan_38->getOutput(0), 512, \"model.39\");\n    // [[-1, 29], 1, Concat, [1]],  # cat head P5\n    ITensor* input_tensor_40[] = {adown_39->getOutput(0), sppelan_29->getOutput(0)};\n    auto cat_40 = network->addConcatenation(input_tensor_40, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 1024, 512, 2]],  # 41 (P5/32-large)\n    auto repncspelan_41 = RepNCSPELAN4(network, weightMap, *cat_40->getOutput(0), 1024, 512, 1024, 512, 2, \"model.41\");\n\n    auto ddetect_42 = DDetect(network, weightMap, std::vector<ILayer*>{repncspelan_35, repncspelan_38, repncspelan_41},\n                              kNumClass, {256, 512, 512}, \"model.42\");\n\n    nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(network, ddetect_42, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? 
\"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator =\n            new Int8EntropyCalibrator2(1, kInputW, kInputH, gCalibTablePath, \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\nIHostMemory* build_engine_gelan_c(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt,\n                                  std::string& wts_name) {\n    /* ------ Create the builder ------ */\n    INetworkDefinition* network = builder->createNetworkV2(0U);\n\n    ITensor* data = network->addInput(kInputTensorName, dt, Dims3{3, kInputH, kInputW});\n    assert(data);\n    std::map<std::string, Weights> weightMap = loadWeights(wts_name);\n\n    // # conv down\n    // [-1, 1, Conv, [64, 3, 2]],  # 1-P1/2\n    auto conv_1 = convBnSiLU(network, weightMap, *data, 64, 3, 2, 1, \"model.0\", 1);\n    // # conv down\n    // [-1, 1, Conv, [128, 3, 2]],  # 2-P2/4\n    auto conv_2 = convBnSiLU(network, weightMap, *conv_1->getOutput(0), 128, 3, 2, 1, \"model.1\");\n    // # elan-1 block\n    // [-1, 1, RepNCSPELAN4, [256, 128, 64, 1]],  # 3\n    auto repncspelan_3 = RepNCSPELAN4(network, weightMap, *conv_2->getOutput(0), 128, 256, 128, 64, 1, \"model.2\");\n    // # avg-conv down\n    // [-1, 1, ADown, [256]],  # 4-P3/8\n    auto adown_4 = ADown(network, weightMap, *repncspelan_3->getOutput(0), 256, \"model.3\");\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 256, 128, 1]],  # 5\n    auto repncspelan_5 = 
RepNCSPELAN4(network, weightMap, *adown_4->getOutput(0), 256, 512, 256, 128, 1, \"model.4\");\n    // # avg-conv down\n    // [-1, 1, ADown, [512]],  # 6-P4/16\n    auto adown_6 = ADown(network, weightMap, *repncspelan_5->getOutput(0), 512, \"model.5\");\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 7\n    auto repncspelan_7 = RepNCSPELAN4(network, weightMap, *adown_6->getOutput(0), 512, 512, 512, 256, 1, \"model.6\");\n    // # avg-conv down\n    // [-1, 1, ADown, [512]],  # 8-P5/32\n    auto adown_8 = ADown(network, weightMap, *repncspelan_7->getOutput(0), 512, \"model.7\");\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 9\n    auto repncspelan_9 = RepNCSPELAN4(network, weightMap, *adown_8->getOutput(0), 512, 512, 512, 256, 1, \"model.8\");\n    // # elan-spp block\n    // [-1, 1, SPPELAN, [512, 256]],  # 10\n    auto sppelan_10 = SPPELAN(network, weightMap, *repncspelan_9->getOutput(0), 512, 512, 256, \"model.9\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_11 = network->addResize(*sppelan_10->getOutput(0));\n    upsample_11->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_11[] = {1.0, 2.0, 2.0};\n    upsample_11->setScales(scales_11, 3);\n    // [[-1, 7], 1, Concat, [1]],  # cat backbone P4\n    ITensor* input_tensor_12[] = {upsample_11->getOutput(0), repncspelan_7->getOutput(0)};\n    auto cat_12 = network->addConcatenation(input_tensor_12, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 13\n    auto repncspelan_13 = RepNCSPELAN4(network, weightMap, *cat_12->getOutput(0), 1536, 512, 512, 256, 1, \"model.12\");\n\n    // # up-concat merge\n    // [-1, 1, nn.Upsample, [None, 2, 'nearest']],\n    auto upsample_14 = network->addResize(*repncspelan_13->getOutput(0));\n    upsample_14->setResizeMode(ResizeMode::kNEAREST);\n    const float scales_14[] = {1.0, 2.0, 2.0};\n    
upsample_14->setScales(scales_14, 3);\n    // [[-1, 5], 1, Concat, [1]],  # cat backbone P3\n    ITensor* input_tensor_15[] = {upsample_14->getOutput(0), repncspelan_5->getOutput(0)};\n    auto cat_15 = network->addConcatenation(input_tensor_15, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [256, 256, 128, 1]],  # 16 (P3/8-small)\n    auto repncspelan_16 = RepNCSPELAN4(network, weightMap, *cat_15->getOutput(0), 1024, 256, 256, 128, 1, \"model.15\");\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [256]],\n    auto adown_17 = ADown(network, weightMap, *repncspelan_16->getOutput(0), 256, \"model.16\");\n    // [[-1, 13], 1, Concat, [1]],  # cat head P4\n    ITensor* input_tensor_18[] = {adown_17->getOutput(0), repncspelan_13->getOutput(0)};\n    auto cat_18 = network->addConcatenation(input_tensor_18, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 19 (P4/16-medium)\n    auto repncspelan_19 = RepNCSPELAN4(network, weightMap, *cat_18->getOutput(0), 768, 512, 512, 256, 1, \"model.18\");\n\n    // # avg-conv-down merge\n    // [-1, 1, ADown, [512]],\n    auto adown_20 = ADown(network, weightMap, *repncspelan_19->getOutput(0), 512, \"model.19\");\n    // [[-1, 10], 1, Concat, [1]],  # cat head P5\n    ITensor* input_tensor_21[] = {adown_20->getOutput(0), sppelan_10->getOutput(0)};\n    auto cat_21 = network->addConcatenation(input_tensor_21, 2);\n\n    // # elan-2 block\n    // [-1, 1, RepNCSPELAN4, [512, 512, 256, 1]],  # 22 (P5/32-large)\n    auto repncspelan_22 = RepNCSPELAN4(network, weightMap, *cat_21->getOutput(0), 1024, 512, 512, 256, 1, \"model.21\");\n\n    // # detection head\n    // # detect\n    // [[31, 34, 37, 16, 19, 22], 1, DualDDetect, [nc]],  # DualDDetect(A3, A4, A5, P3, P4, P5)\n    auto ddetect_23 = DDetect(network, weightMap, std::vector<ILayer*>{repncspelan_16, repncspelan_19, repncspelan_22},\n                              kNumClass, {256, 512, 512}, \"model.22\");\n\n    
nvinfer1::IPluginV2Layer* yolo = addYoLoLayer(network, ddetect_23, false);\n    yolo->getOutput(0)->setName(kOutputTensorName);\n    network->markOutput(*yolo->getOutput(0));\n\n    builder->setMaxBatchSize(kBatchSize);\n    config->setMaxWorkspaceSize(16 * (1 << 20));\n\n#if defined(USE_FP16)\n    config->setFlag(nvinfer1::BuilderFlag::kFP16);\n#elif defined(USE_INT8)\n    std::cout << \"Your platform support int8: \" << (builder->platformHasFastInt8() ? \"true\" : \"false\") << std::endl;\n    assert(builder->platformHasFastInt8());\n    config->setFlag(nvinfer1::BuilderFlag::kINT8);\n    auto* calibrator =\n            new Int8EntropyCalibrator2(1, kInputW, kInputH, gCalibTablePath, \"int8calib.table\", kInputTensorName);\n    config->setInt8Calibrator(calibrator);\n#endif\n\n    std::cout << \"Building engine, please wait for a while...\" << std::endl;\n    IHostMemory* serialized_model = builder->buildSerializedNetwork(*network, *config);\n    std::cout << \"Build engine successfully!\" << std::endl;\n\n    delete network;\n\n    // Release host memory\n    for (auto& mem : weightMap) {\n        free((void*)(mem.second.values));\n    }\n\n    return serialized_model;\n}\n"
  },
  {
    "path": "yolov9/src/postprocess.cpp",
    "content": "#include \"postprocess.h\"\n#include \"utils.h\"\ncv::Rect get_rect(cv::Mat& img, float bbox[4]) {\n    float l, r, t, b;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        l = bbox[0] - bbox[2] / 2.f;\n        r = bbox[0] + bbox[2] / 2.f;\n        t = bbox[1] - bbox[3] / 2.f - (kInputH - r_w * img.rows) / 2;\n        b = bbox[1] + bbox[3] / 2.f - (kInputH - r_w * img.rows) / 2;\n        l = l / r_w;\n        r = r / r_w;\n        t = t / r_w;\n        b = b / r_w;\n    } else {\n        l = bbox[0] - bbox[2] / 2.f - (kInputW - r_h * img.cols) / 2;\n        r = bbox[0] + bbox[2] / 2.f - (kInputW - r_h * img.cols) / 2;\n        t = bbox[1] - bbox[3] / 2.f;\n        b = bbox[1] + bbox[3] / 2.f;\n        l = l / r_h;\n        r = r / r_h;\n        t = t / r_h;\n        b = b / r_h;\n    }\n    return cv::Rect(round(l), round(t), round(r - l), round(b - t));\n}\n\nstatic float iou(float lbox[4], float rbox[4]) {\n    float interBox[] = {\n            (std::max)(lbox[0] - lbox[2] / 2.f, rbox[0] - rbox[2] / 2.f),  //left\n            (std::min)(lbox[0] + lbox[2] / 2.f, rbox[0] + rbox[2] / 2.f),  //right\n            (std::max)(lbox[1] - lbox[3] / 2.f, rbox[1] - rbox[3] / 2.f),  //top\n            (std::min)(lbox[1] + lbox[3] / 2.f, rbox[1] + rbox[3] / 2.f),  //bottom\n    };\n\n    if (interBox[2] > interBox[3] || interBox[0] > interBox[1])\n        return 0.0f;\n\n    float interBoxS = (interBox[1] - interBox[0]) * (interBox[3] - interBox[2]);\n    return interBoxS / (lbox[2] * lbox[3] + rbox[2] * rbox[3] - interBoxS);\n}\n\nstatic bool cmp(const Detection& a, const Detection& b) {\n    return a.conf > b.conf;\n}\n\nvoid nms(std::vector<Detection>& res, float* output, float conf_thresh, float nms_thresh) {\n    int det_size = sizeof(Detection) / sizeof(float);\n    std::map<float, std::vector<Detection>> m;\n    for (int i = 0; i < output[0] && i < kMaxNumOutputBbox; i++) {\n        if 
(output[1 + det_size * i + 4] <= conf_thresh)\n            continue;\n        Detection det;\n        memcpy(&det, &output[1 + det_size * i], det_size * sizeof(float));\n        if (m.count(det.class_id) == 0)\n            m.emplace(det.class_id, std::vector<Detection>());\n        // x1x2y1y2 -> xywh\n        float c_x = (det.bbox[0] + det.bbox[2]) / 2;\n        float c_y = (det.bbox[1] + det.bbox[3]) / 2;\n        float w = det.bbox[2] - det.bbox[0];\n        float h = det.bbox[3] - det.bbox[1];\n        det.bbox[0] = c_x;\n        det.bbox[1] = c_y;\n        det.bbox[2] = w;\n        det.bbox[3] = h;\n        m[det.class_id].push_back(det);\n    }\n    for (auto it = m.begin(); it != m.end(); it++) {\n        auto& dets = it->second;\n        std::sort(dets.begin(), dets.end(), cmp);\n        for (size_t m = 0; m < dets.size(); ++m) {\n            auto& item = dets[m];\n            res.push_back(item);\n            for (size_t n = m + 1; n < dets.size(); ++n) {\n                if (iou(item.bbox, dets[n].bbox) > nms_thresh) {\n                    dets.erase(dets.begin() + n);\n                    --n;\n                }\n            }\n        }\n    }\n}\n\nvoid batch_nms(std::vector<std::vector<Detection>>& res_batch, float* output, int batch_size, int output_size,\n               float conf_thresh, float nms_thresh) {\n    res_batch.resize(batch_size);\n    for (int i = 0; i < batch_size; i++) {\n        nms(res_batch[i], &output[i * output_size], conf_thresh, nms_thresh);\n    }\n}\n\nvoid draw_bbox(std::vector<cv::Mat>& img_batch, std::vector<std::vector<Detection>>& res_batch) {\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        auto& res = res_batch[i];\n        cv::Mat img = img_batch[i];\n        for (size_t j = 0; j < res.size(); j++) {\n            cv::Rect r = get_rect(img, res[j].bbox);\n            cv::rectangle(img, r, cv::Scalar(0x27, 0xC1, 0x36), 2);\n            cv::putText(img, std::to_string((int)res[j].class_id), cv::Point(r.x, 
r.y - 1), cv::FONT_HERSHEY_PLAIN, 1.2,\n                        cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n        }\n    }\n    // draw the number of objects on each image\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        cv::putText(img_batch[i], std::to_string(res_batch[i].size()), cv::Point(0, 20), cv::FONT_HERSHEY_PLAIN, 1.2,\n                    cv::Scalar(0xFF, 0xFF, 0xFF), 2);\n    }\n}\n\nstatic cv::Rect get_downscale_rect(float bbox[4], float scale) {\n    float left = bbox[0] - bbox[2] / 2;\n    float top = bbox[1] - bbox[3] / 2;\n    float right = bbox[0] + bbox[2] / 2;\n    float bottom = bbox[1] + bbox[3] / 2;\n    left /= scale;\n    top /= scale;\n    right /= scale;\n    bottom /= scale;\n    return cv::Rect(round(left), round(top), round(right - left), round(bottom - top));\n}\n\n// std::vector<cv::Mat> process_mask(const float* proto, int proto_size, std::vector<Detection>& dets) {\n//     std::vector<cv::Mat> masks;\n//     for (size_t i = 0; i < dets.size(); i++) {\n//         cv::Mat mask_mat = cv::Mat::zeros(kInputH / 4, kInputW / 4, CV_32FC1);\n//         auto r = get_downscale_rect(dets[i].bbox, 4);\n//         for (int x = r.x; x < r.x + r.width; x++) {\n//             for (int y = r.y; y < r.y + r.height; y++) {\n//                 float e = 0.0f;\n//                 for (int j = 0; j < 32; j++) {\n//                     e += dets[i].mask[j] * proto[j * proto_size / 32 + y * mask_mat.cols + x];\n//                 }\n//                 e = 1.0f / (1.0f + expf(-e));\n//                 mask_mat.at<float>(y, x) = e;\n//             }\n//         }\n//         cv::resize(mask_mat, mask_mat, cv::Size(kInputW, kInputH));\n//         masks.push_back(mask_mat);\n//     }\n//     return masks;\n// }\n\ncv::Mat scale_mask(cv::Mat mask, cv::Mat img) {\n    int x, y, w, h;\n    float r_w = kInputW / (img.cols * 1.0);\n    float r_h = kInputH / (img.rows * 1.0);\n    if (r_h > r_w) {\n        w = kInputW;\n        h = r_w * img.rows;\n        x = 0;\n        y 
= (kInputH - h) / 2;\n    } else {\n        w = r_h * img.cols;\n        h = kInputH;\n        x = (kInputW - w) / 2;\n        y = 0;\n    }\n    cv::Rect r(x, y, w, h);\n    cv::Mat res;\n    cv::resize(mask(r), res, img.size());\n    return res;\n}\n\nvoid draw_mask_bbox(cv::Mat& img, std::vector<Detection>& dets, std::vector<cv::Mat>& masks,\n                    std::unordered_map<int, std::string>& labels_map) {\n    static std::vector<uint32_t> colors = {0xFF3838, 0xFF9D97, 0xFF701F, 0xFFB21D, 0xCFD231, 0x48F90A, 0x92CC17,\n                                           0x3DDB86, 0x1A9334, 0x00D4BB, 0x2C99A8, 0x00C2FF, 0x344593, 0x6473FF,\n                                           0x0018EC, 0x8438FF, 0x520085, 0xCB38FF, 0xFF95C8, 0xFF37C7};\n    for (size_t i = 0; i < dets.size(); i++) {\n        cv::Mat img_mask = scale_mask(masks[i], img);\n        auto color = colors[(int)dets[i].class_id % colors.size()];\n        auto bgr = cv::Scalar(color & 0xFF, color >> 8 & 0xFF, color >> 16 & 0xFF);\n\n        cv::Rect r = get_rect(img, dets[i].bbox);\n        for (int x = r.x; x < r.x + r.width; x++) {\n            for (int y = r.y; y < r.y + r.height; y++) {\n                float val = img_mask.at<float>(y, x);\n                if (val <= 0.5)\n                    continue;\n                img.at<cv::Vec3b>(y, x)[0] = img.at<cv::Vec3b>(y, x)[0] / 2 + bgr[0] / 2;\n                img.at<cv::Vec3b>(y, x)[1] = img.at<cv::Vec3b>(y, x)[1] / 2 + bgr[1] / 2;\n                img.at<cv::Vec3b>(y, x)[2] = img.at<cv::Vec3b>(y, x)[2] / 2 + bgr[2] / 2;\n            }\n        }\n\n        cv::rectangle(img, r, bgr, 2);\n\n        // Get the size of the text\n        cv::Size textSize =\n                cv::getTextSize(labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                                cv::FONT_HERSHEY_PLAIN, 1.2, 2, NULL);\n        // Set the top left corner of the rectangle\n        cv::Point topLeft(r.x, r.y - 
textSize.height);\n\n        // Set the bottom right corner of the rectangle\n        cv::Point bottomRight(r.x + textSize.width, r.y + textSize.height);\n\n        // Set the thickness of the rectangle lines\n        int lineThickness = 2;\n\n        // Draw the rectangle on the image\n        cv::rectangle(img, topLeft, bottomRight, bgr, -1);\n\n        cv::putText(img, labels_map[(int)dets[i].class_id] + \" \" + to_string_with_precision(dets[i].conf),\n                    cv::Point(r.x, r.y + 4), cv::FONT_HERSHEY_PLAIN, 1.2, cv::Scalar::all(0xFF), 2);\n    }\n}\nvoid process_decode_ptr_host(std::vector<Detection>& res, const float* decode_ptr_host, int bbox_element, cv::Mat& img,\n                             int count) {\n    Detection det;\n    for (int i = 0; i < count; i++) {\n        int basic_pos = 1 + i * bbox_element;\n        int keep_flag = decode_ptr_host[basic_pos + 6];\n        if (keep_flag == 1) {\n            det.bbox[0] = decode_ptr_host[basic_pos + 0];\n            det.bbox[1] = decode_ptr_host[basic_pos + 1];\n            det.bbox[2] = decode_ptr_host[basic_pos + 2];\n            det.bbox[3] = decode_ptr_host[basic_pos + 3];\n            det.conf = decode_ptr_host[basic_pos + 4];\n            det.class_id = decode_ptr_host[basic_pos + 5];\n            res.push_back(det);\n        }\n    }\n}\nvoid batch_process(std::vector<std::vector<Detection>>& res_batch, const float* decode_ptr_host, int batch_size,\n                   int bbox_element, const std::vector<cv::Mat>& img_batch) {\n    res_batch.resize(batch_size);\n    int count = static_cast<int>(*decode_ptr_host);\n    count = count > kMaxNumOutputBbox ? kMaxNumOutputBbox : count;\n    // std::min(count, kMaxNumOutputBbox);\n    for (int i = 0; i < batch_size; i++) {\n        auto& img = const_cast<cv::Mat&>(img_batch[i]);\n        process_decode_ptr_host(res_batch[i], &decode_ptr_host[i * count], bbox_element, img, count);\n    }\n}\n"
  },
  {
    "path": "yolov9/src/postprocess.cu",
    "content": "//\n// Created by lindsay on 23-7-17.\n//\n#include \"postprocess.h\"\n\nstatic __global__ void decode_kernel(float* predict, int num_bboxes, float confidence_threshold, float* parray,\n                                     int max_objects) {\n\n    float count = predict[0];\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    if (position >= count)\n        return;\n    float* pitem = predict + 1 + position * 6;\n    int index = atomicAdd(parray, 1);\n    if (index >= max_objects)\n        return;\n    float confidence = pitem[4];\n    if (confidence < confidence_threshold)\n        return;\n    float* pout_item = parray + 1 + index * bbox_element;\n    float left = pitem[0];\n    float top = pitem[1];\n    float right = pitem[2];\n    float bottom = pitem[3];\n    float label = pitem[5];\n    *pout_item++ = left;\n    *pout_item++ = top;\n    *pout_item++ = right;\n    *pout_item++ = bottom;\n    *pout_item++ = confidence;\n    *pout_item++ = label;\n    *pout_item++ = 1;  // 1 = keep, 0 = ignore\n}\n\nstatic __device__ float box_iou(float aleft, float atop, float aright, float abottom, float bleft, float btop,\n                                float bright, float bbottom) {\n\n    float cleft = max(aleft, bleft);\n    float ctop = max(atop, btop);\n    float cright = min(aright, bright);\n    float cbottom = min(abottom, bbottom);\n\n    float c_area = max(cright - cleft, 0.0f) * max(cbottom - ctop, 0.0f);\n    if (c_area == 0.0f)\n        return 0.0f;\n\n    float a_area = max(0.0f, aright - aleft) * max(0.0f, abottom - atop);\n    float b_area = max(0.0f, bright - bleft) * max(0.0f, bbottom - btop);\n    return c_area / (a_area + b_area - c_area);\n}\n\nstatic __global__ void nms_kernel(float* bboxes, int max_objects, float threshold) {\n\n    int position = (blockDim.x * blockIdx.x + threadIdx.x);\n    int count = min(static_cast<int>(bboxes[0]), max_objects);\n\n    // float count = 0.0f;\n    if (position >= count)\n        
return;\n\n    float* pcurrent = bboxes + 1 + position * bbox_element;\n    for (int i = 0; i < count; ++i) {\n        float* pitem = bboxes + 1 + i * bbox_element;\n        if (i == position || pcurrent[5] != pitem[5])\n            continue;\n\n        if (pitem[4] >= pcurrent[4]) {\n            if (pitem[4] == pcurrent[4] && i < position)\n                continue;\n\n            float iou =\n                    box_iou(pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3], pitem[0], pitem[1], pitem[2], pitem[3]);\n\n            if (iou > threshold) {\n                pcurrent[6] = 0;\n                return;\n            }\n        }\n    }\n}\n// Confidence filtering\nvoid cuda_decode(float* predict, int num_bboxes, float confidence_threshold, float* parray, int max_objects,\n                 cudaStream_t stream) {\n    int block = 256;\n    int grid = ceil(num_bboxes / (float)block);\n    decode_kernel<<<grid, block, 0, stream>>>((float*)predict, num_bboxes, confidence_threshold, parray, max_objects);\n}\n\nvoid cuda_nms(float* parray, float nms_threshold, int max_objects, cudaStream_t stream) {\n    int block = max_objects < 256 ? max_objects : 256;\n    int grid = ceil(max_objects / (float)block);\n    nms_kernel<<<grid, block, 0, stream>>>(parray, max_objects, nms_threshold);\n}\n"
  },
  {
    "path": "yolov9/src/preprocess.cu",
    "content": "#include \"preprocess.h\"\n#include \"cuda_utils.h\"\n\nstatic uint8_t* img_buffer_host = nullptr;\nstatic uint8_t* img_buffer_device = nullptr;\n\nstruct AffineMatrix {\n    float value[6];\n};\n\n__global__ void warpaffine_kernel(\n    uint8_t* src, int src_line_size, int src_width,\n    int src_height, float* dst, int dst_width,\n    int dst_height, uint8_t const_value_st,\n    AffineMatrix d2s, int edge) {\n    int position = blockDim.x * blockIdx.x + threadIdx.x;\n    if (position >= edge) return;\n\n    float m_x1 = d2s.value[0];\n    float m_y1 = d2s.value[1];\n    float m_z1 = d2s.value[2];\n    float m_x2 = d2s.value[3];\n    float m_y2 = d2s.value[4];\n    float m_z2 = d2s.value[5];\n\n    int dx = position % dst_width;\n    int dy = position / dst_width;\n    float src_x = m_x1 * dx + m_y1 * dy + m_z1 + 0.5f;\n    float src_y = m_x2 * dx + m_y2 * dy + m_z2 + 0.5f;\n    float c0, c1, c2;\n\n    if (src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height) {\n        // out of range\n        c0 = const_value_st;\n        c1 = const_value_st;\n        c2 = const_value_st;\n    } else {\n        int y_low = floorf(src_y);\n        int x_low = floorf(src_x);\n        int y_high = y_low + 1;\n        int x_high = x_low + 1;\n\n        uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};\n        float ly = src_y - y_low;\n        float lx = src_x - x_low;\n        float hy = 1 - ly;\n        float hx = 1 - lx;\n        float w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;\n        uint8_t* v1 = const_value;\n        uint8_t* v2 = const_value;\n        uint8_t* v3 = const_value;\n        uint8_t* v4 = const_value;\n\n        if (y_low >= 0) {\n            if (x_low >= 0)\n                v1 = src + y_low * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v2 = src + y_low * src_line_size + x_high * 3;\n        }\n\n        if (y_high < src_height) {\n            if 
(x_low >= 0)\n                v3 = src + y_high * src_line_size + x_low * 3;\n\n            if (x_high < src_width)\n                v4 = src + y_high * src_line_size + x_high * 3;\n        }\n\n        c0 = w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0];\n        c1 = w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1];\n        c2 = w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2];\n    }\n\n    // bgr to rgb \n    float t = c2;\n    c2 = c0;\n    c0 = t;\n\n    // normalization\n    c0 = c0 / 255.0f;\n    c1 = c1 / 255.0f;\n    c2 = c2 / 255.0f;\n\n    // rgbrgbrgb to rrrgggbbb\n    int area = dst_width * dst_height;\n    float* pdst_c0 = dst + dy * dst_width + dx;\n    float* pdst_c1 = pdst_c0 + area;\n    float* pdst_c2 = pdst_c1 + area;\n    *pdst_c0 = c0;\n    *pdst_c1 = c1;\n    *pdst_c2 = c2;\n}\n\nvoid cuda_preprocess(\n    uint8_t* src, int src_width, int src_height,\n    float* dst, int dst_width, int dst_height,\n    cudaStream_t stream) {\n\n    int img_size = src_width * src_height * 3;\n    // copy data to pinned memory\n    memcpy(img_buffer_host, src, img_size);\n    // copy data to device memory\n    CUDA_CHECK(cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size, cudaMemcpyHostToDevice, stream));\n\n    AffineMatrix s2d, d2s;\n    float scale = std::min(dst_height / (float)src_height, dst_width / (float)src_width);\n\n    s2d.value[0] = scale;\n    s2d.value[1] = 0;\n    s2d.value[2] = -scale * src_width  * 0.5  + dst_width * 0.5;\n    s2d.value[3] = 0;\n    s2d.value[4] = scale;\n    s2d.value[5] = -scale * src_height * 0.5 + dst_height * 0.5;\n\n    cv::Mat m2x3_s2d(2, 3, CV_32F, s2d.value);\n    cv::Mat m2x3_d2s(2, 3, CV_32F, d2s.value);\n    cv::invertAffineTransform(m2x3_s2d, m2x3_d2s);\n\n    memcpy(d2s.value, m2x3_d2s.ptr<float>(0), sizeof(d2s.value));\n\n    int jobs = dst_height * dst_width;\n    int threads = 256;\n    int blocks = ceil(jobs / (float)threads);\n\n    warpaffine_kernel<<<blocks, threads, 0, stream>>>(\n        
img_buffer_device, src_width * 3, src_width,\n        src_height, dst, dst_width,\n        dst_height, 128, d2s, jobs);\n}\n\nvoid cuda_batch_preprocess(std::vector<cv::Mat>& img_batch,\n                           float* dst, int dst_width, int dst_height,\n                           cudaStream_t stream) {\n    int dst_size = dst_width * dst_height * 3;\n    for (size_t i = 0; i < img_batch.size(); i++) {\n        cuda_preprocess(img_batch[i].ptr(), img_batch[i].cols, img_batch[i].rows, &dst[dst_size * i], dst_width, dst_height, stream);\n        CUDA_CHECK(cudaStreamSynchronize(stream));\n    }\n}\n\nvoid cuda_preprocess_init(int max_image_size) {\n    // prepare input data in pinned memory\n    CUDA_CHECK(cudaMallocHost((void**)&img_buffer_host, max_image_size * 3));\n    // prepare input data in device memory\n    CUDA_CHECK(cudaMalloc((void**)&img_buffer_device, max_image_size * 3));\n}\n\nvoid cuda_preprocess_destroy() {\n    CUDA_CHECK(cudaFree(img_buffer_device));\n    CUDA_CHECK(cudaFreeHost(img_buffer_host));\n}\n\n"
  },
  {
    "path": "yolov9/windows/dirent.h",
    "content": "/*\n * Dirent interface for Microsoft Visual Studio\n *\n * Copyright (C) 1998-2019 Toni Ronkko\n * This file is part of dirent.  Dirent may be freely distributed\n * under the MIT license.  For all details and documentation, see\n * https://github.com/tronkko/dirent\n */\n#ifndef DIRENT_H\n#define DIRENT_H\n\n/* Hide warnings about unreferenced local functions */\n#if defined(__clang__)\n#   pragma clang diagnostic ignored \"-Wunused-function\"\n#elif defined(_MSC_VER)\n#   pragma warning(disable:4505)\n#elif defined(__GNUC__)\n#   pragma GCC diagnostic ignored \"-Wunused-function\"\n#endif\n\n/*\n * Include windows.h without Windows Sockets 1.1 to prevent conflicts with\n * Windows Sockets 2.0.\n */\n#ifndef WIN32_LEAN_AND_MEAN\n#   define WIN32_LEAN_AND_MEAN\n#endif\n#include <windows.h>\n\n#include <stdio.h>\n#include <stdarg.h>\n#include <wchar.h>\n#include <string.h>\n#include <stdlib.h>\n#include <malloc.h>\n#include <sys/types.h>\n#include <sys/stat.h>\n#include <errno.h>\n\n/* Indicates that d_type field is available in dirent structure */\n#define _DIRENT_HAVE_D_TYPE\n\n/* Indicates that d_namlen field is available in dirent structure */\n#define _DIRENT_HAVE_D_NAMLEN\n\n/* Entries missing from MSVC 6.0 */\n#if !defined(FILE_ATTRIBUTE_DEVICE)\n#   define FILE_ATTRIBUTE_DEVICE 0x40\n#endif\n\n/* File type and permission flags for stat(), general mask */\n#if !defined(S_IFMT)\n#   define S_IFMT _S_IFMT\n#endif\n\n/* Directory bit */\n#if !defined(S_IFDIR)\n#   define S_IFDIR _S_IFDIR\n#endif\n\n/* Character device bit */\n#if !defined(S_IFCHR)\n#   define S_IFCHR _S_IFCHR\n#endif\n\n/* Pipe bit */\n#if !defined(S_IFFIFO)\n#   define S_IFFIFO _S_IFFIFO\n#endif\n\n/* Regular file bit */\n#if !defined(S_IFREG)\n#   define S_IFREG _S_IFREG\n#endif\n\n/* Read permission */\n#if !defined(S_IREAD)\n#   define S_IREAD _S_IREAD\n#endif\n\n/* Write permission */\n#if !defined(S_IWRITE)\n#   define S_IWRITE _S_IWRITE\n#endif\n\n/* Execute permission 
*/\n#if !defined(S_IEXEC)\n#   define S_IEXEC _S_IEXEC\n#endif\n\n/* Pipe */\n#if !defined(S_IFIFO)\n#   define S_IFIFO _S_IFIFO\n#endif\n\n/* Block device */\n#if !defined(S_IFBLK)\n#   define S_IFBLK 0\n#endif\n\n/* Link */\n#if !defined(S_IFLNK)\n#   define S_IFLNK 0\n#endif\n\n/* Socket */\n#if !defined(S_IFSOCK)\n#   define S_IFSOCK 0\n#endif\n\n/* Read user permission */\n#if !defined(S_IRUSR)\n#   define S_IRUSR S_IREAD\n#endif\n\n/* Write user permission */\n#if !defined(S_IWUSR)\n#   define S_IWUSR S_IWRITE\n#endif\n\n/* Execute user permission */\n#if !defined(S_IXUSR)\n#   define S_IXUSR 0\n#endif\n\n/* Read group permission */\n#if !defined(S_IRGRP)\n#   define S_IRGRP 0\n#endif\n\n/* Write group permission */\n#if !defined(S_IWGRP)\n#   define S_IWGRP 0\n#endif\n\n/* Execute group permission */\n#if !defined(S_IXGRP)\n#   define S_IXGRP 0\n#endif\n\n/* Read others permission */\n#if !defined(S_IROTH)\n#   define S_IROTH 0\n#endif\n\n/* Write others permission */\n#if !defined(S_IWOTH)\n#   define S_IWOTH 0\n#endif\n\n/* Execute others permission */\n#if !defined(S_IXOTH)\n#   define S_IXOTH 0\n#endif\n\n/* Maximum length of file name */\n#if !defined(PATH_MAX)\n#   define PATH_MAX MAX_PATH\n#endif\n#if !defined(FILENAME_MAX)\n#   define FILENAME_MAX MAX_PATH\n#endif\n#if !defined(NAME_MAX)\n#   define NAME_MAX FILENAME_MAX\n#endif\n\n/* File type flags for d_type */\n#define DT_UNKNOWN 0\n#define DT_REG S_IFREG\n#define DT_DIR S_IFDIR\n#define DT_FIFO S_IFIFO\n#define DT_SOCK S_IFSOCK\n#define DT_CHR S_IFCHR\n#define DT_BLK S_IFBLK\n#define DT_LNK S_IFLNK\n\n/* Macros for converting between st_mode and d_type */\n#define IFTODT(mode) ((mode) & S_IFMT)\n#define DTTOIF(type) (type)\n\n/*\n * File type macros.  Note that block devices, sockets and links cannot be\n * distinguished on Windows and the macros S_ISBLK, S_ISSOCK and S_ISLNK are\n * only defined for compatibility.  
These macros should always return false\n * on Windows.\n */\n#if !defined(S_ISFIFO)\n#   define S_ISFIFO(mode) (((mode) & S_IFMT) == S_IFIFO)\n#endif\n#if !defined(S_ISDIR)\n#   define S_ISDIR(mode) (((mode) & S_IFMT) == S_IFDIR)\n#endif\n#if !defined(S_ISREG)\n#   define S_ISREG(mode) (((mode) & S_IFMT) == S_IFREG)\n#endif\n#if !defined(S_ISLNK)\n#   define S_ISLNK(mode) (((mode) & S_IFMT) == S_IFLNK)\n#endif\n#if !defined(S_ISSOCK)\n#   define S_ISSOCK(mode) (((mode) & S_IFMT) == S_IFSOCK)\n#endif\n#if !defined(S_ISCHR)\n#   define S_ISCHR(mode) (((mode) & S_IFMT) == S_IFCHR)\n#endif\n#if !defined(S_ISBLK)\n#   define S_ISBLK(mode) (((mode) & S_IFMT) == S_IFBLK)\n#endif\n\n/* Return the exact length of the file name without zero terminator */\n#define _D_EXACT_NAMLEN(p) ((p)->d_namlen)\n\n/* Return the maximum size of a file name */\n#define _D_ALLOC_NAMLEN(p) ((PATH_MAX)+1)\n\n\n#ifdef __cplusplus\nextern \"C\" {\n#endif\n\n\n/* Wide-character version */\nstruct _wdirent {\n    /* Always zero */\n    long d_ino;\n\n    /* File position within stream */\n    long d_off;\n\n    /* Structure size */\n    unsigned short d_reclen;\n\n    /* Length of name without \\0 */\n    size_t d_namlen;\n\n    /* File type */\n    int d_type;\n\n    /* File name */\n    wchar_t d_name[PATH_MAX+1];\n};\ntypedef struct _wdirent _wdirent;\n\nstruct _WDIR {\n    /* Current directory entry */\n    struct _wdirent ent;\n\n    /* Private file data */\n    WIN32_FIND_DATAW data;\n\n    /* True if data is valid */\n    int cached;\n\n    /* Win32 search handle */\n    HANDLE handle;\n\n    /* Initial directory name */\n    wchar_t *patt;\n};\ntypedef struct _WDIR _WDIR;\n\n/* Multi-byte character version */\nstruct dirent {\n    /* Always zero */\n    long d_ino;\n\n    /* File position within stream */\n    long d_off;\n\n    /* Structure size */\n    unsigned short d_reclen;\n\n    /* Length of name without \\0 */\n    size_t d_namlen;\n\n    /* File type */\n    int d_type;\n\n    /* 
File name */\n    char d_name[PATH_MAX+1];\n};\ntypedef struct dirent dirent;\n\nstruct DIR {\n    struct dirent ent;\n    struct _WDIR *wdirp;\n};\ntypedef struct DIR DIR;\n\n\n/* Dirent functions */\nstatic DIR *opendir (const char *dirname);\nstatic _WDIR *_wopendir (const wchar_t *dirname);\n\nstatic struct dirent *readdir (DIR *dirp);\nstatic struct _wdirent *_wreaddir (_WDIR *dirp);\n\nstatic int readdir_r(\n        DIR *dirp, struct dirent *entry, struct dirent **result);\nstatic int _wreaddir_r(\n        _WDIR *dirp, struct _wdirent *entry, struct _wdirent **result);\n\nstatic int closedir (DIR *dirp);\nstatic int _wclosedir (_WDIR *dirp);\n\nstatic void rewinddir (DIR* dirp);\nstatic void _wrewinddir (_WDIR* dirp);\n\nstatic int scandir (const char *dirname, struct dirent ***namelist,\n                    int (*filter)(const struct dirent*),\n                    int (*compare)(const struct dirent**, const struct dirent**));\n\nstatic int alphasort (const struct dirent **a, const struct dirent **b);\n\nstatic int versionsort (const struct dirent **a, const struct dirent **b);\n\n\n/* For compatibility with Symbian */\n#define wdirent _wdirent\n#define WDIR _WDIR\n#define wopendir _wopendir\n#define wreaddir _wreaddir\n#define wclosedir _wclosedir\n#define wrewinddir _wrewinddir\n\n\n/* Internal utility functions */\nstatic WIN32_FIND_DATAW *dirent_first (_WDIR *dirp);\nstatic WIN32_FIND_DATAW *dirent_next (_WDIR *dirp);\n\nstatic int dirent_mbstowcs_s(\n        size_t *pReturnValue,\n        wchar_t *wcstr,\n        size_t sizeInWords,\n        const char *mbstr,\n        size_t count);\n\nstatic int dirent_wcstombs_s(\n        size_t *pReturnValue,\n        char *mbstr,\n        size_t sizeInBytes,\n        const wchar_t *wcstr,\n        size_t count);\n\nstatic void dirent_set_errno (int error);\n\n\n/*\n * Open directory stream DIRNAME for read and return a pointer to the\n * internal working area that is used to retrieve individual directory\n * 
entries.\n */\nstatic _WDIR*\n_wopendir(\n        const wchar_t *dirname)\n{\n    _WDIR *dirp;\n#if WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_DESKTOP)\n    /* Desktop */\n    DWORD n;\n#else\n    /* WinRT */\n    size_t n;\n#endif\n    wchar_t *p;\n\n    /* Must have directory name */\n    if (dirname == NULL  ||  dirname[0] == '\\0') {\n        dirent_set_errno (ENOENT);\n        return NULL;\n    }\n\n    /* Allocate new _WDIR structure */\n    dirp = (_WDIR*) malloc (sizeof (struct _WDIR));\n    if (!dirp) {\n        return NULL;\n    }\n\n    /* Reset _WDIR structure */\n    dirp->handle = INVALID_HANDLE_VALUE;\n    dirp->patt = NULL;\n    dirp->cached = 0;\n\n    /*\n     * Compute the length of full path plus zero terminator\n     *\n     * Note that on WinRT there's no way to convert relative paths\n     * into absolute paths, so just assume it is an absolute path.\n     */\n#if WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_DESKTOP)\n    /* Desktop */\n    n = GetFullPathNameW (dirname, 0, NULL, NULL);\n#else\n    /* WinRT */\n    n = wcslen (dirname);\n#endif\n\n    /* Allocate room for absolute directory name and search pattern */\n    dirp->patt = (wchar_t*) malloc (sizeof (wchar_t) * n + 16);\n    if (dirp->patt == NULL) {\n        goto exit_closedir;\n    }\n\n    /*\n     * Convert relative directory name to an absolute one.  
This\n     * allows rewinddir() to function correctly even when current\n     * working directory is changed between opendir() and rewinddir().\n     *\n     * Note that on WinRT there's no way to convert relative paths\n     * into absolute paths, so just assume it is an absolute path.\n     */\n#if WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_DESKTOP)\n    /* Desktop */\n    n = GetFullPathNameW (dirname, n, dirp->patt, NULL);\n    if (n <= 0) {\n        goto exit_closedir;\n    }\n#else\n    /* WinRT */\n    wcsncpy_s (dirp->patt, n+1, dirname, n);\n#endif\n\n    /* Append search pattern \\* to the directory name */\n    p = dirp->patt + n;\n    switch (p[-1]) {\n        case '\\\\':\n        case '/':\n        case ':':\n            /* Directory ends in path separator, e.g. c:\\temp\\ */\n            /*NOP*/;\n            break;\n\n        default:\n            /* Directory name doesn't end in path separator */\n            *p++ = '\\\\';\n    }\n    *p++ = '*';\n    *p = '\\0';\n\n    /* Open directory stream and retrieve the first entry */\n    if (!dirent_first (dirp)) {\n        goto exit_closedir;\n    }\n\n    /* Success */\n    return dirp;\n\n    /* Failure */\n    exit_closedir:\n    _wclosedir (dirp);\n    return NULL;\n}\n\n/*\n * Read next directory entry.\n *\n * Returns pointer to static directory entry which may be overwritten by\n * subsequent calls to _wreaddir().\n */\nstatic struct _wdirent*\n_wreaddir(\n        _WDIR *dirp)\n{\n    struct _wdirent *entry;\n\n    /*\n     * Read directory entry to buffer.  We can safely ignore the return value\n     * as entry will be set to NULL in case of error.\n     */\n    (void) _wreaddir_r (dirp, &dirp->ent, &entry);\n\n    /* Return pointer to statically allocated directory entry */\n    return entry;\n}\n\n/*\n * Read next directory entry.\n *\n * Returns zero on success.  
If end of directory stream is reached, then sets\n * result to NULL and returns zero.\n */\nstatic int\n_wreaddir_r(\n        _WDIR *dirp,\n        struct _wdirent *entry,\n        struct _wdirent **result)\n{\n    WIN32_FIND_DATAW *datap;\n\n    /* Read next directory entry */\n    datap = dirent_next (dirp);\n    if (datap) {\n        size_t n;\n        DWORD attr;\n\n        /*\n         * Copy file name as wide-character string.  If the file name is too\n         * long to fit in to the destination buffer, then truncate file name\n         * to PATH_MAX characters and zero-terminate the buffer.\n         */\n        n = 0;\n        while (n < PATH_MAX  &&  datap->cFileName[n] != 0) {\n            entry->d_name[n] = datap->cFileName[n];\n            n++;\n        }\n        entry->d_name[n] = 0;\n\n        /* Length of file name excluding zero terminator */\n        entry->d_namlen = n;\n\n        /* File type */\n        attr = datap->dwFileAttributes;\n        if ((attr & FILE_ATTRIBUTE_DEVICE) != 0) {\n            entry->d_type = DT_CHR;\n        } else if ((attr & FILE_ATTRIBUTE_DIRECTORY) != 0) {\n            entry->d_type = DT_DIR;\n        } else {\n            entry->d_type = DT_REG;\n        }\n\n        /* Reset dummy fields */\n        entry->d_ino = 0;\n        entry->d_off = 0;\n        entry->d_reclen = sizeof (struct _wdirent);\n\n        /* Set result address */\n        *result = entry;\n\n    } else {\n\n        /* Return NULL to indicate end of directory */\n        *result = NULL;\n\n    }\n\n    return /*OK*/0;\n}\n\n/*\n * Close directory stream opened by opendir() function.  
This invalidates the\n * DIR structure as well as any directory entry read previously by\n * _wreaddir().\n */\nstatic int\n_wclosedir(\n        _WDIR *dirp)\n{\n    int ok;\n    if (dirp) {\n\n        /* Release search handle */\n        if (dirp->handle != INVALID_HANDLE_VALUE) {\n            FindClose (dirp->handle);\n        }\n\n        /* Release search pattern */\n        free (dirp->patt);\n\n        /* Release directory structure */\n        free (dirp);\n        ok = /*success*/0;\n\n    } else {\n\n        /* Invalid directory stream */\n        dirent_set_errno (EBADF);\n        ok = /*failure*/-1;\n\n    }\n    return ok;\n}\n\n/*\n * Rewind directory stream such that _wreaddir() returns the very first\n * file name again.\n */\nstatic void\n_wrewinddir(\n        _WDIR* dirp)\n{\n    if (dirp) {\n        /* Release existing search handle */\n        if (dirp->handle != INVALID_HANDLE_VALUE) {\n            FindClose (dirp->handle);\n        }\n\n        /* Open new search handle */\n        dirent_first (dirp);\n    }\n}\n\n/* Get first directory entry (internal) */\nstatic WIN32_FIND_DATAW*\ndirent_first(\n        _WDIR *dirp)\n{\n    WIN32_FIND_DATAW *datap;\n    DWORD error;\n\n    /* Open directory and retrieve the first entry */\n    dirp->handle = FindFirstFileExW(\n            dirp->patt, FindExInfoStandard, &dirp->data,\n            FindExSearchNameMatch, NULL, 0);\n    if (dirp->handle != INVALID_HANDLE_VALUE) {\n\n        /* a directory entry is now waiting in memory */\n        datap = &dirp->data;\n        dirp->cached = 1;\n\n    } else {\n\n        /* Failed to open directory: no directory entry in memory */\n        dirp->cached = 0;\n        datap = NULL;\n\n        /* Set error code */\n        error = GetLastError ();\n        switch (error) {\n            case ERROR_ACCESS_DENIED:\n                /* No read access to directory */\n                dirent_set_errno (EACCES);\n                break;\n\n            case 
ERROR_DIRECTORY:\n                /* Directory name is invalid */\n                dirent_set_errno (ENOTDIR);\n                break;\n\n            case ERROR_PATH_NOT_FOUND:\n            default:\n                /* Cannot find the file */\n                dirent_set_errno (ENOENT);\n        }\n\n    }\n    return datap;\n}\n\n/*\n * Get next directory entry (internal).\n *\n * Returns a pointer to the entry data, or NULL when the end of the stream\n * is reached or an error occurs.\n */\nstatic WIN32_FIND_DATAW*\ndirent_next(\n        _WDIR *dirp)\n{\n    WIN32_FIND_DATAW *p;\n\n    /* Get next directory entry */\n    if (dirp->cached != 0) {\n\n        /* A valid directory entry already in memory */\n        p = &dirp->data;\n        dirp->cached = 0;\n\n    } else if (dirp->handle != INVALID_HANDLE_VALUE) {\n\n        /* Get the next directory entry from stream */\n        if (FindNextFileW (dirp->handle, &dirp->data) != FALSE) {\n            /* Got a file */\n            p = &dirp->data;\n        } else {\n            /* The very last entry has been processed or an error occurred */\n            FindClose (dirp->handle);\n            dirp->handle = INVALID_HANDLE_VALUE;\n            p = NULL;\n        }\n\n    } else {\n\n        /* End of directory stream reached */\n        p = NULL;\n\n    }\n\n    return p;\n}\n\n/*\n * Open directory stream using plain old C-string.\n */\nstatic DIR*\nopendir(\n        const char *dirname)\n{\n    struct DIR *dirp;\n\n    /* Must have directory name */\n    if (dirname == NULL  ||  dirname[0] == '\\0') {\n        dirent_set_errno (ENOENT);\n        return NULL;\n    }\n\n    /* Allocate memory for DIR structure */\n    dirp = (DIR*) malloc (sizeof (struct DIR));\n    if (!dirp) {\n        return NULL;\n    }\n    {\n        int error;\n        wchar_t wname[PATH_MAX + 1];\n        size_t n;\n\n        /* Convert directory name to wide-character string */\n        error = dirent_mbstowcs_s(\n                &n, wname, PATH_MAX + 1, dirname, PATH_MAX + 1);\n        if (error) {\n            /*\n             * Cannot 
convert file name to wide-character string.  This\n             * occurs if the string contains invalid multi-byte sequences or\n             * the output buffer is too small to contain the resulting\n             * string.\n             */\n            goto exit_free;\n        }\n\n\n        /* Open directory stream using wide-character name */\n        dirp->wdirp = _wopendir (wname);\n        if (!dirp->wdirp) {\n            goto exit_free;\n        }\n\n    }\n\n    /* Success */\n    return dirp;\n\n    /* Failure */\n    exit_free:\n    free (dirp);\n    return NULL;\n}\n\n/*\n * Read next directory entry.\n */\nstatic struct dirent*\nreaddir(\n        DIR *dirp)\n{\n    struct dirent *entry;\n\n    /*\n     * Read directory entry to buffer.  We can safely ignore the return value\n     * as entry will be set to NULL in case of error.\n     */\n    (void) readdir_r (dirp, &dirp->ent, &entry);\n\n    /* Return pointer to statically allocated directory entry */\n    return entry;\n}\n\n/*\n * Read next directory entry into caller-allocated buffer.\n *\n * Returns zero on success.  If the end of directory stream is reached, then\n * sets result to NULL and returns zero.\n */\nstatic int\nreaddir_r(\n        DIR *dirp,\n        struct dirent *entry,\n        struct dirent **result)\n{\n    WIN32_FIND_DATAW *datap;\n\n    /* Read next directory entry */\n    datap = dirent_next (dirp->wdirp);\n    if (datap) {\n        size_t n;\n        int error;\n\n        /* Attempt to convert file name to multi-byte string */\n        error = dirent_wcstombs_s(\n                &n, entry->d_name, PATH_MAX + 1, datap->cFileName, PATH_MAX + 1);\n\n        /*\n         * If the file name cannot be represented by a multi-byte string,\n         * then attempt to use old 8+3 file name.  
This allows traditional\n         * Unix-code to access some file names despite Unicode\n         * characters, although file names may seem unfamiliar to the user.\n         *\n         * Beware that the code below cannot come up with a short file\n         * name unless the file system provides one.  At least\n         * VirtualBox shared folders fail to do this.\n         */\n        if (error  &&  datap->cAlternateFileName[0] != '\\0') {\n            error = dirent_wcstombs_s(\n                    &n, entry->d_name, PATH_MAX + 1,\n                    datap->cAlternateFileName, PATH_MAX + 1);\n        }\n\n        if (!error) {\n            DWORD attr;\n\n            /* Length of file name excluding zero terminator */\n            entry->d_namlen = n - 1;\n\n            /* File attributes */\n            attr = datap->dwFileAttributes;\n            if ((attr & FILE_ATTRIBUTE_DEVICE) != 0) {\n                entry->d_type = DT_CHR;\n            } else if ((attr & FILE_ATTRIBUTE_DIRECTORY) != 0) {\n                entry->d_type = DT_DIR;\n            } else {\n                entry->d_type = DT_REG;\n            }\n\n            /* Reset dummy fields */\n            entry->d_ino = 0;\n            entry->d_off = 0;\n            entry->d_reclen = sizeof (struct dirent);\n\n        } else {\n\n            /*\n             * Cannot convert file name to multi-byte string so construct\n             * an erroneous directory entry and return that.  
Note that\n             * we cannot return NULL as that would stop the processing\n             * of directory entries completely.\n             */\n            entry->d_name[0] = '?';\n            entry->d_name[1] = '\\0';\n            entry->d_namlen = 1;\n            entry->d_type = DT_UNKNOWN;\n            entry->d_ino = 0;\n            entry->d_off = -1;\n            entry->d_reclen = 0;\n\n        }\n\n        /* Return pointer to directory entry */\n        *result = entry;\n\n    } else {\n\n        /* No more directory entries */\n        *result = NULL;\n\n    }\n\n    return /*OK*/0;\n}\n\n/*\n * Close directory stream.\n */\nstatic int\nclosedir(\n        DIR *dirp)\n{\n    int ok;\n    if (dirp) {\n\n        /* Close wide-character directory stream */\n        ok = _wclosedir (dirp->wdirp);\n        dirp->wdirp = NULL;\n\n        /* Release multi-byte character version */\n        free (dirp);\n\n    } else {\n\n        /* Invalid directory stream */\n        dirent_set_errno (EBADF);\n        ok = /*failure*/-1;\n\n    }\n    return ok;\n}\n\n/*\n * Rewind directory stream to beginning.\n */\nstatic void\nrewinddir(\n        DIR* dirp)\n{\n    /* Rewind wide-character string directory stream */\n    _wrewinddir (dirp->wdirp);\n}\n\n/*\n * Scan directory for entries.\n */\nstatic int\nscandir(\n        const char *dirname,\n        struct dirent ***namelist,\n        int (*filter)(const struct dirent*),\n        int (*compare)(const struct dirent**, const struct dirent**))\n{\n    struct dirent **files = NULL;\n    size_t size = 0;\n    size_t allocated = 0;\n    const size_t init_size = 1;\n    DIR *dir = NULL;\n    struct dirent *entry;\n    struct dirent *tmp = NULL;\n    size_t i;\n    int result = 0;\n\n    /* Open directory stream */\n    dir = opendir (dirname);\n    if (dir) {\n\n        /* Read directory entries to memory */\n        while (1) {\n\n            /* Enlarge pointer table to make room for another pointer */\n            if (size 
>= allocated) {\n                void *p;\n                size_t num_entries;\n\n                /* Compute number of entries in the enlarged pointer table */\n                if (size < init_size) {\n                    /* Allocate initial pointer table */\n                    num_entries = init_size;\n                } else {\n                    /* Double the size */\n                    num_entries = size * 2;\n                }\n\n                /* Allocate first pointer table or enlarge existing table */\n                p = realloc (files, sizeof (void*) * num_entries);\n                if (p != NULL) {\n                    /* Got the memory */\n                    files = (dirent**) p;\n                    allocated = num_entries;\n                } else {\n                    /* Out of memory */\n                    result = -1;\n                    break;\n                }\n\n            }\n\n            /* Allocate room for temporary directory entry */\n            if (tmp == NULL) {\n                tmp = (struct dirent*) malloc (sizeof (struct dirent));\n                if (tmp == NULL) {\n                    /* Cannot allocate temporary directory entry */\n                    result = -1;\n                    break;\n                }\n            }\n\n            /* Read directory entry to temporary area */\n            if (readdir_r (dir, tmp, &entry) == /*OK*/0) {\n\n                /* Did we get an entry? 
*/\n                if (entry != NULL) {\n                    int pass;\n\n                    /* Determine whether to include the entry in result */\n                    if (filter) {\n                        /* Let the filter function decide */\n                        pass = filter (tmp);\n                    } else {\n                        /* No filter function, include everything */\n                        pass = 1;\n                    }\n\n                    if (pass) {\n                        /* Store the temporary entry to pointer table */\n                        files[size++] = tmp;\n                        tmp = NULL;\n\n                        /* Keep up with the number of files */\n                        result++;\n                    }\n\n                } else {\n\n                    /*\n                     * End of directory stream reached => sort entries and\n                     * exit.\n                     */\n                    qsort (files, size, sizeof (void*),\n                           (int (*) (const void*, const void*)) compare);\n                    break;\n\n                }\n\n            } else {\n                /* Error reading directory entry */\n                result = /*Error*/ -1;\n                break;\n            }\n\n        }\n\n    } else {\n        /* Cannot open directory */\n        result = /*Error*/ -1;\n    }\n\n    /* Release temporary directory entry */\n    free (tmp);\n\n    /* Release allocated memory on error */\n    if (result < 0) {\n        for (i = 0; i < size; i++) {\n            free (files[i]);\n        }\n        free (files);\n        files = NULL;\n    }\n\n    /* Close directory stream */\n    if (dir) {\n        closedir (dir);\n    }\n\n    /* Pass pointer table to caller */\n    if (namelist) {\n        *namelist = files;\n    }\n    return result;\n}\n\n/* Alphabetical sorting */\nstatic int\nalphasort(\n        const struct dirent **a, const struct dirent **b)\n{\n    return 
strcoll ((*a)->d_name, (*b)->d_name);\n}\n\n/* Sort versions */\nstatic int\nversionsort(\n        const struct dirent **a, const struct dirent **b)\n{\n    /* FIXME: implement strverscmp and use that */\n    return alphasort (a, b);\n}\n\n/* Convert multi-byte string to wide character string */\nstatic int\ndirent_mbstowcs_s(\n        size_t *pReturnValue,\n        wchar_t *wcstr,\n        size_t sizeInWords,\n        const char *mbstr,\n        size_t count)\n{\n    int error;\n\n#if defined(_MSC_VER)  &&  _MSC_VER >= 1400\n\n    /* Microsoft Visual Studio 2005 or later */\n    error = mbstowcs_s (pReturnValue, wcstr, sizeInWords, mbstr, count);\n\n#else\n\n    /* Older Visual Studio or non-Microsoft compiler */\n    size_t n;\n\n    /* Convert to wide-character string (or count characters) */\n    n = mbstowcs (wcstr, mbstr, sizeInWords);\n    if (!wcstr  ||  n < count) {\n\n        /* Zero-terminate output buffer */\n        if (wcstr  &&  sizeInWords) {\n            if (n >= sizeInWords) {\n                n = sizeInWords - 1;\n            }\n            wcstr[n] = 0;\n        }\n\n        /* Length of resulting multi-byte string WITH zero terminator */\n        if (pReturnValue) {\n            *pReturnValue = n + 1;\n        }\n\n        /* Success */\n        error = 0;\n\n    } else {\n\n        /* Could not convert string */\n        error = 1;\n\n    }\n\n#endif\n    return error;\n}\n\n/* Convert wide-character string to multi-byte string */\nstatic int\ndirent_wcstombs_s(\n        size_t *pReturnValue,\n        char *mbstr,\n        size_t sizeInBytes, /* max size of mbstr */\n        const wchar_t *wcstr,\n        size_t count)\n{\n    int error;\n\n#if defined(_MSC_VER)  &&  _MSC_VER >= 1400\n\n    /* Microsoft Visual Studio 2005 or later */\n    error = wcstombs_s (pReturnValue, mbstr, sizeInBytes, wcstr, count);\n\n#else\n\n    /* Older Visual Studio or non-Microsoft compiler */\n    size_t n;\n\n    /* Convert to multi-byte string (or count the 
number of bytes needed) */\n    n = wcstombs (mbstr, wcstr, sizeInBytes);\n    if (!mbstr  ||  n < count) {\n\n        /* Zero-terminate output buffer */\n        if (mbstr  &&  sizeInBytes) {\n            if (n >= sizeInBytes) {\n                n = sizeInBytes - 1;\n            }\n            mbstr[n] = '\\0';\n        }\n\n        /* Length of resulting multi-byte string WITH zero terminator */\n        if (pReturnValue) {\n            *pReturnValue = n + 1;\n        }\n\n        /* Success */\n        error = 0;\n\n    } else {\n\n        /* Cannot convert string */\n        error = 1;\n\n    }\n\n#endif\n    return error;\n}\n\n/* Set errno variable */\nstatic void\ndirent_set_errno(\n        int error)\n{\n#if defined(_MSC_VER)  &&  _MSC_VER >= 1400\n\n    /* Microsoft Visual Studio 2005 and later */\n    _set_errno (error);\n\n#else\n\n    /* Non-Microsoft compiler or older Microsoft compiler */\n    errno = error;\n\n#endif\n}\n\n\n#ifdef __cplusplus\n}\n#endif\n#endif /*DIRENT_H*/\n\n"
  },
  {
    "path": "yolov9/yolov9_trt.py",
    "content": "\"\"\"\nAn example that uses TensorRT's Python api to make inferences.\n\"\"\"\nimport ctypes\nimport os\nimport shutil\nimport random\nimport sys\nimport threading\nimport time\nimport cv2\nimport numpy as np\nimport pycuda.autoinit  # noqa: F401\nimport pycuda.driver as cuda\nimport tensorrt as trt\n\nCONF_THRESH = 0.5\nIOU_THRESHOLD = 0.4\n\n\ndef get_img_path_batches(batch_size, img_dir):\n    ret = []\n    batch = []\n    for root, dirs, files in os.walk(img_dir):\n        for name in files:\n            if len(batch) == batch_size:\n                ret.append(batch)\n                batch = []\n            batch.append(os.path.join(root, name))\n    if len(batch) > 0:\n        ret.append(batch)\n    return ret\n\n\ndef plot_one_box(x, img, color=None, label=None, line_thickness=None):\n    \"\"\"\n    description: Plots one bounding box on image img,\n                 this function comes from yolov9 project.\n    param:\n        x:      a box likes [x1,y1,x2,y2]\n        img:    a opencv image object\n        color:  color to draw rectangle, such as (0,255,0)\n        label:  str\n        line_thickness: int\n    return:\n        no return\n\n    \"\"\"\n    tl = (\n            line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1\n    )  # line/font thickness\n    color = color or [random.randint(0, 255) for _ in range(3)]\n    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))\n    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)\n    if label:\n        tf = max(tl - 1, 1)  # font thickness\n        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]\n        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3\n        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)  # filled\n        cv2.putText(\n            img,\n            label,\n            (c1[0], c1[1] - 2),\n            0,\n            tl / 3,\n            [225, 255, 255],\n            thickness=tf,\n            
lineType=cv2.LINE_AA,\n        )\n\n\nclass yolov9TRT(object):\n    \"\"\"\n    description: A yolov9 class that warps TensorRT ops, preprocess and postprocess ops.\n    \"\"\"\n\n    def __init__(self, engine_file_path):\n        # Create a Context on this device,\n        self.ctx = cuda.Device(0).make_context()\n        stream = cuda.Stream()\n        TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n        runtime = trt.Runtime(TRT_LOGGER)\n\n        # Deserialize the engine from file\n        with open(engine_file_path, \"rb\") as f:\n            engine = runtime.deserialize_cuda_engine(f.read())\n        context = engine.create_execution_context()\n\n        host_inputs = []\n        cuda_inputs = []\n        host_outputs = []\n        cuda_outputs = []\n        bindings = []\n\n        for binding in engine:\n            print('bingding:', binding, engine.get_binding_shape(binding))\n            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size\n            dtype = trt.nptype(engine.get_binding_dtype(binding))\n            # Allocate host and device buffers\n            host_mem = cuda.pagelocked_empty(size, dtype)\n            cuda_mem = cuda.mem_alloc(host_mem.nbytes)\n            # Append the device buffer to device bindings.\n            bindings.append(int(cuda_mem))\n            # Append to the appropriate list.\n            if engine.binding_is_input(binding):\n                self.input_w = engine.get_binding_shape(binding)[-1]\n                self.input_h = engine.get_binding_shape(binding)[-2]\n                host_inputs.append(host_mem)\n                cuda_inputs.append(cuda_mem)\n            else:\n                host_outputs.append(host_mem)\n                cuda_outputs.append(cuda_mem)\n\n        # Store\n        self.stream = stream\n        self.context = context\n        self.engine = engine\n        self.host_inputs = host_inputs\n        self.cuda_inputs = cuda_inputs\n        self.host_outputs = host_outputs\n    
    self.cuda_outputs = cuda_outputs\n        self.bindings = bindings\n        self.batch_size = engine.max_batch_size\n\n    def infer(self, raw_image_generator):\n        threading.Thread.__init__(self)\n        # Make self the active context, pushing it on top of the context stack.\n        self.ctx.push()\n        # Restore\n        stream = self.stream\n        context = self.context\n        host_inputs = self.host_inputs\n        cuda_inputs = self.cuda_inputs\n        host_outputs = self.host_outputs\n        cuda_outputs = self.cuda_outputs\n        bindings = self.bindings\n        # Do image preprocess\n        batch_image_raw = []\n        batch_origin_h = []\n        batch_origin_w = []\n        batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])\n        for i, image_raw in enumerate(raw_image_generator):\n            input_image, image_raw, origin_h, origin_w = self.preprocess_image(image_raw)\n            batch_image_raw.append(image_raw)\n            batch_origin_h.append(origin_h)\n            batch_origin_w.append(origin_w)\n            np.copyto(batch_input_image[i], input_image)\n        batch_input_image = np.ascontiguousarray(batch_input_image)\n\n        # Copy input image to host buffer\n        np.copyto(host_inputs[0], batch_input_image.ravel())\n        start = time.time()\n        # Transfer input data  to the GPU.\n        cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)\n        # Run inference.\n        context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)\n        # Transfer predictions back from the GPU.\n        cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)\n        # Synchronize the stream\n        stream.synchronize()\n        end = time.time()\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n        # Here we use the first row of output in that batch_size = 1\n  
      output = host_outputs[0]\n        # Do postprocess\n        for i in range(self.batch_size):\n            result_boxes, result_scores, result_classid = self.post_process(\n                output[i * 38001: (i + 1) * 38001], batch_origin_h[i], batch_origin_w[i]\n            )\n            # Draw rectangles and labels on the original image\n            for j in range(len(result_boxes)):\n                box = result_boxes[j]\n                plot_one_box(\n                    box,\n                    batch_image_raw[i],\n                    label=\"{}:{:.2f}\".format(\n                        categories[int(result_classid[j])], result_scores[j]\n                    ),\n                )\n        return batch_image_raw, end - start\n\n    def destroy(self):\n        # Remove any context from the top of the context stack, deactivating it.\n        self.ctx.pop()\n\n    def get_raw_image(self, image_path_batch):\n        \"\"\"\n        description: Read an image from image path\n        \"\"\"\n        for img_path in image_path_batch:\n            yield cv2.imread(img_path)\n\n    def get_raw_image_zeros(self, image_path_batch=None):\n        \"\"\"\n        description: Ready data for warmup\n        \"\"\"\n        for _ in range(self.batch_size):\n            yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)\n\n    def preprocess_image(self, raw_bgr_image):\n        \"\"\"\n        description: Convert BGR image to RGB,\n                     resize and pad it to target size, normalize to [0,1],\n                     transform to NCHW format.\n        param:\n            input_image_path: str, image path\n        return:\n            image:  the processed image\n            image_raw: the original image\n            h: original height\n            w: original width\n        \"\"\"\n        image_raw = raw_bgr_image\n        h, w, c = image_raw.shape\n        image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)\n        # Calculate widht and height 
and paddings\n        r_w = self.input_w / w\n        r_h = self.input_h / h\n        if r_h > r_w:\n            tw = self.input_w\n            th = int(r_w * h)\n            tx1 = tx2 = 0\n            ty1 = int((self.input_h - th) / 2)\n            ty2 = self.input_h - th - ty1\n        else:\n            tw = int(r_h * w)\n            th = self.input_h\n            tx1 = int((self.input_w - tw) / 2)\n            tx2 = self.input_w - tw - tx1\n            ty1 = ty2 = 0\n        # Resize the image with long side while maintaining ratio\n        image = cv2.resize(image, (tw, th))\n        # Pad the short side with (128,128,128)\n        image = cv2.copyMakeBorder(\n            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, None, (128, 128, 128)\n        )\n        image = image.astype(np.float32)\n        # Normalize to [0,1]\n        image /= 255.0\n        # HWC to CHW format:\n        image = np.transpose(image, [2, 0, 1])\n        # CHW to NCHW format\n        image = np.expand_dims(image, axis=0)\n        # Convert the image to row-major order, also known as \"C order\":\n        image = np.ascontiguousarray(image)\n        return image, image_raw, h, w\n\n    def xywh2xyxy(self, origin_h, origin_w, x):\n        \"\"\"\n        description:    Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n        param:\n            origin_h:   height of original image\n            origin_w:   width of original image\n            x:          A boxes numpy, each row is a box [center_x, center_y, w, h]\n        return:\n            y:          A boxes numpy, each row is a box [x1, y1, x2, y2]\n        \"\"\"\n        y = np.zeros_like(x)\n        r_w = self.input_w / origin_w\n        r_h = self.input_h / origin_h\n        if r_h > r_w:\n            y[:, 0] = x[:, 0]\n            y[:, 2] = x[:, 2]\n            y[:, 1] = x[:, 1] - (self.input_h - r_w * origin_h) / 2\n            y[:, 3] = x[:, 3] - (self.input_h - r_w * origin_h) / 
2\n            y /= r_w\n        else:\n            y[:, 0] = x[:, 0] - (self.input_w - r_h * origin_w) / 2\n            y[:, 2] = x[:, 2] - (self.input_w - r_h * origin_w) / 2\n            y[:, 1] = x[:, 1]\n            y[:, 3] = x[:, 3]\n            y /= r_h\n\n        return y\n\n    def post_process(self, output, origin_h, origin_w):\n        \"\"\"\n        description: postprocess the prediction\n        param:\n            output:     A numpy array like [num_boxes,cx,cy,w,h,conf,cls_id, cx,cy,w,h,conf,cls_id, ...]\n            origin_h:   height of original image\n            origin_w:   width of original image\n        return:\n            result_boxes: final boxes, a numpy array in which each row is a box [x1, y1, x2, y2]\n            result_scores: final scores, a numpy array in which each element is the score corresponding to a box\n            result_classid: final class ids, a numpy array in which each element is the class id corresponding to a box\n        \"\"\"\n        # Get the number of boxes detected\n        num = int(output[0])\n        # Reshape to a two-dimensional ndarray\n        pred = np.reshape(output[1:], (-1, 38))[:num, :]\n        # Do NMS\n        boxes = self.non_max_suppression(pred, origin_h, origin_w, conf_thres=CONF_THRESH, nms_thres=IOU_THRESHOLD)\n        result_boxes = boxes[:, :4] if len(boxes) else np.array([])\n        result_scores = boxes[:, 4] if len(boxes) else np.array([])\n        result_classid = boxes[:, 5] if len(boxes) else np.array([])\n        return result_boxes, result_scores, result_classid\n\n    def bbox_iou(self, box1, box2, x1y1x2y2=True):\n        \"\"\"\n        description: compute the IoU of two bounding boxes\n        param:\n            box1: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            box2: A box coordinate (can be (x1, y1, x2, y2) or (x, y, w, h))\n            x1y1x2y2: select the coordinate format\n        return:\n            iou: computed iou\n        \"\"\"\n        if not x1y1x2y2:\n            # Transform from 
center and width to exact coordinates\n            b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2\n            b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2\n            b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2\n            b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2\n        else:\n            # Get the coordinates of bounding boxes\n            b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]\n            b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]\n\n        # Get the coordinates of the intersection rectangle\n        inter_rect_x1 = np.maximum(b1_x1, b2_x1)\n        inter_rect_y1 = np.maximum(b1_y1, b2_y1)\n        inter_rect_x2 = np.minimum(b1_x2, b2_x2)\n        inter_rect_y2 = np.minimum(b1_y2, b2_y2)\n        # Intersection area\n        inter_area = np.clip(inter_rect_x2 - inter_rect_x1 + 1, 0, None) * \\\n            np.clip(inter_rect_y2 - inter_rect_y1 + 1, 0, None)\n        # Union Area\n        b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)\n        b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)\n\n        iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)\n\n        return iou\n\n    def non_max_suppression(self, prediction, origin_h, origin_w, conf_thres=0.5, nms_thres=0.4):\n        \"\"\"\n        description: Removes detections whose confidence score is lower than 'conf_thres' and performs\n        Non-Maximum Suppression to further filter detections.\n        param:\n            prediction: detections, each row starting with (center_x, center_y, w, h, conf, cls_id)\n            origin_h: original image height\n            origin_w: original image width\n            conf_thres: a confidence threshold to filter detections\n            nms_thres: an IoU threshold to filter detections\n        return:\n            boxes: output after NMS, each row starting with (x1, y1, x2, y2, conf, cls_id)\n        
\"\"\"\n        # Get the boxes that score > CONF_THRESH\n        boxes = prediction[prediction[:, 4] >= conf_thres]\n        # Trandform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]\n        boxes[:, :4] = self.xywh2xyxy(origin_h, origin_w, boxes[:, :4])\n        # clip the coordinates\n        boxes[:, 0] = np.clip(boxes[:, 0], 0, origin_w - 1)\n        boxes[:, 2] = np.clip(boxes[:, 2], 0, origin_w - 1)\n        boxes[:, 1] = np.clip(boxes[:, 1], 0, origin_h - 1)\n        boxes[:, 3] = np.clip(boxes[:, 3], 0, origin_h - 1)\n        # Object confidence\n        confs = boxes[:, 4]\n        # Sort by the confs\n        boxes = boxes[np.argsort(-confs)]\n        # Perform non-maximum suppression\n        keep_boxes = []\n        while boxes.shape[0]:\n            large_overlap = self.bbox_iou(np.expand_dims(boxes[0, :4], 0), boxes[:, :4]) > nms_thres\n            label_match = boxes[0, -1] == boxes[:, -1]\n            # Indices of boxes with lower confidence scores, large IOUs and matching labels\n            invalid = large_overlap & label_match\n            keep_boxes += [boxes[0]]\n            boxes = boxes[~invalid]\n        boxes = np.stack(keep_boxes, 0) if len(keep_boxes) else np.array([])\n        return boxes\n\n\nclass inferThread(threading.Thread):\n    def __init__(self, yolov9_wrapper, image_path_batch):\n        threading.Thread.__init__(self)\n        self.yolov9_wrapper = yolov9_wrapper\n        self.image_path_batch = image_path_batch\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov9_wrapper.infer(self.yolov9_wrapper.get_raw_image(self.image_path_batch))\n        for i, img_path in enumerate(self.image_path_batch):\n            parent, filename = os.path.split(img_path)\n            save_name = os.path.join('output', filename)\n            # Save image\n            cv2.imwrite(save_name, batch_image_raw[i])\n        print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 
1000))\n\n\nclass warmUpThread(threading.Thread):\n    def __init__(self, yolov9_wrapper):\n        threading.Thread.__init__(self)\n        self.yolov9_wrapper = yolov9_wrapper\n\n    def run(self):\n        batch_image_raw, use_time = self.yolov9_wrapper.infer(self.yolov9_wrapper.get_raw_image_zeros())\n        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))\n\n\nif __name__ == \"__main__\":\n    # load custom plugin and engine\n    PLUGIN_LIBRARY = \"build/libmyplugins.so\"\n    engine_file_path = \"yolov9-c.engine\"\n\n    if len(sys.argv) > 1:\n        engine_file_path = sys.argv[1]\n    if len(sys.argv) > 2:\n        PLUGIN_LIBRARY = sys.argv[2]\n\n    ctypes.CDLL(PLUGIN_LIBRARY)\n\n    # load coco labels\n\n    categories = [\"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\",\n                  \"traffic light\",\n                  \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n                  \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\n                  \"frisbee\",\n                  \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\",\n                  \"surfboard\",\n                  \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n                  \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n                  \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\n                  \"cell phone\",\n                  \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\n                  \"teddy bear\",\n 
                 \"hair drier\", \"toothbrush\"]\n\n    if os.path.exists('output/'):\n        shutil.rmtree('output/')\n    os.makedirs('output/')\n    # a yolov9TRT instance\n    yolov9_wrapper = yolov9TRT(engine_file_path)\n    try:\n        print('batch size is', yolov9_wrapper.batch_size)\n\n        image_dir = \"images/\"\n        image_path_batches = get_img_path_batches(yolov9_wrapper.batch_size, image_dir)\n\n        for i in range(10):\n            # create a new thread to do warm_up\n            thread1 = warmUpThread(yolov9_wrapper)\n            thread1.start()\n            thread1.join()\n        for batch in image_path_batches:\n            # create a new thread to do inference\n            thread1 = inferThread(yolov9_wrapper, batch)\n            thread1.start()\n            thread1.join()\n    finally:\n        # destroy the instance\n        yolov9_wrapper.destroy()\n"
  }
]